847 commits

Author SHA1 Message Date
Robert Nishihara
320109a5bd By default, start a number of workers equal to the number of CPUs. (#430)
* By default, start a number of workers equal to the number of CPUs.

* Fix stress tests.
2017-04-06 00:02:58 -07:00
Robert Nishihara
fa363a5a3a Notify driver when a worker dies while executing a task. (#419)
* Notify driver when a worker dies while executing a task.

* Fix linting.

* Don't push error when local scheduler is cleaning up.
2017-04-06 00:02:39 -07:00
Robert Nishihara
85b373a4be Suppress warning in start_ray.sh about leaving child processes running when parent exits. (#429) 2017-04-05 23:54:22 -07:00
Stephanie Wang
93679df724 Stopped nodes can rejoin immediately (#428)
* Ignore deleted clients when reading address info from Redis

* Remove self from db_client table when exiting cleanly

* Fix valgrind test

* Do not call plasma_perform_release when disconnecting
2017-04-05 23:50:38 -07:00
Philipp Moritz
4043769ba2 Make putting large objects work. (#411)
* putting large objects

* add more checks

* support large objects

* fix test

* fix linting

* upgrade to latest arrow version

* check malloc return code

* print mmap file sizes

* printing

* revert to dlmalloc

* add prints

* more prints

* add printing

* printing

* fix

* update

* fix

* update

* print

* initialization

* temp

* fix

* update

* fix linting

* comment out object_store_full tests

* fix test

* fix test

* evict objects if dlmalloc fails

* fix stresstests

* Fix linting.

* Uncomment large-memory tests.

* Increase memory for docker image for jenkins tests.

* Reduce large memory tests.

* Further reduce large memory tests.
2017-04-05 01:04:05 -07:00
Robert Nishihara
1e84747e13 Remove incorrect check. (#421) 2017-04-03 14:51:53 -07:00
Richard Shin
227c916c25 Convert plasma/plasma_store.cc to use STL (#324)
* Change plasma_store.c to C++ (clobbering existing FlatBuffers usage).

* Convert plasma_store.cc to use STL (with a caveat)

* Fix CMakeLists and mutation-while-iterating problem

* Remove extra extern "C" declarations

* Remove redundant -std=c++11 from plasma/CMakeLists.txt
2017-03-31 22:58:10 -07:00
Robert Nishihara
f1b48f2fd4 Avoid publishing in the task table unnecessarily. (#416) 2017-03-30 13:41:32 -07:00
Stephanie Wang
036b873bf2 Implement local scheduler task queues using C++ data structures (#392)
* Switch to using C++ lists for task queues

* Init and free methods for TaskQueueEntry

* Switch from utarray to c++ vector for TaskQueueEntry

* Get rid of some pointers

* Back to O(1) deletion from waiting_task_queue

* Fix comments

* Cut code

* Non const iterators

* Fix Alexey's comments
2017-03-30 00:40:01 -07:00
Robert Nishihara
8245758ccb Add overview of internals to documentation, improve serialization doc… (#390)
* Add overview of internals to documentation, improve serialization documentation.

* Doc edits

* Small fixes.
2017-03-27 21:52:17 -07:00
Alexey Tumanov
78e1167a42 Parallelize make in build.sh. (#371)
* parallelize build.sh make

* Encode in cmake the dependency of ray_redis_module on autogenerated flatbuffer files.
2017-03-27 20:55:50 -07:00
Robert Nishihara
0925e11c48 Exclude function source from function ID hash in Python interpreter. (#395)
* Exclude function source code from function ID hash in Python interpreter.

* Remove try except block.
2017-03-25 11:31:21 -07:00
Robert Nishihara
054a046b69 Fix installation instructions on Ubuntu and convert md -> rst. (#389) 2017-03-24 17:33:26 -07:00
Alexey Tumanov
a3d58607bf parallelize numbuf memcpy and plasma object hash construction (#366)
* parallelizing memcopy and object hash construction in numbuf/plasma

* clang format

* whitespace

* refactoring compute object hash: get rid of the prefix chunk

* clang format

* Document performance optimization.

* Remove check for 64-byte alignment, since it may not be guaranteed.
2017-03-21 16:17:35 -07:00
Robert Nishihara
ba02fc0eb0 Run flake8 in Travis and make code PEP8 compliant. (#387) 2017-03-21 12:57:54 -07:00
Stephanie Wang
083e7a28ad Push an error to the driver when the workload hangs on ray.put reconstruction (#382)
* Fix worker blocked bug

* tmp

* Push an error to the driver on ray.put for non-driver tasks

* Fix result table tests

* Fix test, logging

* Address comments

* Fix suppression bug

* Fix redis module test

* Edit error message

* Get values in chunks during reconstruction

* Test case for driver ray.put errors

* Error for evicting ray.put objects from the driver

* Fix tests

* Reduce verbosity

* Documentation
2017-03-21 00:16:48 -07:00
Philipp Moritz
4618fd45b1 Port Ray to latest Arrow version (#370)
* rebase on top of latest arrow

* clang-format

* address comments

* fix
2017-03-20 16:31:46 -07:00
Johann Schleier-Smith
29c8471fd4 Add multinode tests by simulating multiple nodes using Docker. (#378)
* run test workloads for a Docker cluster

* better manage docker image versions

* Changes to make multinode docker tests work with Python 3.

* option to mount local test directory on head node to speed development

* Attempt to simplify multinode test setup.

* Small change.

* Add in development-mode to run multinode docker tests more easily during development.

* add jenkins test script that links to Docker hash

* Read docker SHA from build_docker.sh and add test that should fail.

* Consolidate implementations and remove duplicate files.

* Allow test to retry if it fails to schedule on all nodes.

* Remove sleep when in docker multinode tests.
2017-03-18 23:44:54 -07:00
Wapaul1
6d9820ef5d Added tensorboard to resnet (#374)
Added tensorboard to resnet example.
2017-03-17 18:36:23 -07:00
Stephanie Wang
12c9618c0c Plasma and worker node failure. (#373)
* Failing test case

* Local scheduler exits cleanly after plasma store dies

* Tolerate one plasma store failure

* Tolerate plasma store failures on all nodes except head node

* Plasma manager heartbeats

* Component failure tests

* Don't run the helper for Python testing

* Fix C test

* Fix hanging plasma transfer test

* Fix python3

* Consolidate ClientConnection code

* Fix valgrind test

* fix c test

* We can restart worker nodes!

* Fix flatbuffers bug

* Address comments

* Only register actual workers with the local scheduler

* Fix bug

* Fix segfaults

* Add test case that tests for driver liveness, fix local scheduler bug

* Clean up after tests

* Allocate retry info on the stack

* Send SIGKILL before waiting

* Relax unit test conditions

* Driver liveness test case and documentation
2017-03-17 17:03:58 -07:00
Robert Nishihara
964d5cac48 Expand API documentation. (#375)
* Expand API documentation and convert tutorial to rst.

* Fix formatting error in tutorial.

* Address William's comments.

* Address Stephanie's comments.
2017-03-17 16:48:25 -07:00
Robert Nishihara
6b1e8caf2d Reduce stress_test verbosity. (#377) 2017-03-16 20:10:56 -07:00
Robert Nishihara
f1d4dda8cb Put all log files in redis and visualize them in UI. (#350)
* Start process for monitoring log files and push changes to redis.

* Display log files in UI.

* Bug fix for recent tasks.

* Use flatbuffers to parse local scheduler heartbeats.
2017-03-16 15:27:00 -07:00
Robert Nishihara
3333e1d6b9 Fix bug in parsing of tasks in monitor. (#372) 2017-03-15 20:32:23 -07:00
Philipp Moritz
068429ffd8 Convert local scheduler messages to flatbuffers (#340)
* use flatbuffer messages for local scheduler

* make sure constructor gets called for C++ object ObjectInfoT

* fix typo

* fix Robert's comments

* Small change to actor test.

* fix valgrind error

* linting

* free notification

* fix

* valgrind

* fix valgrind

* fix other bugs

* valgrind fix

* fixes

* more fixes

* Small changes to comments.
2017-03-15 16:27:52 -07:00
Philipp Moritz
4af0aa6258 Atari on pixels (#364)
* pong on pixels working (not cleaned up)

* make training compatible with all atari games

* cartpole runs

* Update documentation and usage for policy gradients.
2017-03-14 13:31:29 -07:00
Robert Nishihara
99583f5b08 Clean up rl_pong example. (#365)
* Clean up RL pong example.

* More troubleshooting instructions.

* Typo.

* Fix typo.
2017-03-11 21:16:36 -08:00
Richard Liaw
ced13ca5b1 Error Messages - UI display (#360)
* backend ui work

* install block

* easy mvp

* Small changes.
2017-03-11 18:43:06 -08:00
Wapaul1
b1cb48159a Examples updated with actors. (#358)
* Updated examples with actors

* Small changes, and convert documentation from MD to RST.
2017-03-11 15:30:31 -08:00
Robert Nishihara
3b7788bf88 Disallow calling ray.put on an object ID. (#353) 2017-03-11 12:09:28 -08:00
Richard Liaw
b463d9e5c7 Initial A3C Example - PongDeterministic-v3 (#331)
* Initializing A3C code

* Modifications for Ray usage

* cleanup

* removing universe dependency

* fixes (not yet working)

* hack

* documentation

* Cleanup

* Preliminary Portion

Make sure to change when merging

* RL part

* Cleaning up Driver and Worker code

* Updating driver code

* instructions...

* fixed

* Minor changes.

* Fixing cmake issues

* ray instruction

* updating port to new universe

* Fix for env.configure

* redundant commands

* Revert scipy.misc -> cv2 and raise exception for wrong gym version.
2017-03-11 00:57:53 -08:00
Robert Nishihara
53dffe0bf2 Use flatbuffers for some messages from Redis. (#341)
* Compile the Ray redis module with C++.

* Redo parsing of object table notifications with flatbuffers.

* Update redis module python tests.

* Redo parsing of task table notifications with flatbuffers.

* Fix linting.

* Redo parsing of db client notifications with flatbuffers.

* Redo publishing of local scheduler heartbeats with flatbuffers.

* Fix linting.

* Remove usage of fixed-width formatting of scheduling state in channel name.

* Reply with flatbuffer object to task table queries, also simplify redis string to flatbuffer string conversion.

* Fix linting and tests.

* fix

* cleanup

* simplify logic in ReplyWithTask
2017-03-10 18:35:25 -08:00
Philipp Moritz
555dcf35a2 Add policy gradient example. (#344)
* add policy gradient example

* fix typos

* Minor changes plus some documentation.

* Minor fixes.
2017-03-07 23:42:44 -08:00
Philipp Moritz
0de57be085 upgrade flatbuffers to 1.6.0 (#345) 2017-03-07 21:33:46 -08:00
Robert Nishihara
d001a50644 Add link to the code for the resnet example. (#343) 2017-03-07 13:14:00 -08:00
Wapaul1
c66178bcd7 Resnet Adapted to Ray (#229)
* Initial conversion

* Further changes

* fixes

* some changes

* Fixes

* Added data pipeline

* Added updates to cifar

* Currently broken need sep pr

* Added test for retrieving variables from an optimizer

* Removed FlAG ref in environment variables

* Added comments to test

* Addressed comments

* Added updates

* Made further changes for tfutils

* Fixed finalized bug

* Removed ipython

* Added accuracy printing

* Temp commit

* added fixes

* changes

* Added writing to file

* Fixes for gpus

* Cleaned up code

* Temp commit

* Gpu support fully implemented

* Updated to use num_gpus for actors

* Finished testing gpus implementation

* Changed to be more in line with origin implementation

* Updated test to use actors

* Added support for cpu only systems

* Now works with no cpus

* Minor changes and some documentation.
2017-03-07 01:07:32 -08:00
Stephanie Wang
da06b4db82 Warn the user when a nondeterministic task is detected. (#339)
* WARN instead of FATAL for object hash mismatches, push error to driver

* Document the callback signature for object_table_add/remove

* Error table

* Wait for all errors in python test

* Fix doc

* Fix state test
2017-03-07 00:32:15 -08:00
Philipp Moritz
0b8d279ef2 Convert task_spec to flatbuffers (#255)
* convert Ray to C++

* convert task_spec to flatbuffers

* fix

* it compiles

* latest

* tests are passing

* task2 -> task

* fix

* fix

* fix

* fix

* fix

* linting

* fix valgrind

* upgrade flatbuffers

* use debug mode for valgrind

* fix naming and comments

* downgrade flatbuffers

* fix linting

* reintroduce TaskSpec_free

* rename TaskSpec -> TaskInfo

* refactoring

* linting
2017-03-05 02:05:02 -08:00
Robert Nishihara
65a8659f3d Some plasma manager transfer optimizations. (#334)
* Change transfer queue to doubly-linked list to speed up append.

* Maintain set of pending transfers to make deduplication easy.

* Fix naming convention for structs in plasma manager.
2017-03-04 23:15:17 -08:00
Robert Nishihara
95bf81aeb8 Add actor tutorial. (#335) 2017-03-04 23:06:02 -08:00
Robert Nishihara
a7ddac6fb1 Properly mock ray submodules when building documentation. (#337) 2017-03-04 23:02:56 -08:00
Robert Nishihara
0a233b7144 Update hyperparameter optimization example. (#332)
* Update hyperparameter optimization example.

* Remove early stopping.
2017-03-04 10:45:15 -08:00
Stephanie Wang
41b8675d04 Availability after local scheduler failure (#329)
* Clean up plasma subscribers on EPIPE

First pass at a monitoring script - monitor can detect local scheduler death

Clean up task table upon local scheduler death in monitoring script

Don't schedule to dead local schedulers in global scheduler

Have global scheduler update the db clients table, monitor script cleans up state

Documentation

Monitor script should scan tables before beginning to read from subscription channel

Fix for python3

Redirect monitor output to redis logs, fix hanging in multinode tests

* Publish auxiliary addresses as part of db_client deletion notifications

* Fix test case?

* Small changes.

* Use SCAN instead of KEYS

* Address comments

* Address more comments

* Free redis module strings
2017-03-02 19:51:20 -08:00
Alexey Tumanov
4f9e74469e Fix segfault induced by getting more than 200k objects (#333)
* [RAY-567]: allocate memory on the heap for large gets

* linting
2017-03-02 01:35:10 -08:00
Robert Nishihara
6a4bde54dc Only install ray python packages. (#330)
* Only install ray python packages.

* Add some __init__.py files.

* Install Ray before building documentation.

* Fix install-ray.sh.

* Fix.
2017-03-01 23:34:44 -08:00
Robert Nishihara
39b7abefc5 Fix test failures in actor_test.py. (#317) 2017-03-01 23:26:39 -08:00
Philipp Moritz
793a102846 Make Ray code C++ compatible (#321)
* convert Ray to C++
* const correctness
2017-03-01 01:17:24 -08:00
Johann Schleier-Smith
ad4b03bf7f Docker Updates (#308)
* new path for python build

* add flag

* build tar using git archive

* no exit from start_ray.sh

* update Docker instructions

* update build docker script

* add git revision

* fix typo

* bug fixes and clarifications

* mend

* add objectmanager ports to docker instructions

* rewording

* Small updates to documentation.
2017-02-28 18:57:51 -08:00
Alexey Tumanov
b91d9cba45 Adding flatbuffers and migrating flatcc to flatbuffers for plasma (#325)
* adding flatbuffers and migrating flatcc to flatbuffers for plasma

* variable name changes in plasma_protocol and plasma flatbuffers schema

* quick fix

* cleanups and remove flatcc

* more cleanup

* add doc

* linting

* fix linting

* fix mac os x build

* linting

* cleanup

* c++ fix for plasma flatbuffers

* Remove flatcc from CMakeLists.txt.

* linting; trigger travis
2017-02-28 18:47:40 -08:00
Robert Nishihara
1a997ed279 Move documentation to ReadTheDocs. (#326) 2017-02-27 21:14:31 -08:00