Commit graph

918 commits

Author SHA1 Message Date
Robert Nishihara
3b7788bf88 Disallow calling ray.put on an object ID. (#353) 2017-03-11 12:09:28 -08:00
Richard Liaw
b463d9e5c7 Initial A3C Example - PongDeterministic-v3 (#331)
* Initializing A3C code

* Modifications for Ray usage

* cleanup

* removing universe dependency

* fixes (not yet working)

* hack

* documentation

* Cleanup

* Preliminary Portion

Make sure to change when merging

* RL part

* Cleaning up Driver and Worker code

* Updating driver code

* instructions...

* fixed

* Minor changes.

* Fixing cmake issues

* ray instruction

* updating port to new universe

* Fix for env.configure

* redundant commands

* Revert scipy.misc -> cv2 and raise exception for wrong gym version.
2017-03-11 00:57:53 -08:00
Robert Nishihara
53dffe0bf2 Use flatbuffers for some messages from Redis. (#341)
* Compile the Ray redis module with C++.

* Redo parsing of object table notifications with flatbuffers.

* Update redis module python tests.

* Redo parsing of task table notifications with flatbuffers.

* Fix linting.

* Redo parsing of db client notifications with flatbuffers.

* Redo publishing of local scheduler heartbeats with flatbuffers.

* Fix linting.

* Remove usage of fixed-width formatting of scheduling state in channel name.

* Reply with flatbuffer object to task table queries, also simplify redis string to flatbuffer string conversion.

* Fix linting and tests.

* fix

* cleanup

* simplify logic in ReplyWithTask
2017-03-10 18:35:25 -08:00
Philipp Moritz
555dcf35a2 Add policy gradient example. (#344)
* add policy gradient example

* fix typos

* Minor changes plus some documentation.

* Minor fixes.
2017-03-07 23:42:44 -08:00
Philipp Moritz
0de57be085 upgrade flatbuffers to 1.6.0 (#345) 2017-03-07 21:33:46 -08:00
Robert Nishihara
d001a50644 Add link to the code for the resnet example. (#343) 2017-03-07 13:14:00 -08:00
Wapaul1
c66178bcd7 Resnet Adapted to Ray (#229)
* Initial conversion

* Further changes

* fixes

* some changes

* Fixes

* Added data pipeline

* Added updates to cifar

* Currently broken, need separate PR

* Added test for retrieving variables from an optimizer

* Removed FLAG ref in environment variables

* Added comments to test

* Addressed comments

* Added updates

* Made further changes for tfutils

* Fixed finalized bug

* Removed ipython

* Added accuracy printing

* Temp commit

* added fixes

* changes

* Added writing to file

* Fixes for gpus

* Cleaned up code

* Temp commit

* Gpu support fully implemented

* Updated to use num_gpus for actors

* Finished testing gpus implementation

* Changed to be more in line with origin implementation

* Updated test to use actors

* Added support for cpu only systems

* Now works with no cpus

* Minor changes and some documentation.
2017-03-07 01:07:32 -08:00
Stephanie Wang
da06b4db82 Warn the user when a nondeterministic task is detected. (#339)
* WARN instead of FATAL for object hash mismatches, push error to driver

* Document the callback signature for object_table_add/remove

* Error table

* Wait for all errors in python test

* Fix doc

* Fix state test
2017-03-07 00:32:15 -08:00
Philipp Moritz
0b8d279ef2 Convert task_spec to flatbuffers (#255)
* convert Ray to C++

* convert task_spec to flatbuffers

* fix

* it compiles

* latest

* tests are passing

* task2 -> task

* fix

* fix

* fix

* fix

* fix

* linting

* fix valgrind

* upgrade flatbuffers

* use debug mode for valgrind

* fix naming and comments

* downgrade flatbuffers

* fix linting

* reintroduce TaskSpec_free

* rename TaskSpec -> TaskInfo

* refactoring

* linting
2017-03-05 02:05:02 -08:00
Robert Nishihara
65a8659f3d Some plasma manager transfer optimizations. (#334)
* Change transfer queue to doubly-linked list to speed up append.

* Maintain set of pending transfers to make deduplication easy.

* Fix naming convention for structs in plasma manager.
2017-03-04 23:15:17 -08:00
Robert Nishihara
95bf81aeb8 Add actor tutorial. (#335) 2017-03-04 23:06:02 -08:00
Robert Nishihara
a7ddac6fb1 Properly mock ray submodules when building documentation. (#337) 2017-03-04 23:02:56 -08:00
Robert Nishihara
0a233b7144 Update hyperparameter optimization example. (#332)
* Update hyperparameter optimization example.

* Remove early stopping.
2017-03-04 10:45:15 -08:00
Stephanie Wang
41b8675d04 Availability after local scheduler failure (#329)
* Clean up plasma subscribers on EPIPE

First pass at a monitoring script - monitor can detect local scheduler death

Clean up task table upon local scheduler death in monitoring script

Don't schedule to dead local schedulers in global scheduler

Have global scheduler update the db clients table, monitor script cleans up state

Documentation

Monitor script should scan tables before beginning to read from subscription channel

Fix for python3

Redirect monitor output to redis logs, fix hanging in multinode tests

* Publish auxiliary addresses as part of db_client deletion notifications

* Fix test case?

* Small changes.

* Use SCAN instead of KEYS

* Address comments

* Address more comments

* Free redis module strings
2017-03-02 19:51:20 -08:00
Alexey Tumanov
4f9e74469e Fix segfault induced by getting more than 200k objects (#333)
* [RAY-567]: allocate memory on the heap for large gets

* linting
2017-03-02 01:35:10 -08:00
Robert Nishihara
6a4bde54dc Only install ray python packages. (#330)
* Only install ray python packages.

* Add some __init__.py files.

* Install Ray before building documentation.

* Fix install-ray.sh.

* Fix.
2017-03-01 23:34:44 -08:00
Robert Nishihara
39b7abefc5 Fix test failures in actor_test.py. (#317) 2017-03-01 23:26:39 -08:00
Philipp Moritz
793a102846 Make Ray code C++ compatible (#321)
* convert Ray to C++

* const correctness
2017-03-01 01:17:24 -08:00
Johann Schleier-Smith
ad4b03bf7f Docker Updates (#308)
* new path for python build

* add flag

* build tar using git archive

* no exit from start_ray.sh

* update Docker instructions

* update build docker script

* add git revision

* fix typo

* bug fixes and clarifications

* mend

* add objectmanager ports to docker instructions

* rewording

* Small updates to documentation.
2017-02-28 18:57:51 -08:00
Alexey Tumanov
b91d9cba45 Adding flatbuffers and migrating flatcc to flatbuffers for plasma (#325)
* adding flatbuffers and migrating flatcc to flatbuffers for plasma

* variable name changes in plasma_protocol and plasma flatbuffers schema

* quick fix

* cleanups and remove flatcc

* more cleanup

* add doc

* linting

* fix linting

* fix mac os x build

* linting

* cleanup

* c++ fix for plasma flatbuffers

* Remove flatcc from CMakeLists.txt.

* linting; trigger travis
2017-02-28 18:47:40 -08:00
Robert Nishihara
1a997ed279 Move documentation to ReadTheDocs. (#326) 2017-02-27 21:14:31 -08:00
Robert Nishihara
1ae7e7d29e Rename photon -> local scheduler. (#322) 2017-02-27 12:24:07 -08:00
Philipp Moritz
a30eed452e Change type naming convention. (#315)
* Rename object_id -> ObjectID.

* Rename ray_logger -> RayLogger.

* rename task_id -> TaskID, actor_id -> ActorID, function_id -> FunctionID

* Rename plasma_store_info -> PlasmaStoreInfo.

* Rename plasma_store_state -> PlasmaStoreState.

* Rename plasma_object -> PlasmaObject.

* Rename object_request -> ObjectRequests.

* Rename eviction_state -> EvictionState.

* Bug fix.

* rename db_handle -> DBHandle

* Rename local_scheduler_state -> LocalSchedulerState.

* rename db_client_id -> DBClientID

* rename task -> Task

* make redis.c C++ compatible

* Rename scheduling_algorithm_state -> SchedulingAlgorithmState.

* Rename plasma_connection -> PlasmaConnection.

* Rename client_connection -> ClientConnection.

* Fixes from rebase.

* Rename local_scheduler_client -> LocalSchedulerClient.

* Rename object_buffer -> ObjectBuffer.

* Rename client -> Client.

* Rename notification_queue -> NotificationQueue.

* Rename object_get_requests -> ObjectGetRequests.

* Rename get_request -> GetRequest.

* Rename object_info -> ObjectInfo.

* Rename scheduler_object_info -> SchedulerObjectInfo.

* Rename local_scheduler -> LocalScheduler and some fixes.

* Rename local_scheduler_info -> LocalSchedulerInfo.

* Rename global_scheduler_state -> GlobalSchedulerState.

* Rename global_scheduler_policy_state -> GlobalSchedulerPolicyState.

* Rename object_size_entry -> ObjectSizeEntry.

* Rename aux_address_entry -> AuxAddressEntry.

* Rename various ID helper methods.

* Rename Task helper methods.

* Rename db_client_cache_entry -> DBClientCacheEntry.

* Rename local_actor_info -> LocalActorInfo.

* Rename actor_info -> ActorInfo.

* Rename retry_info -> RetryInfo.

* Rename actor_notification_table_subscribe_data -> ActorNotificationTableSubscribeData.

* Rename local_scheduler_table_send_info_data -> LocalSchedulerTableSendInfoData.

* Rename table_callback_data -> TableCallbackData.

* Rename object_info_subscribe_data -> ObjectInfoSubscribeData.

* Rename local_scheduler_table_subscribe_data -> LocalSchedulerTableSubscribeData.

* Rename more redis call data structures.

* Rename photon_conn -> PhotonConnection.

* Rename photon_mock -> PhotonMock.

* Fix formatting errors.
2017-02-26 00:32:43 -08:00
Stephanie Wang
be1618f041 Availability after worker failure (#316)
* Availability after a killed worker

* Workers exit cleanly

* Memory cleanup in photon C tests

* Worker failure in multinode

* Consolidate worker cleanup handlers

* Update the result table before handling a task submission

* KILL_WORKER_TIMEOUT -> KILL_WORKER_TIMEOUT_MILLISECONDS

* Log a warning instead of crashing if no result table entry found
2017-02-25 20:19:36 -08:00
Robert Nishihara
232601f90d Change all table calls to use default retry behavior. (#312)
* Change all table calls to use default retry behavior and change default retry behavior.

* Add warning for table retries.
2017-02-24 12:41:32 -08:00
Robert Nishihara
aa174e6311 Fix global scheduler test failure. (#314) 2017-02-24 11:05:45 -08:00
Robert Nishihara
7f5be96683 Remove object table tests that are failing. (#310) 2017-02-23 13:39:59 -08:00
Alexey Tumanov
3159a78ad7 terminate photon task dispatch early when workers or resources are unavailable (#311)
* terminate photon task dispatch early when no workers or resources available

* style
2017-02-23 00:05:16 -08:00
Robert Nishihara
54238c4ad0 Propagate errors from importing actors. (#309)
* Propagate errors from importing actors.

* Fix bug.
2017-02-22 15:15:45 -08:00
Robert Nishihara
a6bf16f6a9 Make global scheduler periodically resubmit tasks that can't be sched… (#306)
* Make global scheduler periodically resubmit tasks that can't be scheduled because their resource requirements are not met.

* Address comments and fix bug.

* Rename impossible_tasks -> pending_tasks.

* Fix formatting.
2017-02-21 23:15:46 -08:00
Robert Nishihara
e399f57e6b Let actors use GPUs. (#302)
* Add num_cpus and num_gpus to actor decorator.

* Assign GPU IDs to actors.

* Add additional actor test.

* Remove duplicated line.

* Factor out local scheduler selection method.

* Add test and simplify local scheduler selection.
2017-02-21 01:13:04 -08:00
Robert Nishihara
3e67d28922 Address numbuf compiler warnings. (#300) 2017-02-20 22:42:03 -08:00
Stephanie Wang
334aed9fa9 Fetch the object after requesting reconstruction during ray.get (#301)
* Fetch the object after requesting reconstruction during ray.get

* revert

* Fix documentation and memory leak

* Fix hanging reconstruction bug

* Fix for python3
2017-02-20 21:41:34 -08:00
Robert Nishihara
2220a33b62 In UI, add timing information for tasks and show cluster scheduling. (#297)
* In UI, add timing information for tasks and show cluster scheduling.

* Factor out html generation as function.
2017-02-19 15:12:08 -08:00
Robert Nishihara
124baa7472 Fix bug in redis module tests. (#292)
* Fix bug in redis module tests.

* Sleep while waiting for next message.
2017-02-18 00:55:57 -08:00
Robert Nishihara
abd9987e3b Fix unreliable actor test. (#295) 2017-02-18 00:51:08 -08:00
Stephanie Wang
67c591c33b Retry connections in photon connect, consolidate code in io.c (#294) 2017-02-17 23:41:21 -08:00
Philipp Moritz
9973a6e37c fix bug in numbuf serialization (#296) 2017-02-17 23:35:41 -08:00
Stephanie Wang
a0dd3a44c0 Dynamically grow worker pool to partially solve hanging workloads (#286)
* First pass at a policy to solve deadlock

* Address Robert's comments

* stress test

* unit test

* Fix test cases

* Fix test for python3

* add more logging

* White space.
2017-02-17 17:08:52 -08:00
Robert Nishihara
0bbf08a4ac Fix test_illegal_put failure in plasma test. (#289)
* Fix test_illegal_put failure in plasma test.

* Check that exactly one plasma manager has died.
2017-02-17 11:06:25 -08:00
Johann Schleier-Smith
c9bc488ee0 Redirect process output to log files (#267)
* redirect process output to log files

* formatting fixes

* Generate all log files in start_ray_processes.

* Fix bug.
2017-02-16 20:34:45 -08:00
Philipp Moritz
dd7e8d9105 Avoid segfaults in arrow if data is too large (#287)
* arrow limits

* more logging

* set the right limit

* update

* simplify

* fix

* account for subsequences

* fixes and deactivate arrow limit tests in travis

* fixes

* Minor formatting.

* Add a couple more tests.
2017-02-16 15:16:20 -08:00
Robert Nishihara
88a5b4e77b Simplify imports and exports and provide driver isolation for remote functions. (#288)
* Remove import counter and export counter.

* Provide isolation between drivers for remote functions.

* Add test for driver function isolation.

* Hash source code into function ID to reduce likelihood of collisions.

* Fix failure test example.

* Replace assertTrue with assertIn to improve failure messages in tests.

* Fix failure test.
2017-02-16 11:30:35 -08:00
Wapaul1
883f945db4 Updated tfutils to use new op naming (#284)
* Updated tfutils to use new op naming

* Reverted tensorflow to 0.12.0
2017-02-15 17:47:53 -08:00
Philipp Moritz
12a68e84d2 Implement a first pass at actors in the API. (#242)
* Implement actor field for tasks

* Implement actor management in local scheduler.

* initial python frontend for actors

* import actors on worker

* IPython code completion and tests

* prepare creating actors through local schedulers

* add actor id to PyTask

* submit actor calls to local scheduler

* starting to integrate

* simple fix

* Fixes from rebasing.

* more work on python actors

* Improve local scheduler actor handlers.

* Pass actor ID to local scheduler when connecting a client.

* first working version of actors

* fixing actors

* fix creating two copies of the same actor

* fix actors

* remove sleep

* get rid of export synchronization

* update

* insert actor methods into the queue in the right order

* remove print statements

* make it compile again after rebase

* Minor updates.

* fix python actor ids

* Pass actor_id to start_worker.

* add test

* Minor changes.

* Update actor tests.

* Temporary plan for import counter.

* Temporarily fix import counters.

* Fix some tests.

* Fixes.

* Make actor creation non-blocking.

* Fix test?

* Fix actors on Python 2.

* fix rare case.

* Fix python 2 test.

* More tests.

* Small fixes.

* Linting.

* Revert tensorflow version to 0.12.0 temporarily.

* Small fix.

* Enhance inheritance test.
2017-02-15 00:10:05 -08:00
Robert Nishihara
072eadd57f Pipe num_cpus and num_gpus through from start_ray.py. (#275)
* Pipe num_cpus and num_gpus through from start_ray.py.

* Improve load balancing tests.

* Fix bug.

* Factor out some testing code.
2017-02-13 17:43:23 -08:00
Robert Nishihara
3934d5f6eb Remove old files and remove old documentation for copying files around cluster. (#274) 2017-02-13 11:20:04 -08:00
Robert Nishihara
cb7f6ca9b5 Attempt to start web UI when starting Ray. (#269)
* Attempt to start web UI when starting Ray.

* Add instructions for using web UI to cluster documentation.

* Don't check if port 8080 is open.

* Remove print statement.
2017-02-12 15:17:58 -08:00
Robert Nishihara
f6ce9dfa6c Allow start_ray.sh to take an object manager port. (#272)
* Allow start_ray.sh to take an object manager port.

* Fix typo and add test.

* Small cleanups.
2017-02-12 12:39:32 -08:00
Johann Schleier-Smith
7bf80b6b22 bug fix on printing exception traceback (#268) 2017-02-10 21:05:05 -08:00