Commit graph

797 commits

Author SHA1 Message Date
Robert Nishihara
1ae7e7d29e Rename photon -> local scheduler. (#322) 2017-02-27 12:24:07 -08:00
Philipp Moritz
a30eed452e Change type naming convention. (#315)
* Rename object_id -> ObjectID.

* Rename ray_logger -> RayLogger.

* rename task_id -> TaskID, actor_id -> ActorID, function_id -> FunctionID

* Rename plasma_store_info -> PlasmaStoreInfo.

* Rename plasma_store_state -> PlasmaStoreState.

* Rename plasma_object -> PlasmaObject.

* Rename object_request -> ObjectRequests.

* Rename eviction_state -> EvictionState.

* Bug fix.

* rename db_handle -> DBHandle

* Rename local_scheduler_state -> LocalSchedulerState.

* rename db_client_id -> DBClientID

* rename task -> Task

* make redis.c C++ compatible

* Rename scheduling_algorithm_state -> SchedulingAlgorithmState.

* Rename plasma_connection -> PlasmaConnection.

* Rename client_connection -> ClientConnection.

* Fixes from rebase.

* Rename local_scheduler_client -> LocalSchedulerClient.

* Rename object_buffer -> ObjectBuffer.

* Rename client -> Client.

* Rename notification_queue -> NotificationQueue.

* Rename object_get_requests -> ObjectGetRequests.

* Rename get_request -> GetRequest.

* Rename object_info -> ObjectInfo.

* Rename scheduler_object_info -> SchedulerObjectInfo.

* Rename local_scheduler -> LocalScheduler and some fixes.

* Rename local_scheduler_info -> LocalSchedulerInfo.

* Rename global_scheduler_state -> GlobalSchedulerState.

* Rename global_scheduler_policy_state -> GlobalSchedulerPolicyState.

* Rename object_size_entry -> ObjectSizeEntry.

* Rename aux_address_entry -> AuxAddressEntry.

* Rename various ID helper methods.

* Rename Task helper methods.

* Rename db_client_cache_entry -> DBClientCacheEntry.

* Rename local_actor_info -> LocalActorInfo.

* Rename actor_info -> ActorInfo.

* Rename retry_info -> RetryInfo.

* Rename actor_notification_table_subscribe_data -> ActorNotificationTableSubscribeData.

* Rename local_scheduler_table_send_info_data -> LocalSchedulerTableSendInfoData.

* Rename table_callback_data -> TableCallbackData.

* Rename object_info_subscribe_data -> ObjectInfoSubscribeData.

* Rename local_scheduler_table_subscribe_data -> LocalSchedulerTableSubscribeData.

* Rename more redis call data structures.

* Rename photon_conn -> PhotonConnection.

* Rename photon_mock -> PhotonMock.

* Fix formatting errors.
2017-02-26 00:32:43 -08:00
Stephanie Wang
be1618f041 Availability after worker failure (#316)
* Availability after a killed worker

* Workers exit cleanly

* Memory cleanup in photon C tests

* Worker failure in multinode

* Consolidate worker cleanup handlers

* Update the result table before handling a task submission

* KILL_WORKER_TIMEOUT -> KILL_WORKER_TIMEOUT_MILLISECONDS

* Log a warning instead of crashing if no result table entry found
2017-02-25 20:19:36 -08:00
Robert Nishihara
232601f90d Change all table calls to use default retry behavior. (#312)
* Change all table calls to use default retry behavior and change default retry behavior.

* Add warning for table retries.
2017-02-24 12:41:32 -08:00
Robert Nishihara
aa174e6311 Fix global scheduler test failure. (#314) 2017-02-24 11:05:45 -08:00
Robert Nishihara
7f5be96683 Remove object table tests that are failing. (#310) 2017-02-23 13:39:59 -08:00
Alexey Tumanov
3159a78ad7 terminate photon task dispatch early when workers or resources are unavailable (#311)
* terminate photon task dispatch early when no workers or resources available

* style
2017-02-23 00:05:16 -08:00
Robert Nishihara
54238c4ad0 Propagate errors from importing actors. (#309)
* Propagate errors from importing actors.

* Fix bug.
2017-02-22 15:15:45 -08:00
Robert Nishihara
a6bf16f6a9 Make global scheduler periodically resubmit tasks that can't be sched… (#306)
* Make global scheduler periodically resubmit tasks that can't be scheduled because their resource requirements are not met.

* Address comments and fix bug.

* Rename impossible_tasks -> pending_tasks.

* Fix formatting.
2017-02-21 23:15:46 -08:00
Robert Nishihara
e399f57e6b Let actors use GPUs. (#302)
* Add num_cpus and num_gpus to actor decorator.

* Assign GPU IDs to actors.

* Add additional actor test.

* Remove duplicated line.

* Factor out local scheduler selection method.

* Add test and simplify local scheduler selection.
2017-02-21 01:13:04 -08:00
Robert Nishihara
3e67d28922 Address numbuf compiler warnings. (#300) 2017-02-20 22:42:03 -08:00
Stephanie Wang
334aed9fa9 Fetch the object after requesting reconstruction during ray.get (#301)
* Fetch the object after requesting reconstruction during ray.get

* revert

* Fix documentation and memory leak

* Fix hanging reconstruction bug

* Fix for python3
2017-02-20 21:41:34 -08:00
Robert Nishihara
2220a33b62 In UI, add timing information for tasks and show cluster scheduling. (#297)
* In UI, add timing information for tasks and show cluster scheduling.

* Factor out html generation as function.
2017-02-19 15:12:08 -08:00
Robert Nishihara
124baa7472 Fix bug in redis module tests. (#292)
* Fix bug in redis module tests.

* Sleep while waiting for next message.
2017-02-18 00:55:57 -08:00
Robert Nishihara
abd9987e3b Fix unreliable actor test. (#295) 2017-02-18 00:51:08 -08:00
Stephanie Wang
67c591c33b Retry connections in photon connect, consolidate code in io.c (#294) 2017-02-17 23:41:21 -08:00
Philipp Moritz
9973a6e37c fix bug in numbuf serialization (#296) 2017-02-17 23:35:41 -08:00
Stephanie Wang
a0dd3a44c0 Dynamically grow worker pool to partially solve hanging workloads (#286)
* First pass at a policy to solve deadlock

* Address Robert's comments

* stress test

* unit test

* Fix test cases

* Fix test for python3

* add more logging

* White space.
2017-02-17 17:08:52 -08:00
Robert Nishihara
0bbf08a4ac Fix test_illegal_put failure in plasma test. (#289)
* Fix test_illegal_put failure in plasma test.

* Check that exactly one plasma manager has died.
2017-02-17 11:06:25 -08:00
Johann Schleier-Smith
c9bc488ee0 Redirect process output to log files (#267)
* redirect process output to log files

* formatting fixes

* Generate all log files in start_ray_processes.

* Fix bug.
2017-02-16 20:34:45 -08:00
Philipp Moritz
dd7e8d9105 Avoid segfaults in arrow if data is too large (#287)
* arrow limits

* more logging

* set the right limit

* update

* simplify

* fix

* account for subsequences

* fixes and deactivate arrow limit tests in travis

* fixes

* Minor formatting.

* Add a couple more tests.
2017-02-16 15:16:20 -08:00
Robert Nishihara
88a5b4e77b Simplify imports and exports and provide driver isolation for remote functions. (#288)
* Remove import counter and export counter.

* Provide isolation between drivers for remote functions.

* Add test for driver function isolation.

* Hash source code into function ID to reduce likelihood of collisions.

* Fix failure test example.

* Replace assertTrue with assertIn to improve failure messages in tests.

* Fix failure test.
2017-02-16 11:30:35 -08:00
Wapaul1
883f945db4 Updated tfutils to use new op naming (#284)
* Updated tfutils to use new op naming

* Reverted tensorflow to 0.12.0
2017-02-15 17:47:53 -08:00
Philipp Moritz
12a68e84d2 Implement a first pass at actors in the API. (#242)
* Implement actor field for tasks

* Implement actor management in local scheduler.

* initial python frontend for actors

* import actors on worker

* IPython code completion and tests

* prepare creating actors through local schedulers

* add actor id to PyTask

* submit actor calls to local scheduler

* starting to integrate

* simple fix

* Fixes from rebasing.

* more work on python actors

* Improve local scheduler actor handlers.

* Pass actor ID to local scheduler when connecting a client.

* first working version of actors

* fixing actors

* fix creating two copies of the same actor

* fix actors

* remove sleep

* get rid of export synchronization

* update

* insert actor methods into the queue in the right order

* remove print statements

* make it compile again after rebase

* Minor updates.

* fix python actor ids

* Pass actor_id to start_worker.

* add test

* Minor changes.

* Update actor tests.

* Temporary plan for import counter.

* Temporarily fix import counters.

* Fix some tests.

* Fixes.

* Make actor creation non-blocking.

* Fix test?

* Fix actors on Python 2.

* fix rare case.

* Fix python 2 test.

* More tests.

* Small fixes.

* Linting.

* Revert tensorflow version to 0.12.0 temporarily.

* Small fix.

* Enhance inheritance test.
2017-02-15 00:10:05 -08:00
Robert Nishihara
072eadd57f Pipe num_cpus and num_gpus through from start_ray.py. (#275)
* Pipe num_cpus and num_gpus through from start_ray.py.

* Improve load balancing tests.

* Fix bug.

* Factor out some testing code.
2017-02-13 17:43:23 -08:00
Robert Nishihara
3934d5f6eb Remove old files and remove old documentation for copying files around cluster. (#274) 2017-02-13 11:20:04 -08:00
Robert Nishihara
cb7f6ca9b5 Attempt to start web UI when starting Ray. (#269)
* Attempt to start web UI when starting Ray.

* Add instructions for using web UI to cluster documentation.

* Don't check if port 8080 is open.

* Remove print statement.
2017-02-12 15:17:58 -08:00
Robert Nishihara
f6ce9dfa6c Allow start_ray.sh to take an object manager port. (#272)
* Allow start_ray.sh to take an object manager port.

* Fix typo and add test.

* Small cleanups.
2017-02-12 12:39:32 -08:00
Johann Schleier-Smith
7bf80b6b22 bug fix on printing exception traceback (#268) 2017-02-10 21:05:05 -08:00
Stephanie Wang
2b8e6485e3 Start and clean up workers from the local scheduler. (#250)
* Start and clean up workers from the local scheduler

Ability to kill workers in photon scheduler

Test for old method of starting workers

Common codepath for killing workers

Common codepath for killing workers

Photon test case for starting and killing workers

fix build

Fix component failure test

Register a worker's pid as part of initial connection

Address comments and revert photon_connect

Set PATH during travis install

Fix

* Fix photon test case to accept clients on plasma manager fd
2017-02-10 12:46:23 -08:00
Robert Nishihara
ec175b7dfb Check if processes are alive in test. (#261) 2017-02-09 23:40:39 -08:00
Robert Nishihara
249b667b0e Raise exception in Python if wait is called with duplicate object IDs. (#262) 2017-02-09 23:32:19 -08:00
Robert Nishihara
0aa234fb9c Fix CXX numbuf error message for Anaconda 3.6. (#258) 2017-02-09 23:29:43 -08:00
Johann Schleier-Smith
883bedf46e Add documentation for upgrading a Ray cluster. (#256)
* add documentation for upgrading a Ray cluster

* Update documentation and link to it from README.
2017-02-09 11:55:37 -08:00
Alexey Tumanov
dfb6107b22 General attribute-based heterogeneity support with hard and soft constraints (#248)
* attribute-based heterogeneity-awareness in global scheduler and photon

* minor post-rebase fix

* photon: enforce dynamic capacity constraint on task dispatch

* globalsched: cap the number of times we try to schedule a task in round robin

* propagating ability to specify resource capacity to ray.init

* adding resources to remote function export and fetch/register

* globalsched: remove unused functions; update cached photon resource capacity (until next photon heartbeat)

* Add some integration tests.

* globalsched: cleanup + factor out constraint checking

* lots of style

* task_spec_required_resource: global refactor

* clang format

* clang format + comment update in photon

* clang format photon comment

* valgrind

* reduce verbosity for Travis

* Add test for scheduler load balancing.

* addressing comments

* refactoring global scheduler algorithm

* Minor cleanups.

* Linting.

* Fix array_test.py and linting.

* valgrind fix for photon tests

* Attempt to fix stress tests.

* fix hashmap free

* fix hashmap free comment

* memset photon resource vectors to 0 in case they get used before the first heartbeat

* More whitespace changes.

* Undo whitespace error I introduced.
2017-02-09 01:34:14 -08:00
Wapaul1
1a7e1c47cb Added example for compute grads in ray tutorial (#238)
* Added example for compute grads in ray

* Added formatting

* Removed need for placeholders in apply gradient

* Streamlined examples

* Fixed docs

* Added formatting

* Removed old references

* Simplified code some

* Addressed comments

* Changes to first code block

* Added test for training and updated code snippets

* Formatting

* Removed mean

* Removed all mention of mean

* Added comments

* Added comments
2017-02-07 18:07:21 -08:00
Robert Nishihara
1fec94ef00 Display drivers in web UI. (#252)
* Display drivers in web UI.

* Display more rows in grid and factor out function in webui backend.
2017-02-07 14:21:25 -08:00
Philipp Moritz
fefc7d9b49 fix segfault in photon.Task (#253) 2017-02-07 11:17:11 -08:00
Robert Nishihara
2d1c980ad7 Refactor local scheduler to remove worker indices. (#245)
* Refactor local scheduler to remove worker indices.

* Change scheduling state enum to int in all function signatures.

* Bug fix, don't use pointers into a resizable array.

* Remove total_num_workers.

* Fix tests.
2017-02-05 14:52:28 -08:00
Philipp Moritz
ca254b8689 Fix stack overflow if many objects are fetched. (#237)
* fix stack overflow if many objects are fetched

* fix other stack allocations

* add tests and fix linting

* address stephanie's comments

* fix linting

* fix tests
2017-02-04 16:49:36 -08:00
Johann Schleier-Smith
e5a9fc0032 Cluster setup instructions (#233)
* start updating cluster documentation with parallel ssh

* add using ray on a large cluster

* revert changes to using ray on a cluster

* update cluster documentation

* update title

* Some formatting changes, and added some notes.

* clarification

* Add warning about public versus private IP addresses.

* Typos and wording.

* Clarifications.

* Clarifications.
2017-02-02 16:10:26 -08:00
Robert Nishihara
7a7e14ef85 Visualize recent tasks in timeline. (#240) 2017-02-02 15:53:56 -08:00
Stephanie Wang
241b539ff8 Reconstruction for evicted objects (#181)
* First pass at reconstruction in the worker

Modify reconstruction stress testing to start Plasma service before rest of Ray cluster

TODO about reconstructing ray.puts

Fix ray.put error for double creates

Distinguish between empty entry and no entry in object table

Fix test case

Fix Python test

Fix tests

* Only call reconstruct on objects we have not yet received

* Address review comments

* Fix reconstruction for Python3

* remove unused code

* Address Robert's comments, stress tests are crashing

* Test and update the task's scheduling state to suppress duplicate reconstruction requests.

* Split result table into two lookups, one for task ID and the other as a test-and-set for the task state

* Fix object table tests

* Fix redis module result_table_lookup test case

* Multinode reconstruction tests

* Fix python3 test case

* rename

* Use new start_redis

* Remove unused code

* lint

* indent

* Address Robert's comments

* Use start_redis from ray.services in state table tests

* Remove unnecessary memset
2017-02-01 19:18:46 -08:00
Robert Nishihara
f69d4aaaa7 Change fetch requests in plasma manager to use a single timer. (#234)
* Change fetch requests in plasma manager to use a single timer.

* Fix manager tests, other cleanups.
2017-02-01 12:21:52 -08:00
Johann Schleier-Smith
6ad2b5d87a Add Redis port option to startup script (#232)
* specify redis address when starting head

* cleanup

* update starting cluster documentation

* Whitespace.

* Address Philipp's comments.

* Change redis_host -> redis_ip_address.
2017-01-31 00:28:00 -08:00
Wapaul1
db7297865f Added functionality for retrieving variables from control dependencies (#220)
* Added test for retrieving variables from an optimizer

* Added comments to test

* Addressed comments

* Fixed travis bug

* Added fix to circular controls

* Added set for explored operations and duplicate prefix stripping

* Removed embedded ipython

* Removed prefix, use separate graph for each network

* Removed redundant imports

* Addressed comments and added separate graph to initializer

* fix typos

* get rid of prefix in documentation
2017-01-30 19:17:42 -08:00
Robert Nishihara
6703f7be6f Provide functionality for local scheduler to start new workers. (#230)
* Provide functionality for local scheduler to start new workers.

* Pass full command for starting new worker in to local scheduler.

* Separate out configuration state of local scheduler.
2017-01-27 01:28:48 -08:00
Stephanie Wang
a5c8f28f33 Plasma subscribe (#227)
* Use object_info as notification, not just the object_id

* Add a regression test for plasma managers connecting to store after some objects have been created

* Send notifications for existing objects to new plasma subscribers

* Continuously try the request to the plasma manager instead of setting a timeout in the test case

* Use ray.services to start Redis in plasma test cases

* fix test case
2017-01-25 22:57:15 -08:00
Robert Nishihara
ab8c3432f7 Add driver ID to task spec and add driver ID to Python error handling. (#225)
* Add driver ID to task spec and add driver ID to Python error handling.

* Make constants global variables.

* Add test for error isolation.
2017-01-25 22:53:48 -08:00
Stephanie Wang
3c6686db08 Photon optimizations (#219)
* Optimizations:
- Track mapping of missing object to dependent tasks to avoid iterating over task queue
- Perform all fetch requests for missing objects using the same timer

* Fix bug and add regression test

* Record task dependencies and active fetch requests in the same hash table

* fix typo

* Fix memory leak and add test cases for scheduling when dependencies are evicted

* Fix python3 test case

* Minor details.
2017-01-23 19:44:15 -08:00