Commit graph

673 commits

Author SHA1 Message Date
Robert Nishihara
f6ce9dfa6c Allow start_ray.sh to take an object manager port. (#272)
* Allow start_ray.sh to take a object manager port.

* Fix typo and add test.

* Small cleanups.
2017-02-12 12:39:32 -08:00
Johann Schleier-Smith
7bf80b6b22 bug fix on printing exception traceback (#268) 2017-02-10 21:05:05 -08:00
Stephanie Wang
2b8e6485e3 Start and clean up workers from the local scheduler. (#250)
* Start and clean up workers from the local scheduler

Ability to kill workers in photon scheduler

Test for old method of starting workers

Common codepath for killing workers

Common codepath for killing workers

Photon test case for starting and killing workers

fix build

Fix component failure test

Register a worker's pid as part of initial connection

Address comments and revert photon_connect

Set PATH during travis install

Fix

* Fix photon test case to accept clients on plasma manager fd
2017-02-10 12:46:23 -08:00
Robert Nishihara
249b667b0e Raise exception in Python if wait is called with duplicate object IDs. (#262) 2017-02-09 23:32:19 -08:00
Robert Nishihara
0aa234fb9c Fix CXX numbuf error message for Anaconda 3.6. (#258) 2017-02-09 23:29:43 -08:00
Alexey Tumanov
dfb6107b22 General attribute-based heterogeneity support with hard and soft constraints (#248)
* attribute-based heterogeneity-awareness in global scheduler and photon

* minor post-rebase fix

* photon: enforce dynamic capacity constraint on task dispatch

* globalsched: cap the number of times we try to schedule a task in round robin

* propagating ability to specify resource capacity to ray.init

* adding resources to remote function export and fetch/register

* globalsched: remove unused functions; update cached photon resource capacity (until next photon heartbeat)

* Add some integration tests.

* globalsched: cleanup + factor out constraint checking

* lots of style

* task_spec_required_resource: global refactor

* clang format

* clang format + comment update in photon

* clang format photon comment

* valgrind

* reduce verbosity for Travis

* Add test for scheduler load balancing.

* addressing comments

* refactoring global scheduler algorithm

* Minor cleanups.

* Linting.

* Fix array_test.py and linting.

* valgrind fix for photon tests

* Attempt to fix stress tests.

* fix hashmap free

* fix hashmap free comment

* memset photon resource vectors to 0 in case they get used before the first heartbeat

* More whitespace changes.

* Undo whitespace error I introduced.
2017-02-09 01:34:14 -08:00
Wapaul1
1a7e1c47cb Added example for compute grads in ray tutorial (#238)
* Added example for compute grads in ray

* Added formatting

* Removed need for placeholders in apply gradient

* Streamlined examples

* Fixed docs

* Added formatting

* Removed old references

* Simplified code some

* Addressed comments

* Changes to first code block

* Added test for training and updated code snippets

* Formatting

* Removed mean

* Removed all mention of mean

* Added comments

* Added comments
2017-02-07 18:07:21 -08:00
Robert Nishihara
1fec94ef00 Display drivers in web UI. (#252)
* Display drivers in web UI.

* Display more rows in grid and factor out function in webui backend.
2017-02-07 14:21:25 -08:00
Stephanie Wang
241b539ff8 Reconstruction for evicted objects (#181)
* First pass at reconstruction in the worker

Modify reconstruction stress testing to start Plasma service before rest of Ray cluster

TODO about reconstructing ray.puts

Fix ray.put error for double creates

Distinguish between empty entry and no entry in object table

Fix test case

Fix Python test

Fix tests

* Only call reconstruct on objects we have not yet received

* Address review comments

* Fix reconstruction for Python3

* remove unused code

* Address Robert's comments, stress tests are crashing

* Test and update the task's scheduling state to suppress duplicate
reconstruction requests.

* Split result table into two lookups, one for task ID and the other as a
test-and-set for the task state

* Fix object table tests

* Fix redis module result_table_lookup test case

* Multinode reconstruction tests

* Fix python3 test case

* rename

* Use new start_redis

* Remove unused code

* lint

* indent

* Address Robert's comments

* Use start_redis from ray.services in state table tests

* Remove unnecessary memset
2017-02-01 19:18:46 -08:00
Johann Schleier-Smith
6ad2b5d87a Add Redis port option to startup script (#232)
* specify redis address when starting head

* cleanup

* update starting cluster documentation

* Whitespace.

* Address Philipp's comments.

* Change redis_host -> redis_ip_address.
2017-01-31 00:28:00 -08:00
Wapaul1
db7297865f Added functionality for retrieving variables from control dependencies (#220)
* Added test for retriving variables from an optimizer

* Added comments to test

* Addressed comments

* Fixed travis bug

* Added fix to circular controls

* Added set for explored operations and duplicate prefix stripping

* Removed embeded ipython

* Removed prefix, use seperate graph for each network

* Removed redundant imports

* Addressed comments and added separate graph to initializer

* fix typos

* get rid of prefix in documentation
2017-01-30 19:17:42 -08:00
Robert Nishihara
6703f7be6f Provide functionality for local scheduler to start new workers. (#230)
* Provide functionality for local scheduler to start new workers.

* Pass full command for starting new worker in to local scheduler.

* Separate out configuration state of local scheduler.
2017-01-27 01:28:48 -08:00
Stephanie Wang
a5c8f28f33 Plasma subscribe (#227)
* Use object_info as notification, not just the object_id

* Add a regression test for plasma managers connecting to store after some objects have been created

* Send notifications for existing objects to new plasma subscribers

* Continuously try the request to the plasma manager instead of setting a timeout in the test case

* Use ray.services to start Redis in plasma test cases

* fix test case
2017-01-25 22:57:15 -08:00
Robert Nishihara
ab8c3432f7 Add driver ID to task spec and add driver ID to Python error handling. (#225)
* Add driver ID to task spec and add driver ID to Python error handling.

* Make constants global variables.

* Add test for error isolation.
2017-01-25 22:53:48 -08:00
Stephanie Wang
3c6686db08 Photon optimizations (#219)
* Optimizations:
- Track mapping of missing object to dependent tasks to avoid iterating over task queue
- Perform all fetch requests for missing objects using the same timer

* Fix bug and add regression test

* Record task dependencies and active fetch requests in the same hash table

* fix typo

* Fix memory leak and add test cases for scheduling when dependencies are evicted

* Fix python3 test case

* Minor details.
2017-01-23 19:44:15 -08:00
Richard Liaw
4575cd88b2 Improve error messages when nodes can't communicate with each other. (#223)
* Good error messages when nodes can't communicate with each other

* Print more information when starting the head node.

* Change retries back to 5.
2017-01-22 14:53:15 -08:00
Robert Nishihara
9bb8162621 Improvements to documentation and error messages. (#221) 2017-01-19 20:27:46 -08:00
Robert Nishihara
b98a63fd3a Change get to take a timeout and multiple object IDs. (#212)
* Change plasma_get to take a timeout and an array of object IDs.

* Address comments.

* Bug fix related to computing object hashes.

* Add test.

* Fix file descriptor leak.

* Fix valgrind.

* Formatting.

* Remove call to plasma_contains from the plasma client. Use timeout internally in ray.get.

* small fixes
2017-01-19 12:21:12 -08:00
Robert Nishihara
58570b4981 Give better error message if ray.put or ray.get are called before ray.init. (#214) 2017-01-18 22:37:37 -08:00
Stephanie Wang
f1987cdc16 Split local scheduler task queue (#211)
* Split local scheduler task queue into waiting and dispatch queue

* Fix memory leak

* Add a new task scheduling status for when a task has been queued locally

* Fix global scheduler test case and add task status doc

* Documentation

* Address Philipp's comments

* Move tasks back to the waiting queue if their dependencies become unavailable

* Update existing task table entries instead of overwriting
2017-01-18 20:27:40 -08:00
Wapaul1
6fe69bec11 Selects from all variables now independent of graph, and uses standar… (#199)
* Smarter variable retrieval and doc update

* doc update and small fixes

* addressing robert's comments
2017-01-18 17:36:58 -08:00
Robert Nishihara
303d0fed3e Prevent plasma store and manager from dying when a client dies. (#203)
* Prevent plasma store and manager from dying when a worker dies.

* Check errno inside of warn_if_sigpipe. Passing in errno doesn't work because the arguments to warn_if_sigpipe can be evaluated out of order.
2017-01-17 20:34:31 -08:00
Philipp Moritz
a708e36225 Switch build system to use CMake completely. (#200)
* switch to CMake completely

...

* cleanup

* Run C tests, update installation instructions.
2017-01-17 16:56:40 -08:00