Commit graph

132 commits

Author SHA1 Message Date
Philipp Moritz
28f0882387 Expose function table to python global control state API (#542)
* expose function table to python global control state API

* fix

* fix linting

* add test for function table
2017-05-16 20:06:13 -07:00
Robert Nishihara
ec2534422b Remove register_class from API. (#550)
* Perform ray.register_class under the hood.

* Fix bug.

* Release worker lock when waiting for imports to arrive in get.

* Remove calls to register_class from examples and tests.

* Clear serialization state between tests.

* Fix bug and add test for multiple custom classes with same name.

* Fix failure test.

* Fix linting and cleanups to python code.

* Fixes to documentation.

* Implement recursion depth for recursively registering classes.

* Fix linting.

* Push warning to user if waiting for class for too long.

* Fix typos.

* Don't export FunctionToRun if pickling the function fails.

* Don't broadcast class definition when pickling class.
2017-05-16 18:38:52 -07:00
Eric Liang
e2e9e4ce6f Fix segmentation fault when calling ray.put on a dictionary with object keys (#548)
* fix segfault when serializing dict key

* fix style

* fix test

* Fix linting.
2017-05-15 01:09:13 -07:00
Robert Nishihara
c688a64235 Expose GPU IDs to remote functions. (#496)
* Change local scheduler bookkeeping to use GPU IDs.

* Update actor test.

* Add tests for actors and tasks simultaneously using GPUs.

* Add additional task GPU ID test.

* Fix linting.

* Make redis GPU assignment ignore GPU IDs.

* Small fix.
2017-05-07 13:03:49 -07:00
Robert Nishihara
8532ba4272 Serialize lambdas, sets, and types with pickle by default. (#511)
* Serialize lambdas with pickle by default.

* Serialize sets with pickle by default.

* Serialize types with pickle by default.

* Small update to documentation.

* Update tests.
2017-05-04 00:16:35 -07:00
Robert Nishihara
0ac125e9b2 Clean up when a driver disconnects. (#462)
* Clean up state when drivers exit.

* Remove unnecessary field in ActorMapEntry struct.

* Have monitor release GPU resources in Redis when driver exits.

* Enable multiple drivers in multi-node tests and test driver cleanup.

* Make redis GPU allocation a redis transaction and small cleanups.

* Fix multi-node test.

* Small cleanups.

* Make global scheduler take node_ip_address so it appears in the right place in the client table.

* Cleanups.

* Fix linting and cleanups in local scheduler.

* Fix removed_driver_test.

* Fix bug related to vector -> list.

* Fix linting.

* Cleanup.

* Fix multi node tests.

* Fix jenkins tests.

* Add another multi node test with many drivers.

* Fix linting.

* Make the actor creation notification a flatbuffer message.

* Revert "Make the actor creation notification a flatbuffer message."

This reverts commit af99099c8084dbf9177fb4e34c0c9b1a12c78f39.

* Add comment explaining flatbuffer problems.
2017-04-24 18:10:21 -07:00
Philipp Moritz
8ac6c59931 Remove n^2 algorithm in plasma get (#466)
Remove n^2 algorithm in plasma get.
2017-04-17 23:37:33 -07:00
Richard Liaw
c3a2505ffd Loadbalancing Test issue (#452)
* Limiting number of CPUs in loadbalancing test

* fixes as requested
2017-04-11 22:33:58 -07:00
Robert Nishihara
f4c1adae17 Unify function signature handling between remote functions and actor … (#441)
* Unify function signature handling between remote functions and actor methods.

* Fixes.

* Fix tests.
2017-04-08 21:34:13 -07:00
Robert Nishihara
7af6f462fb Add API for querying global control state. (#431)
* Add API for querying global control state.

* Fix linting.

* Fix errors in Python 2.

* Fix bug in test.

* Fix bug in test.
2017-04-06 23:51:12 -07:00
Robert Nishihara
ba02fc0eb0 Run flake8 in Travis and make code PEP8 compliant. (#387) 2017-03-21 12:57:54 -07:00
Robert Nishihara
3b7788bf88 Disallow calling ray.put on an object ID. (#353) 2017-03-11 12:09:28 -08:00
Stephanie Wang
a0dd3a44c0 Dynamically grow worker pool to partially solve hanging workloads (#286)
* First pass at a policy to solve deadlock

* Address Robert's comments

* stress test

* unit test

* Fix test cases

* Fix test for python3

* add more logging

* White space.
2017-02-17 17:08:52 -08:00
Robert Nishihara
88a5b4e77b Simplify imports and exports and provide driver isolation for remote functions. (#288)
* Remove import counter and export counter.

* Provide isolation between drivers for remote functions.

* Add test for driver function isolation.

* Hash source code into function ID to reduce likelihood of collisions.

* Fix failure test example.

* Replace assertTrue with assertIn to improve failure messages in tests.

* Fix failure test.
2017-02-16 11:30:35 -08:00
Philipp Moritz
12a68e84d2 Implement a first pass at actors in the API. (#242)
* Implement actor field for tasks

* Implement actor management in local scheduler.

* initial python frontend for actors

* import actors on worker

* IPython code completion and tests

* prepare creating actors through local schedulers

* add actor id to PyTask

* submit actor calls to local scheduler

* starting to integrate

* simple fix

* Fixes from rebasing.

* more work on python actors

* Improve local scheduler actor handlers.

* Pass actor ID to local scheduler when connecting a client.

* first working version of actors

* fixing actors

* fix creating two copies of the same actor

* fix actors

* remove sleep

* get rid of export synchronization

* update

* insert actor methods into the queue in the right order

* remove print statements

* make it compile again after rebase

* Minor updates.

* fix python actor ids

* Pass actor_id to start_worker.

* add test

* Minor changes.

* Update actor tests.

* Temporary plan for import counter.

* Temporarily fix import counters.

* Fix some tests.

* Fixes.

* Make actor creation non-blocking.

* Fix test?

* Fix actors on Python 2.

* fix rare case.

* Fix python 2 test.

* More tests.

* Small fixes.

* Linting.

* Revert tensorflow version to 0.12.0 temporarily.

* Small fix.

* Enhance inheritance test.
2017-02-15 00:10:05 -08:00
Robert Nishihara
072eadd57f Pipe num_cpus and num_gpus through from start_ray.py. (#275)
* Pipe num_cpus and num_gpus through from start_ray.py.

* Improve load balancing tests.

* Fix bug.

* Factor out some testing code.
2017-02-13 17:43:23 -08:00
Stephanie Wang
2b8e6485e3 Start and clean up workers from the local scheduler. (#250)
* Start and clean up workers from the local scheduler

Ability to kill workers in photon scheduler

Test for old method of starting workers

Common codepath for killing workers

Common codepath for killing workers

Photon test case for starting and killing workers

fix build

Fix component failure test

Register a worker's pid as part of initial connection

Address comments and revert photon_connect

Set PATH during travis install

Fix

* Fix photon test case to accept clients on plasma manager fd
2017-02-10 12:46:23 -08:00
Robert Nishihara
249b667b0e Raise exception in Python if wait is called with duplicate object IDs. (#262) 2017-02-09 23:32:19 -08:00
Alexey Tumanov
dfb6107b22 General attribute-based heterogeneity support with hard and soft constraints (#248)
* attribute-based heterogeneity-awareness in global scheduler and photon

* minor post-rebase fix

* photon: enforce dynamic capacity constraint on task dispatch

* globalsched: cap the number of times we try to schedule a task in round robin

* propagating ability to specify resource capacity to ray.init

* adding resources to remote function export and fetch/register

* globalsched: remove unused functions; update cached photon resource capacity (until next photon heartbeat)

* Add some integration tests.

* globalsched: cleanup + factor out constraint checking

* lots of style

* task_spec_required_resource: global refactor

* clang format

* clang format + comment update in photon

* clang format photon comment

* valgrind

* reduce verbosity for Travis

* Add test for scheduler load balancing.

* addressing comments

* refactoring global scheduler algorithm

* Minor cleanups.

* Linting.

* Fix array_test.py and linting.

* valgrind fix for photon tests

* Attempt to fix stress tests.

* fix hashmap free

* fix hashmap free comment

* memset photon resource vectors to 0 in case they get used before the first heartbeat

* More whitespace changes.

* Undo whitespace error I introduced.
2017-02-09 01:34:14 -08:00
Robert Nishihara
9bb8162621 Improvements to documentation and error messages. (#221) 2017-01-19 20:27:46 -08:00
Robert Nishihara
b98a63fd3a Change get to take a timeout and multiple object IDs. (#212)
* Change plasma_get to take a timeout and an array of object IDs.

* Address comments.

* Bug fix related to computing object hashes.

* Add test.

* Fix file descriptor leak.

* Fix valgrind.

* Formatting.

* Remove call to plasma_contains from the plasma client. Use timeout internally in ray.get.

* small fixes
2017-01-19 12:21:12 -08:00
Robert Nishihara
87d8d05792 Rename reusable variables -> environment variables. (#195) 2017-01-10 20:14:33 -08:00
Robert Nishihara
be4a37bf37 Various cleanups: remove start_ray_local from ray.init, remove unused code, fix "pip install numbuf". (#193)
* Remove start_ray_local from ray.init and change default number of workers to 10.

* Remove alexnet example.

* Move array methods to experimental.

* Remove TRPO example.

* Remove old files.

* Compile plasma when we build numbuf.

* Address comments.
2017-01-10 17:35:27 -08:00
Robert Nishihara
973716d310 Use cloudpickle 0.2.2. (#189) 2017-01-08 17:30:06 -08:00
Robert Nishihara
651aa6007a Log profiling information from worker. (#178)
* Log timing events on workers.

* Have workers log to the event log through the local scheduler.

* Fixes and address comments.

* bug fix

* styling
2017-01-05 16:47:16 -08:00
Robert Nishihara
8d90c9f432 Experimental utils for copying directories to other machines in the c… (#150)
* Experimental utils for copying directories to other machines in the cluster using Ray.

* Test copying directory functionality.

* Small fix.
2016-12-23 00:43:16 -08:00
Robert Nishihara
86b211f5c2 Give run_function_on_all_workers to take a worker_info dictionary including a counter. (#149)
* Suppress Redis warnings and remove some global scheduler logging.

* Pass a counter into run_function_on_all_workers indicating how many workers have begun executing this function.
2016-12-22 22:05:58 -08:00
Robert Nishihara
79dd1815a2 Python 3 compatibility. (#121)
* Make common module Python 3 compatible.

* Make plasma module Python 3 compatible.

* Make photon module Python 3 compatible.

* Make numbuf module Python 3 compatible.

* Remaining changes for Python 3 compatibility.

* Test Python 3 in Travis.

* Fixes.
2016-12-16 14:40:37 -08:00
Robert Nishihara
ddba1df802 Start working toward Python3 compatibility. (#117) 2016-12-11 12:25:31 -08:00
Philipp Moritz
58e8bbcb34 Fix bug in serializing arguments of tasks that are more complex objects (#72)
* Give more informative error message when we do not know how to serialize a class.

* Check that passing arguments to remote functions and getting them does not change their values.

* fix serialization bug

* fix tests for common module

* Formatting.

* Bug fix in init_pickle_module signature.

* Use pickle with HIGHEST_PROTOCOL.
2016-11-30 23:21:53 -08:00
Robert Nishihara
d77b685a90 Global scheduler skeleton (#45)
* Initial scheduler commit

* global scheduler

* add global scheduler

* Implement global scheduler skeleton.

* Formatting.

* Allow local scheduler to be started without a connection to redis so that we can test it without a global scheduler.

* Fail if there are no local schedulers when the global scheduler receives a task.

* Initialize uninitialized value and formatting fix.

* Generalize local scheduler table to db client table.

* Remove code duplication in local scheduler and add flag for whether a task came from the global scheduler or not.

* Queue task specs in the local scheduler instead of tasks.

* Simple global scheduler tests, including valgrind.

* Factor out functions for starting processes.

* Fixes.
2016-11-18 19:57:51 -08:00
Robert Nishihara
336a904404 Implement repr, hash, and richcompare for ObjectIDs. (#33)
* Implement repr, hash, and richcompare for ObjectIDs.

* Addressing comments.

* Partially fix example applications.
2016-11-11 09:18:36 -08:00
Robert Nishihara
90f88af902 Fix bug in which worker import counters were treated incorrectly. (#28)
* Fix bug in which worker import counters were treated incorrectly.

* Fix bug in which cached functions-to-run were double counted as exports. This also runs the functions-to-run on the driver only after ray.init is called.

* Only define reusable variables locally after ray.init has been called.

* Remove flaky reference counting tests. It's not clear that these tests make sense.

* Make numbuf pip install verbose.

* Export cached reusable variables before cached remote functions.

* Fix bug causing the worker to hang sometimes. This happens when the worker is trying to run a task, but it hasn't imported enough imports to run the task, so it continually acquires and releases a lock while checking if it has enough imports. However, for some reason, the import thread is waiting to acquire the same lock and never does so (or takes a very long time to do so). By dropping the lock before sleeping, this makes it easier for other threads to acquire the lock.

* Acquire locks using 'with' statements.

* Fix possible test failure.

* Try to start Redis multiple times with different random ports if the original attempt failed.

* Fix test in which we redefine a remote function.
2016-11-06 22:24:39 -08:00
Robert Nishihara
072f442c1f Update worker.py and services.py to use plasma and the local scheduler. (#19)
* Update worker code and services code to use plasma and the local scheduler.

* Cleanups.

* Fix bug in which threads were started before the worker mode was set. This caused remote functions to be defined on workers before the worker knew it was in WORKER_MODE.

* Fix bug in install-dependencies.sh.

* Lengthen timeout in failure_test.py.

* Cleanups.

* Cleanup services.start_ray_local.

* Clean up random name generation.

* Cleanups.
2016-11-02 00:39:35 -07:00
Robert Nishihara
09a3ff7173 Pip install numbuf. (#8) 2016-10-28 14:30:20 -07:00
Robert Nishihara
0a44145906 Fix the resetting of reusable variables on the driver and cache functions to run on all workers. (#446)
* Properly reset reusable variables on the driver when remote functions are run locally on the driver.

* Cache functions to run on all workers that occur before ray.init is called.
2016-10-12 22:17:22 -07:00
Robert Nishihara
9a6991116f Small fix in test. (#441) 2016-09-25 23:08:27 -07:00
Robert Nishihara
de6ec47f9e Add a recursion depth for serialization to prevent infinite loops. (#440) 2016-09-19 17:17:42 -07:00
Robert Nishihara
91f16a3df0 Migrate repositories to ray-project. (#438)
* Migrate repositories to ray-project.

* Update numbuf to the migrated version.
2016-09-17 00:52:05 -07:00
Robert Nishihara
1aa89a4ae6 Update numbuf to properly handle Python floats. (#435) 2016-09-15 15:44:11 -07:00
Wapaul1
d5815673a5 Changed ray.select() to ray.wait() and its functionality (#426)
* Re-implemented select, changed name to wait

* Changed tests for select to tests for wait

* Updated the hyperopt example to match wait

* Small fixes and improve example readme.

* Make tests pass.
2016-09-14 17:14:11 -07:00
Robert Nishihara
3b47a15ebd Fix naming in tests. (#424) 2016-09-10 21:12:09 -07:00
Robert Nishihara
ba56b08474 Reintroduce passing arguments by value to remote functions. (#425)
* Reintroduce passing arguments by value to remote functions.

* Check size of arguments passed by value.

* Fix computation graph visualization.
2016-09-10 21:11:18 -07:00
Robert Nishihara
0191d42751 Check in runtest.py that the correct version of cloudpickle is installed. (#421) 2016-09-09 16:46:18 -07:00
Robert Nishihara
11a8914684 Allow users to serialize custom classes. (#393)
* Allow serialization of custom classes.

* Add documentation and test cases, also fix pickle case.

* Don't allow old-style classes.
2016-09-06 13:28:24 -07:00
Robert Nishihara
d5cb3ac090 Propagate error messages from functions that run on all workers. (#410) 2016-09-06 10:06:43 -07:00
Robert Nishihara
327d7ff689 Fix bug to enable calling ray.get multiple times on same ObjectID. (#409) 2016-09-04 13:32:55 -07:00
Philipp Moritz
68cec55a98 Refcount without modifying objects (#407)
* refcount without modifying objects

* add documentation

* Update tests and documentation.

* Remove extraneous code.

* Update numbuf version.
2016-09-04 12:07:52 -07:00
Robert Nishihara
81f40774a7 Remove ObjectID aliasing from the API. (#406)
* Remove ObjectID aliasing from the API.

* Update documentation to remove aliasing.
2016-09-03 19:34:45 -07:00
Philipp Moritz
3548797202 [API] Implement get for multiple objects (#398)
* [API] Implement get for multiple objects

* Small fixes.
2016-09-02 18:02:44 -07:00