234 commits

Author SHA1 Message Date
Luke
002531b199 Enable LZ4 compression in pyarrow build (#3931)
2019-02-02 14:38:02 -08:00
Yuhong Guo
54cbb4396f Prepare socket file when start ray (#3925) 2019-02-02 12:53:36 +08:00
Richard Liaw
d128636bab Ray Logging Configuration (#3691)
* fix logging for autoscaler

* module logging

* try this for logging

* yapf

* fix

* Initial logging setup

* memory

* ok

* remove basicconfig

* catch

* remove package logging

* print

* fix

* try_fix

* fix 1

* revert rllib

* logging level

* flake8

* fix

* fix

* Remove vestigial TODO
2019-01-30 21:01:12 -08:00
Si-Yuan
48139cf861 Migrate Python C extension to Cython (#3541) 2019-01-24 09:17:14 -08:00
Philipp Moritz
00e9f8d870 Fix pyarrow version (#3760) 2019-01-13 14:28:23 -08:00
Yuhong Guo
d2cf8561f2 Refactor code about ray.ObjectID. (#3674)
* Refactor code about ray.ObjectID.

* remove from_random and use nil_id instead of constructor

* remove id() in hash

* Lint and fix

* Change driver id to ObjectID

* Replace binary_to_hex(ObjectID.id()) with ObjectID.hex()
2019-01-13 01:47:29 -08:00
Robert Nishihara
8723d6b061 Define a Node class to manage Ray processes. (#3733)
* Implement Node class and move most of services.py into it.

* Wait for nodes as they are added to the cluster.

* Fix Redis authentication bug.

* Fix bug in client table ordering.

* Address comments.

* Kill raylet before plasma store in test.

* Minor
2019-01-11 22:30:38 -08:00
Hao Chen
597abb24ea Refine multi-threading support (#3672)
* [Python] refine multi-threading support

fix

* [java] refine multithreading code

fix java

* format
2019-01-10 13:58:11 -08:00
Robert Nishihara
d1e21b702e Change timeout from milliseconds to seconds in ray.wait. (#3706)
* Change timeout from milliseconds to seconds in ray.wait.

* Suppress warning.

* Suppress warning.

* Add prominent warning in API documentation.
2019-01-08 21:32:08 -08:00
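A minimal sketch of how a caller might adapt to the unit change described in this commit. The helper name `timeout_ms_to_seconds` is hypothetical and not part of Ray; it only illustrates converting an old millisecond value before passing it as the `timeout` argument of `ray.wait`.

```python
def timeout_ms_to_seconds(timeout_ms):
    """Convert a millisecond timeout to seconds (hypothetical helper, not a Ray API).

    None is passed through unchanged, on the assumption that it means
    "no timeout" both before and after the change.
    """
    if timeout_ms is None:
        return None
    return timeout_ms / 1000.0


# Old call sites that passed milliseconds would now pass seconds, e.g.
# ray.wait(object_ids, timeout=timeout_ms_to_seconds(1500))
new_timeout = timeout_ms_to_seconds(1500)
```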
Robert Nishihara
5e76d52868 Improve cluster.wait_for_nodes() API. (#3712)
* Separate out functionality for querying client table and improve cluster.wait_for_nodes() API.

* Linting

* Add back logging statements.

* info -> debug
2019-01-07 21:26:58 -08:00
Robert Nishihara
c9d70f0dda Remove num_local_schedulers argument from ray.worker._init. (#3704)
* Remove num_local_schedulers argument from ray.worker._init.

* Fix

* Fix tests.
2019-01-07 12:44:49 -08:00
Yuhong Guo
c9b8ecca51 Add RayParams to refactor the parameters used by ray python. (#3558) 2018-12-29 22:04:27 +08:00
Yuhong Guo
b9e1977fae Fix failure of test_free_objects_multi_node (#3481)
`test_free_objects_multi_node` fails intermittently. If we run this test 20 times, we may find at least one failure.

The cause is that the test is based on function tasks. One raylet may create more than one worker to execute the tasks, so the flush operations may be spread across several workers and fail to clean up all the worker objects held by the plasma client.

This PR changes the function tasks to actor tasks, which guarantees that all the tasks are executed in one worker of a raylet.
2018-12-06 15:55:49 -05:00
Si-Yuan
2e6f9bedf2 Add the extra fallback for serialization (#3468)
* Add the extra fallback for serialization.

* Better comments & warnings. quotes.

* Update test/runtest.py

Co-Authored-By: suquark <suquark@gmail.com>

* Update test/runtest.py

Co-Authored-By: suquark <suquark@gmail.com>

* linting

* Don't hijack too many errors.

* simplify the test

* Update runtest.py

* simplify
2018-12-05 13:09:08 -08:00
Robert Nishihara
3856533065 Fix incompatibility with most recent version of Redis. (#3379)
* Fix incompatibility with most recent version of Redis.

* Fix

* Fixes.
2018-11-24 16:36:38 -08:00
Robert Nishihara
5cbc597494 Suppress duplicate pre-emptive object pushes. (#3276)
* Suppress duplicate pre-emptive object pushes.

* Add test.

* Fix linting

* Remove timer and inline recent_pushes_ into local_objects_.

* Improve test.

* Fix

* Fix linting

* Enable retrying pull from same object manager. Randomize object manager.

* Speed up test

* Linting

* Add test.

* Minor

* Lengthen pull timeout and reissue pull every time a new object becomes available.

* Increase pull timeout in test.

* Wait for nodes to start in object manager test.

* Wait longer for nodes to start up in test.

* Small fixes.

* _submit -> _remote

* Change assert to warning.
2018-11-16 23:02:45 -08:00
Robert Nishihara
d10cb570ab Rename _submit -> _remote. (#3321) 2018-11-15 15:30:18 -08:00
Philipp Moritz
1be1455d86 Fix redis crash when duplicate messages are appended to log. (#3316) 2018-11-15 15:09:39 -08:00
Stephanie Wang
d950e92f63 Allow multiple threads to call ray.get and ray.wait (#3244)
* Handle multiple threads calling ray.get

* Multithreaded ray.wait

* Pass in current task ID in java backend

* Add multithreaded actor to tests, add warning messages to worker for multithreaded ray.get

* Fix test

* Some cleanups

* Improve error message

* Add assertion

* Cleanup, throw error in HandleTaskUnblocked if task not actually blocked

* lint

* Fix python worker reset

* Fix references to reconstruct_objects

* Linting

* java lint

* Fix java

* Fix iterator
2018-11-07 22:39:28 -08:00
Robert Nishihara
1dd5d92789 Enable timeline visualizations of object transfers. (#3255)
* Plot object transfers.

* Linting
2018-11-07 12:45:59 -08:00
Eric Liang
725df3a485 Set the process title in workers and actors (#3219) 2018-11-06 14:59:22 -08:00
Eric Liang
9a0f0db070 Add ray stack tool for debugging (#3213) 2018-11-03 13:13:02 -07:00
Wang Qing
ca7d4c2cf5 Allow the user to specify the driver ID. (#3084) 2018-11-02 19:01:50 -07:00
Robert Nishihara
5822aa2388 Rename get_task -> worker_idle in timeline. (#3179)
* Rename get_task -> worker_idle in timeline.

* Fix test.
2018-11-02 12:08:46 -07:00
Robert Nishihara
1f29a960f4 Update task_table and object_table API. (#3161)
* Update task_table and object_table API.

* Fix
2018-10-31 12:52:50 -07:00
Robert Nishihara
32f0d6b77e Deprecate num_workers argument to ray.init and ray start. (#3114)
* Remove num_workers argument.

* Fix

* Fix
2018-10-28 20:12:49 -07:00
Robert Nishihara
658c14282c Remove legacy Ray code. (#3121)
* Remove legacy Ray code.

* Fix cmake and simplify monitor.

* Fix linting

* Updates

* Fix

* Implement some methods.

* Remove more plasma manager references.

* Fix

* Linting

* Fix

* Fix

* Make sure class IDs are strings.

* Some path fixes

* Fix

* Path fixes and update arrow

* Fixes.

* linting

* Fixes

* Java fixes

* Some java fixes

* TaskLanguage -> Language

* Minor

* Fix python test and remove unused method signature.

* Fix java tests

* Fix jenkins tests

* Remove commented out code.
2018-10-26 13:36:58 -07:00
Robert Nishihara
9c1826ed69 Use XRay backend by default. (#3020)
* Use XRay backend by default.

* Remove irrelevant valgrind tests.

* Fix

* Move tests around.

* Fix

* Fix test

* Fix test.

* String/unicode fix.

* Fix test

* Fix unicode issue.

* Minor changes

* Fix bug in test_global_state.py.

* Fix test.

* Linting

* Try arrow change and other object manager changes.

* Use newer plasma client API

* Small updates

* Revert plasma client api change.

* Update

* Update arrow and allow SendObjectHeaders to fail.

* Update arrow

* Update python/ray/experimental/state.py

Co-Authored-By: robertnishihara <robertnishihara@gmail.com>

* Address comments.
2018-10-23 12:46:39 -07:00
Robert Nishihara
22dd7e0428 Add test for wait reconstruction. (#3110) 2018-10-22 23:16:54 -07:00
Robert Nishihara
ed6289771a Convert runtest.py to use pytest. (#2966)
* Convert runtest.py to use pytest.

* Linting.

* Fix

* Fix

* Fix

* Fix
2018-09-30 07:59:44 -07:00
Hanwei Jin
dc76e51a60 bugfix: cmake copy plasma java lib from lib64 directory in centos (#2885) 2018-09-16 22:32:09 -07:00
Robert Nishihara
f16d33593b Mark worker as blocked and trigger reconstruction in ray.wait. (#2864)
* Trigger reconstruction in ray.wait and mark worker as blocked.

* Add test.

* Linting.

* Don't run new test with legacy Ray.

* Only call HandleClientUnblocked if it actually blocked in ray.wait.

* Reduce time to ray.wait in the test.
2018-09-13 15:28:17 -07:00
Robert Nishihara
3f6ed537a4 Add ray.is_initialized() function. (#2818)
* Add ray.is_initialized() function.

* Add assert.
2018-09-06 21:20:59 -07:00
Yucong He
5b45f0bdff [xray] Implementing Gcs sharding (#2409)
Basically a re-implementation of #2281, with modifications of #2298 (A fix of #2334, for rebasing issues.).
[+] Implement sharding for gcs tables.
[+] Keep ClientTable and ErrorTable managed by the primary_shard. TaskTable is managed by the primary_shard for now, until a good hashing for tasks is implemented.
[+] Move AsyncGcsClient's initialization into Connect function.
[-] Move GetRedisShard and bool sharding from RedisContext's connect into AsyncGcsClient. This may make the interface cleaner.
2018-08-31 15:54:30 -07:00
Robert Nishihara
eda6ebb87d Convert some unittests to pytest. (#2779)
* Convert multi_node_test.py to pytest.

* Convert array_test.py to pytest.

* Convert failure_test.py to pytest.

* Convert microbenchmarks to pytest.

* Convert component_failures_test.py to pytest and some minor quotes changes.

* Convert tensorflow_test.py to pytest.

* Convert actor_test.py to pytest.

* Fix.

* Fix
2018-08-31 11:24:15 -07:00
Robert Nishihara
32f7d6fcf5 Add back some tests for xray. (#2772) 2018-08-30 11:07:23 -07:00
Robert Nishihara
b7722897b4 Deprecate 'driver_mode' argument. (#2758)
* Deprecate 'driver_mode' argument.

* Fix

* Fix
2018-08-28 16:45:49 -07:00
Alexey Tumanov
de047daea7 [xray] raylet scheduling mechanism with a simple spillback policy (#2749)
## What do these changes do?
* distribute load and resource information on a heartbeat
* for each raylet, maintain total and available resource capacity as well as measure of current load
* this PR introduces a new notion of load, defined as a sum of all resource demand induced by queued ready tasks on the local raylet. This provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load.
* modify the scheduling policy to perform *capacity-based*, *load-aware*, *optimistically concurrent* resource allocation
* perform task spillover to the heartbeating node in response to a heartbeat, implementing heterogeneity-aware late-binding/work-stealing.
2018-08-28 00:03:34 -07:00
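The notion of load introduced in this commit (the sum of all resource demand from queued ready tasks on the local raylet) can be sketched roughly as follows. This is an illustrative Python model only, not the raylet's actual implementation; the function name `raylet_load` and the representation of each task as a dictionary of resource demands are assumptions.

```python
from collections import Counter


def raylet_load(queued_ready_tasks):
    """Sum the per-resource demand over all queued ready tasks.

    Each task is modeled (as an assumption) by a dict mapping a
    resource name to the amount of that resource the task requires.
    """
    load = Counter()
    for demand in queued_ready_tasks:
        for resource, amount in demand.items():
            load[resource] += amount
    return dict(load)


# Two queued tasks: a plain task count would report a load of 2 and
# hide the fact that one task needs a GPU and the other needs 2 CPUs.
queue = [{"CPU": 1, "GPU": 1}, {"CPU": 2}]
load = raylet_load(queue)
```

Unlike a task count, this per-resource total distinguishes a queue of GPU-heavy tasks from a queue of CPU-only tasks, which is what makes the measure heterogeneity-aware.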
Yuhong Guo
eeb15771ba Add ray.internal.free (#2542) 2018-08-14 22:01:23 -07:00
Philipp Moritz
d8ba667175 Convert asserts in unittest to pytest (#2529) 2018-08-01 22:32:10 -07:00
Robert Nishihara
909d7172b1 Introduce constant for ID_SIZE in python code. (#2517) 2018-07-31 12:40:53 -07:00
Philipp Moritz
696a229ece Fix text verbosity in python 2.7 by running tests with pytest (#2470) 2018-07-30 11:04:06 -07:00
Hao Chen
05f485e274 Allow Ray API to be used from multiple threads (#2422) 2018-07-20 15:39:01 -07:00
Robert Nishihara
515da7721a Change ray.worker.cleanup -> ray.shutdown and improve API documentation. (#2374)
* Change ray.worker.cleanup -> ray.shutdown and improve API documentation.

* Deprecate ray.worker.cleanup() gracefully.

* Fix linting
2018-07-12 12:00:00 -07:00
Robert Nishihara
b90e551b41 [xray] Implement timeline and profiling API. (#2306)
* Add profile table and store profiling information there.

* Code for dumping timeline.

* Improve color scheme.

* Push timeline events on driver only for raylet.

* Improvements to profiling and timeline visualization

* Some linting

* Small fix.

* Linting

* Propagate node IP address through profiling events.

* Fix test.

* object_id.hex() should return byte string in python 2.

* Include gcs.fbs in node_manager.fbs.

* Remove flatbuffer definition duplication.

* Decode to unicode in Python 3 and bytes in Python 2.

* Minor

* Submit profile events in a batch. Revert some CMake changes.

* Fix

* Workaround test failure.

* Fix linting

* Linting

* Don't return anything from chrome_tracing_dump when filename is provided.

* Remove some redundancy from profile table.

* Linting

* Move TODOs out of docstring.

* Minor
2018-07-04 23:23:48 -07:00
Robert Nishihara
800f7cc77d Make actor handles work in Python mode. (#2283)
* Make actor handles work in local mode.

* Add test for actor handles in local mode.
2018-06-20 23:02:41 -07:00
Robert Nishihara
ff2217251f [xray] Add error table and push error messages to driver through node manager. (#2256)
* Fix documentation indentation.

* Add error table to GCS and push error messages through node manager.

* Add type to error data.

* Linting

* Fix failure_test bug.

* Linting.

* Enable one more test.

* Attempt to fix doc building.

* Restructuring

* Fixes

* More fixes.

* Move current_time_ms function into util.h.
2018-06-20 21:29:28 -07:00
Robert Nishihara
61139e1509 Enable fractional resources and resource IDs for xray. (#2187)
* Implement GPU IDs and fractional resources.

* Add documentation and python exceptions.

* Fix signed/unsigned comparison.

* Fix linting.

* Fixes from rebase.

* Re-enable tests that use ray.wait.

* Don't kill the raylet if an infeasible task is submitted.

* Ignore tests that require better load balancing.

* Linting

* Ignore array test.

* Ignore stress test reconstructions tests.

* Don't kill node manager if remote node manager disconnects.

* Ignore more stress tests.

* Naming changes

* Remove outdated todo

* Small fix

* Re-enable test.

* Linting

* Fix resource bookkeeping for blocked tasks.

* Fix linting

* Fix Java client.

* Ignore test

* Ignore put error tests
2018-06-10 15:31:43 -07:00
Melih Elibol
7246ff80a4 [xray] Implements ray.wait (#2162)
Implements ray.wait for xray. Fixes #1128.
2018-06-06 16:56:44 -07:00
Kunal Gosar
317d0da7d8 Add experimental API for ray.get and ray.wait with additional argument types (#2071) 2018-06-01 16:42:27 -07:00