mirror of https://github.com/vale981/ray synced 2025-03-17 16:46:39 -04:00
Commit graph

5941 commits

Author SHA1 Message Date
Stephanie Wang
4abafd7e62 Fix bug in ray.wait ()
ray.wait depends on callbacks from the GCS to decide when an object has appeared in the cluster. The raylet crashes if a callback is received for a wait request that has already completed, but this actually can happen, depending on the order of calls. More precisely:

1. Objects A and B are put in the cluster.
2. Client calls ray.wait([A, B], num_returns=1).
3. Client subscribes to locations for A and B. Locations are cached for both, so callbacks are posted for each.
4. Callback for A fires. The wait completes and the request is removed.
5. Callback for B fires. The wait request no longer exists and raylet crashes.
2018-12-01 19:40:33 -08:00
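The callback ordering described in commit 4abafd7e62 can be sketched in plain Python. This is a toy simulation, not the raylet's C++ code: `WaitManager`, `on_object_located`, and the request bookkeeping are hypothetical names introduced for illustration, and the commit message does not spell out the exact fix. The sketch shows one plausible way to tolerate the race: a location callback that arrives after its wait request has completed is ignored instead of being treated as fatal.

```python
class WaitManager:
    """Toy model of wait-request bookkeeping (hypothetical, for illustration)."""

    def __init__(self):
        # wait_id -> {"pending": set, "found": list, "num_returns": int}
        self.active = {}
        # wait_id -> list of object IDs handed back to the client
        self.completed = {}

    def wait(self, wait_id, object_ids, num_returns):
        self.active[wait_id] = {
            "pending": set(object_ids),
            "found": [],
            "num_returns": num_returns,
        }

    def on_object_located(self, wait_id, object_id):
        """Location callback. Returns True if applied, False if stale."""
        request = self.active.get(wait_id)
        if request is None:
            # The wait already completed and was removed. A late callback
            # is expected in this ordering, so ignore it rather than crash.
            return False
        if object_id in request["pending"]:
            request["pending"].remove(object_id)
            request["found"].append(object_id)
        if len(request["found"]) >= request["num_returns"]:
            self.completed[wait_id] = request["found"]
            del self.active[wait_id]
        return True


manager = WaitManager()
manager.wait("w1", ["A", "B"], num_returns=1)
# Locations for both objects are cached, so both callbacks are posted.
first = manager.on_object_located("w1", "A")   # completes the wait
second = manager.on_object_located("w1", "B")  # arrives after completion
```

Here `first` is applied and completes the request, while `second` finds no active request and is dropped, which mirrors steps 4 and 5 of the commit message without the crash.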
Eric Liang
13c8ce4d84 Update README.rst with 0.6.0 version number. () 2018-12-01 19:16:45 -08:00
Philipp Moritz
c5b5cdae33 Upgrade Arrow to include Plasma TensorFlow Op release fix ()
This includes a fix so the TensorFlow op releases memory properly (https://github.com/apache/arrow/pull/3061) and the possibility to store arrow data structures in plasma (https://github.com/apache/arrow/pull/2832).

https://github.com/ray-project/ray/issues/3404
2018-12-01 16:15:09 -08:00
Hao Chen
abd37df41e Add stress test for Java worker () 2018-12-01 16:11:09 -08:00
Robert Nishihara
0603e0b73a Bump version from 0.5.3 to 0.6.0. () 2018-12-01 11:39:36 -08:00
Devin Petersohn
57512616e1 Update readme to contain logo ()
* Adding logo to readme

* Updating link

* Add badge

* Addressing comments

* Moving logo

* Change align

* Move image
2018-11-30 18:28:35 -08:00
GiliR4t1qbit
454d3aa07d [docs] Snippet did not have a code-block tag above it () 2018-11-30 16:39:40 -08:00
Stephanie Wang
447604a9fe Use actor ID for the dummy object () 2018-11-29 22:31:04 -08:00
Eric Liang
07d8cbf414 [rllib] Support batch norm layers ()
* batch norm

* lint

* fix dqn/ddpg update ops

* bn model

* Update tf_policy_graph.py

* Update multi_gpu_impl.py

* Apply suggestions from code review

Co-Authored-By: ericl <ekhliang@gmail.com>
2018-11-29 13:33:39 -08:00
Devin Petersohn
4d2010a852 Ship Modin with Ray. () 2018-11-29 20:05:24 +01:00
Stephanie Wang
48a5935224 Fault tolerance for actor creation ()
* Add regression test

* Request actor creation if no actor location found

* Comments

* Address comments

* Increase test timeout

* Trigger test
2018-11-29 10:48:35 -08:00
Chunyang Wen
fd7e494344 Remove duplicate feed_dict construction () 2018-11-29 10:21:46 -08:00
Kristian Hartikainen
7e319dbf0c Automatically indent tune logger params () 2018-11-29 00:15:50 -08:00
Eric Liang
c46ea2ff4b Click 0.7 changes the naming convention for commands; fix this 2018-11-28 14:59:58 -08:00
Tianming Xu
139fbf7884 Initialize client_id_ in ObjectManager constructor that takes user-defined ObjectDirectory () 2018-11-27 23:51:18 -08:00
Robert Nishihara
82863b5251 [autoscaler] Update autoscaler to use heartbeat batches. () 2018-11-27 23:46:27 -08:00
Eric Liang
f0df97db6f [rllib] example and docs on how to use parametric actions with DQN / PG algorithms () 2018-11-27 23:35:19 -08:00
Eric Liang
c2108ca64f Don't put entire actor registry in debug string since it's too long () 2018-11-27 16:48:12 -08:00
Eric Liang
0d56fc10cc Move setproctitle to ray[debug] package () 2018-11-27 09:50:59 -08:00
Robert Nishihara
20b8b1d891 Add script for running stress tests. ()
* Add script for running stress tests.

* Add an actor tree test where actors die with some probability

* Improve test.

* Small fix

* Update tests.

* Minor change
2018-11-27 04:28:02 -08:00
Eric Liang
e3c088fa1e [rllib] PPO doesn't work with fractional num gpus ()
* frac ppo

* gpu test
2018-11-27 01:14:10 -08:00
Eric Liang
aa94d3dd50 [autoscaler] Allow more than 5s from node creation to first heartbeat () 2018-11-26 17:25:05 -08:00
Robert Nishihara
0f0099fb90 UI changes, fix the task timeline and add the object transfer timeline to UI. ()
* Saving

* Fix cmake and remove object/task search boxes.

* Add comment
2018-11-25 10:16:49 -08:00
Eric Liang
b85e7b43f3 [rllib] Refactor the sampler ()
* refactor

* fix test

* add perf test

* Update sampler.py
2018-11-24 18:16:54 -08:00
Robert Nishihara
3856533065 Fix incompatibility with most recent version of Redis. ()
* Fix incompatibility with most recent version of Redis.

* Fix

* Fixes.
2018-11-24 16:36:38 -08:00
Eric Liang
18a8dbfcfb [rllib] Clip DDPG ou-noise to avoid exceeding action bounds ()
Closes 
2018-11-24 00:56:50 -08:00
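The idea behind commit 18a8dbfcfb, keeping exploration noise from pushing an action outside its bounds, can be sketched as follows. This is a minimal illustration, not RLlib's implementation: `ou_step`, `noisy_action`, and the `theta` and `sigma` defaults are assumptions chosen for the example, and it assumes the clip is applied to the action after the noise is added.

```python
import random


def ou_step(prev_noise, theta=0.15, sigma=0.2, rng=random):
    """One discrete Ornstein-Uhlenbeck update with zero mean (illustrative parameters)."""
    return prev_noise - theta * prev_noise + sigma * rng.gauss(0.0, 1.0)


def noisy_action(action, noise, low, high):
    """Add exploration noise, then clip the result into [low, high]."""
    return max(low, min(high, action + noise))


rng = random.Random(0)
noise = 0.0
actions = []
for _ in range(1000):
    noise = ou_step(noise, rng=rng)
    # A policy output of 0.9 sits near the upper bound, so unclipped noise
    # could push it past 1.0.
    actions.append(noisy_action(0.9, noise, low=-1.0, high=1.0))
```

Every value in `actions` stays inside `[-1.0, 1.0]` regardless of how large the noise sample is.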
Eric Liang
55fca828ce [rllib] Fix use_lstm option when using custom model with dict space ()
## What do these changes do?

This passes in the right obs space to the lstm model wrapper, so that it doesn't attempt to un-flatten the already processed dict observation.

## Related issue number

Closes https://github.com/ray-project/ray/issues/3367
2018-11-23 22:51:08 -08:00
Eric Liang
8b76bab25c [rllib] docs for td3 ()
* td3 doc

* Update rllib-env.rst
2018-11-22 13:36:47 -08:00
Eric Liang
41b6b50d09 fix py3 () 2018-11-22 11:43:52 -08:00
GiliR4t1qbit
b9ae5edf74 When getting a role/profile, catch only the exception that indicates the role/profile already exists and allow others to be raised () 2018-11-22 09:42:58 -08:00
Jones Wong
24bfe8ab76 Enable Twin Delayed DDPG for RLlib DDPG agent () 2018-11-21 20:03:20 -08:00
Stephanie Wang
6b3236349c Fix memory leak in lineage cache ()
* Move children_ map inside Lineage

* Update lineage_cache.cc

* Test and fixes

* Remove unused
2018-11-21 16:18:39 -08:00
Richard Liaw
784a6399b0 [tune] Node Fault Tolerance ()
This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
 - Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.

## New behavior:
 - RUNNING trials will be resumed on another node on a best effort basis (meaning they will run if resources are available).
 - If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
 - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.


Remaining questions:
 - Should `last_result` be consistent during restore?
   Yes, but not for earlier trials (trials that are yet to be checkpointed).

 - Waiting for some PRs to merge first ()

Closes .
2018-11-21 12:38:16 -08:00
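The new behavior described in commit 784a6399b0 can be sketched as a small scheduling rule. This is a hedged illustration, not Tune's code: `Trial`, `recover_from_node_failure`, and the CPU-only resource model are hypothetical, and the `free_cpus` mapping is assumed to list surviving nodes only. RUNNING trials from the failed node are placed on a surviving node when it has room and are otherwise set to PENDING and queued.

```python
from dataclasses import dataclass
from typing import Optional

RUNNING, PENDING = "RUNNING", "PENDING"


@dataclass
class Trial:
    name: str
    cpus: int
    node: Optional[str]
    status: str = RUNNING


def recover_from_node_failure(trials, failed_node, free_cpus):
    """Reassign RUNNING trials from failed_node on a best-effort basis.

    free_cpus maps each surviving node to its free CPU count and is updated
    in place. Returns the trials that had to be queued.
    """
    queued = []
    for trial in trials:
        if trial.node != failed_node or trial.status != RUNNING:
            continue
        # Pick the first surviving node with enough free CPUs, if any.
        target = next(
            (node for node, free in free_cpus.items() if free >= trial.cpus),
            None,
        )
        if target is None:
            # Cluster is saturated: the trial becomes PENDING and waits.
            trial.status = PENDING
            trial.node = None
            queued.append(trial)
        else:
            free_cpus[target] -= trial.cpus
            trial.node = target
    return queued


trials = [
    Trial("t1", cpus=2, node="node-a"),
    Trial("t2", cpus=2, node="node-a"),
    Trial("t3", cpus=1, node="node-b"),
]
free = {"node-b": 3}
queued = recover_from_node_failure(trials, "node-a", free)
```

In this run `t1` moves to `node-b`, `t2` no longer fits and is queued as PENDING, and `t3` is untouched because its node did not fail.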
Stephanie Wang
3e33f6f71b Fix failure handling for actor death ()
* Broadcast actor death, clean up dummy objects

* Reduce logging and clean up state when failing a task

* lint

* Make actor failure test nicer, reduce node timeout
2018-11-21 12:26:22 -08:00
Philipp Moritz
1a926c9b7c Fix $MACOSX_DEPLOYMENT_TARGET () 2018-11-21 10:56:17 -08:00
Eric Liang
686cf20951 Remove uses of std::list::size ()
* worker pool and client conn

* Fix linting

* unordered set

* move
2018-11-20 14:47:55 -08:00
Richard Liaw
c24d87b4d1 [autoscaler] Submit command () 2018-11-20 14:03:34 -08:00
Philipp Moritz
d3697ce4e1 Ready queue refactor to make Dispatching tasks more efficient ()
* put queues outside

* working version, still needs to be optimized

* implement round robin

* proper round robin

* fix spillback

* update

* fix

* cleanup

* more cleanups

* fix

* fix

* add documentation

* explanation for hash combiner

* speed it up

* cleanup and linting

* linting

* comments

* Update scheduling_queue.h

* temp commit

* fixes

* update

* fix

* cleanup

* cleanup

* lint

* more prints

* more prints

* increase sleep

* documentation

* sleep

* fix

* fix

* sleep longer

* update

* fix

* fix

* fix

* Add ordered_set container.

* Fix

* Linting

* Constructors

* Remove O(n) call to list.size().

* fixes

* use ordered set

* Fix.

* Add documentation.

* Add iterators to ordered_set container implementation.

* iterator_type -> iterator

* Make typedefs private

* Add const_iterator

* fix

* fix test

* linting

* lint

* update

* add documentation

* linting
2018-11-20 13:14:12 -08:00
Ujval Misra
b0bfd104f2 Batch heartbeats from node manager together in the monitor. () 2018-11-20 09:52:27 -08:00
Eric Liang
abdc3b592e [rllib] Update multi-gpu impala numbers () 2018-11-19 20:55:27 -08:00
Eric Liang
5972c29d28 [rllib] Set ape-x local exploration to 0, also load explorations before training steps ()
## What do these changes do?

This should fix high explorations being used after restore / for rollouts.

## Related issue number

(dev list issue)
2018-11-19 20:36:25 -08:00
Eric Liang
afc48d7b77 Don't setpgid() on actors () 2018-11-19 17:35:26 -08:00
Robert Nishihara
f2b5500642 Add ordered_set container. ()
* Add ordered_set container.

* Fix

* Linting

* Constructors

* Remove O(n) call to list.size().

* Fix.

* Add documentation.

* Add iterators to ordered_set container implementation.

* iterator_type -> iterator

* Make typedefs private

* Add const_iterator
2018-11-19 17:01:18 -08:00
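The properties that commit f2b5500642 lists for the ordered_set container (insertion order, iteration, and a size query that avoids an O(n) `list.size()` call) can be illustrated with a Python analogue. This is only a sketch: the commit's container is C++ and its internals are not shown here, and `OrderedSet` and its methods are names chosen for this example. It relies on Python dictionaries preserving insertion order (guaranteed since Python 3.7).

```python
class OrderedSet:
    """Insertion-ordered set with O(1) membership, removal, and size."""

    def __init__(self, items=()):
        # Keys hold the elements; dict order is insertion order.
        self._items = {}
        for item in items:
            self.add(item)

    def add(self, item):
        # Adding an existing element keeps its original position.
        self._items.setdefault(item, None)

    def discard(self, item):
        self._items.pop(item, None)

    def __contains__(self, item):
        return item in self._items

    def __len__(self):
        # Constant time, unlike walking a linked list to count nodes.
        return len(self._items)

    def __iter__(self):
        return iter(self._items)


queue = OrderedSet(["task1", "task2", "task3"])
queue.add("task2")       # duplicate, position unchanged
queue.discard("task1")
queue.add("task4")
remaining = list(queue)
```

After these calls `remaining` holds the surviving elements in the order they were first inserted.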
Eric Liang
d4dbd27e0d Don't retry IPC connect an absurd number of times () 2018-11-19 16:23:59 -08:00
Eric Liang
e4bb5d8d16 Fix logging when ray cluster utils is used 2018-11-18 21:49:27 -08:00
Eric Liang
61e3bbbfee Update stale example links 2018-11-17 15:40:38 -08:00
Robert Nishihara
5cbc597494 Suppress duplicate pre-emptive object pushes. ()
* Suppress duplicate pre-emptive object pushes.

* Add test.

* Fix linting

* Remove timer and inline recent_pushes_ into local_objects_.

* Improve test.

* Fix

* Fix linting

* Enable retrying pull from same object manager. Randomize object manager.

* Speed up test

* Linting

* Add test.

* Minor

* Lengthen pull timeout and reissue pull every time a new object becomes available.

* Increase pull timeout in test.

* Wait for nodes to start in object manager test.

* Wait longer for nodes to start up in test.

* Small fixes.

* _submit -> _remote

* Change assert to warning.
2018-11-16 23:02:45 -08:00
Wenting Shen
ab1e0f5c2f support home path and relative path for temp-dir () 2018-11-16 17:41:10 -08:00
Robert Nishihara
60b22d9a72 Don't unsubscribe dependencies for infeasible tasks. ()
* Make scheduling queues RemoveTasks return task states as well.

* Add test

* Don't unsubscribe for infeasible tasks when spilling over.

* Linting

* Address comments.
2018-11-16 11:33:00 -08:00
Eric Liang
e0bf9d7305 Add debug string to raylet ()
* initial debug string

* format

* wip debug string

* fix compile

* fix

* update

* finished

* to file

* logs dir

* use temp root

* fix

* override
2018-11-15 21:47:50 -08:00