Kristian Hartikainen
be6567e6fd
Tweak/exec attach info ( #3447 )
...
* Add custom cluster name to exec info
* Update submit info to match exec info
2018-12-03 21:39:43 -08:00
Eric Liang
d8205976e8
[rllib] Auto clip actions to Box space range; deprecate squash_to_range ( #3426 )
...
* fix clip
* tweak wording
* remove squash entirely
* Update rllib-models.rst
* fix argument order
* Apply suggestions from code review
Co-Authored-By: ericl <ekhliang@gmail.com>
2018-12-03 19:55:25 -08:00
Eric Liang
7abfbfd2f7
[rllib] Better error message for unsupported non-atari image observation sizes ( #3444 )
2018-12-03 01:24:36 -08:00
Stephanie Wang
4abafd7e62
Fix bug in ray.wait ( #3445 )
...
ray.wait depends on callbacks from the GCS to decide when an object has appeared in the cluster. The raylet crashes if it receives a callback for a wait request that has already completed, which can happen depending on the order of calls. More precisely:
1. Objects A and B are put in the cluster.
2. Client calls ray.wait([A, B], num_returns=1).
3. Client subscribes to locations for A and B. Locations are cached for both, so callbacks are posted for each.
4. Callback for A fires. The wait completes and the request is removed.
5. Callback for B fires. The wait request no longer exists and raylet crashes.
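The sequence above can be reproduced in a small single-threaded simulation. This is an illustrative sketch only, not the raylet's C++ code: the `WaitManager` class, its methods, and the `guard` flag are hypothetical names, and the guard shown (ignoring callbacks whose wait request no longer exists) is one plausible fix, not necessarily the one this commit implements.

```python
class WaitManager:
    """Toy model of wait requests driven by object-location callbacks."""

    def __init__(self, guard):
        self.guard = guard    # if True, ignore callbacks for finished waits
        self.active = {}      # wait_id -> {"num_returns": int, "ready": list}
        self.completed = {}   # wait_id -> list of ready object IDs

    def wait(self, wait_id, object_ids, num_returns):
        self.active[wait_id] = {"num_returns": num_returns, "ready": []}
        # Locations are already cached, so one callback is posted per object.
        return [(wait_id, obj) for obj in object_ids]

    def on_location(self, wait_id, object_id):
        if wait_id not in self.active:
            if self.guard:
                return  # stale callback: the wait already completed
            raise RuntimeError("callback for unknown wait request")
        request = self.active[wait_id]
        request["ready"].append(object_id)
        if len(request["ready"]) >= request["num_returns"]:
            # The wait completes and the request is removed.
            self.completed[wait_id] = request["ready"]
            del self.active[wait_id]


def run(guard):
    manager = WaitManager(guard)
    callbacks = manager.wait("w1", ["A", "B"], num_returns=1)
    for wait_id, object_id in callbacks:
        manager.on_location(wait_id, object_id)
    return manager.completed["w1"]
```

Calling `run(guard=False)` raises on the callback for B, mirroring step 5, while `run(guard=True)` completes with only A in the result.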
2018-12-01 19:40:33 -08:00
Eric Liang
13c8ce4d84
Update README.rst with 0.6.0 version number. ( #3453 )
2018-12-01 19:16:45 -08:00
Philipp Moritz
c5b5cdae33
Upgrade Arrow to include Plasma TensorFlow Op release fix ( #3448 )
...
This includes a fix so that the TensorFlow op releases memory properly (https://github.com/apache/arrow/pull/3061) and adds the ability to store Arrow data structures in Plasma (https://github.com/apache/arrow/pull/2832).
https://github.com/ray-project/ray/issues/3404
2018-12-01 16:15:09 -08:00
Hao Chen
abd37df41e
Add stress test for Java worker ( #3424 )
2018-12-01 16:11:09 -08:00
Robert Nishihara
0603e0b73a
Bump version from 0.5.3 to 0.6.0. ( #3420 )
2018-12-01 11:39:36 -08:00
Devin Petersohn
57512616e1
Update readme to contain logo ( #3443 )
...
* Adding logo to readme
* Updating link
* Add badge
* Addressing comments
* Moving logo
* Change align
* Move image
2018-11-30 18:28:35 -08:00
GiliR4t1qbit
454d3aa07d
[docs] Snippet did not have a code-block tag above it ( #3442 )
2018-11-30 16:39:40 -08:00
Stephanie Wang
447604a9fe
Use actor ID for the dummy object ( #3437 )
2018-11-29 22:31:04 -08:00
Eric Liang
07d8cbf414
[rllib] Support batch norm layers ( #3369 )
...
* batch norm
* lint
* fix dqn/ddpg update ops
* bn model
* Update tf_policy_graph.py
* Update multi_gpu_impl.py
* Apply suggestions from code review
Co-Authored-By: ericl <ekhliang@gmail.com>
2018-11-29 13:33:39 -08:00
Devin Petersohn
4d2010a852
Ship Modin with Ray. ( #3109 )
2018-11-29 20:05:24 +01:00
Stephanie Wang
48a5935224
Fault tolerance for actor creation ( #3422 )
...
* Add regression test
* Request actor creation if no actor location found
* Comments
* Address comments
* Increase test timeout
* Trigger test
2018-11-29 10:48:35 -08:00
Chunyang Wen
fd7e494344
Remove: duplicate feed_dict constructing ( #3431 )
2018-11-29 10:21:46 -08:00
Kristian Hartikainen
7e319dbf0c
Automatically indent tune logger params ( #3399 )
2018-11-29 00:15:50 -08:00
Eric Liang
c46ea2ff4b
Click 0.7 changes the naming convention for commands; fix this
2018-11-28 14:59:58 -08:00
Tianming Xu
139fbf7884
Initialize client_id_ in ObjectManager constructor that takes user-defined ObjectDirectory ( #3403 )
2018-11-27 23:51:18 -08:00
Robert Nishihara
82863b5251
[autoscaler] Update autoscaler to use heartbeat batches. ( #3409 )
2018-11-27 23:46:27 -08:00
Eric Liang
f0df97db6f
[rllib] example and docs on how to use parametric actions with DQN / PG algorithms ( #3384 )
2018-11-27 23:35:19 -08:00
Eric Liang
c2108ca64f
Don't put entire actor registry in debug string since it's too long ( #3395 )
2018-11-27 16:48:12 -08:00
Eric Liang
0d56fc10cc
Move setproctitle to ray[debug] package ( #3415 )
2018-11-27 09:50:59 -08:00
Robert Nishihara
20b8b1d891
Add script for running stress tests. ( #3378 )
...
* Add script for running stress tests.
* Add an actor tree test where actors die with some probability
* Improve test.
* Small fix
* Update tests.
* Minor change
2018-11-27 04:28:02 -08:00
Eric Liang
e3c088fa1e
[rllib] PPO doesn't work with fractional num gpus ( #3396 )
...
* frac ppo
* gpu test
2018-11-27 01:14:10 -08:00
Eric Liang
aa94d3dd50
[autoscaler] Allow more than 5s from node creation to first heartbeat ( #3385 )
2018-11-26 17:25:05 -08:00
Robert Nishihara
0f0099fb90
UI changes, fix the task timeline and add the object transfer timeline to UI. ( #3397 )
...
* Saving
* Fix cmake and remove object/task search boxes.
* Add comment
2018-11-25 10:16:49 -08:00
Eric Liang
b85e7b43f3
[rllib] Refactor the sampler ( #3387 )
...
* refactor
* fix test
* add perf test
* Update sampler.py
2018-11-24 18:16:54 -08:00
Robert Nishihara
3856533065
Fix incompatibility with most recent version of Redis. ( #3379 )
...
* Fix incompatibility with most recent version of Redis.
* Fix
* Fixes.
2018-11-24 16:36:38 -08:00
Eric Liang
18a8dbfcfb
[rllib] Clip DDPG ou-noise to avoid exceeding action bounds ( #3386 )
...
Closes #2965
2018-11-24 00:56:50 -08:00
Eric Liang
55fca828ce
[rllib] Fix use_lstm option when using custom model with dict space ( #3368 )
...
## What do these changes do?
This passes the correct observation space to the LSTM model wrapper, so that it does not attempt to un-flatten the already-processed dict observation.
## Related issue number
Closes https://github.com/ray-project/ray/issues/3367
2018-11-23 22:51:08 -08:00
Eric Liang
8b76bab25c
[rllib] docs for td3 ( #3381 )
...
* td3 doc
* Update rllib-env.rst
2018-11-22 13:36:47 -08:00
Eric Liang
41b6b50d09
fix py3 ( #3382 )
2018-11-22 11:43:52 -08:00
GiliR4t1qbit
b9ae5edf74
When getting a role/profile, catch only exception that indicates the role/profile already exists, allow others to be raised ( #3383 )
2018-11-22 09:42:58 -08:00
Jones Wong
24bfe8ab76
Enable Twin Delayed DDPG for RLlib DDPG agent ( #3353 )
2018-11-21 20:03:20 -08:00
Stephanie Wang
6b3236349c
Fix memory leak in lineage cache ( #3366 )
...
* Move children_ map inside Lineage
* Update lineage_cache.cc
* Test and fixes
* Remove unused
2018-11-21 16:18:39 -08:00
Richard Liaw
784a6399b0
[tune] Node Fault Tolerance ( #3238 )
...
This PR introduces single-node fault tolerance for Tune.
## Previous behavior:
- Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.
## New behavior:
- RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available).
- If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
- During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.
Remaining questions:
- Should `last_result` be consistent during restore?
Yes, but not for earlier trials (trials that have not yet been checkpointed).
- Waiting for some PRs to merge first (#3239)
Closes #2851.
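The new behavior can be sketched as a small rescheduling routine. This is a hedged illustration under simplifying assumptions, not Tune's implementation: `Trial`, `handle_node_failure`, and the CPU-only resource model are hypothetical, checkpoint restore is omitted, and the returned list merely stands in for the `trial_runner.stop_trial` notifications.

```python
PENDING, RUNNING = "PENDING", "RUNNING"


class Trial:
    def __init__(self, name, node, cpus):
        self.name = name
        self.node = node
        self.cpus = cpus
        self.status = RUNNING


def handle_node_failure(trials, failed_node, free_cpus):
    """Reschedule RUNNING trials that were placed on failed_node.

    free_cpus maps each surviving node to its spare CPUs (an assumed,
    simplified resource model). Returns the names of the affected trials
    so that schedulers and search algorithms can be notified.
    """
    notified = []
    for trial in trials:
        if trial.node != failed_node or trial.status != RUNNING:
            continue
        notified.append(trial.name)  # stands in for trial_runner.stop_trial
        # Best effort: pick the first surviving node with enough spare CPUs.
        target = next(
            (node for node, free in free_cpus.items() if free >= trial.cpus),
            None,
        )
        if target is None:
            # Cluster is saturated: queue the trial instead of resuming it.
            trial.status, trial.node = PENDING, None
        else:
            free_cpus[target] -= trial.cpus
            trial.node = target
    return notified
```

With two 2-CPU trials on a failed node and 3 spare CPUs elsewhere, the first trial resumes on the surviving node and the second becomes PENDING.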
2018-11-21 12:38:16 -08:00
Stephanie Wang
3e33f6f71b
Fix failure handling for actor death ( #3359 )
...
* Broadcast actor death, clean up dummy objects
* Reduce logging and clean up state when failing a task
* lint
* Make actor failure test nicer, reduce node timeout
2018-11-21 12:26:22 -08:00
Philipp Moritz
1a926c9b7c
Fix $MACOSX_DEPLOYMENT_TARGET ( #3337 )
2018-11-21 10:56:17 -08:00
Eric Liang
686cf20951
Remove uses of std::list::size ( #3358 )
...
* worker pool and client conn
* Fix linting
* unordered set
* move
2018-11-20 14:47:55 -08:00
Richard Liaw
c24d87b4d1
[autoscaler] Submit command ( #3312 )
2018-11-20 14:03:34 -08:00
Philipp Moritz
d3697ce4e1
Ready queue refactor to make Dispatching tasks more efficient ( #3324 )
...
* put queues outside
* working version, still needs to be optimized
* implement round robin
* proper round robin
* fix spillback
* update
* fix
* cleanup
* more cleanups
* fix
* fix
* add documentation
* explanation for hash combiner
* speed it up
* cleanup and linting
* linting
* comments
* Update scheduling_queue.h
* temp commit
* fixes
* update
* fix
* cleanup
* cleanup
* lint
* more prints
* more prints
* increase sleep
* documentation
* sleep
* fix
* fix
* sleep longer
* update
* fix
* fix
* fix
* Add ordered_set container.
* Fix
* Linting
* Constructors
* Remove O(n) call to list.size().
* fixes
* use ordered set
* Fix.
* Add documentation.
* Add iterators to ordered_set container implementation.
* iterator_type -> iterator
* Make typedefs private
* Add const_iterator
* fix
* fix test
* linting
* lint
* update
* add documentation
* linting
2018-11-20 13:14:12 -08:00
Ujval Misra
b0bfd104f2
Batch heartbeats from node manager together in the monitor. ( #3011 )
2018-11-20 09:52:27 -08:00
Eric Liang
abdc3b592e
[rllib] Update multi-gpu impala numbers ( #3327 )
2018-11-19 20:55:27 -08:00
Eric Liang
5972c29d28
[rllib] Set ape-x local exploration to 0, also load explorations before training steps ( #3349 )
...
## What do these changes do?
This should fix high exploration values being used after a restore or during rollouts.
## Related issue number
(dev list issue)
2018-11-19 20:36:25 -08:00
Eric Liang
afc48d7b77
Don't setpgid() on actors ( #3347 )
2018-11-19 17:35:26 -08:00
Robert Nishihara
f2b5500642
Add ordered_set container. ( #3352 )
...
* Add ordered_set container.
* Fix
* Linting
* Constructors
* Remove O(n) call to list.size().
* Fix.
* Add documentation.
* Add iterators to ordered_set container implementation.
* iterator_type -> iterator
* Make typedefs private
* Add const_iterator
2018-11-19 17:01:18 -08:00
Eric Liang
d4dbd27e0d
Don't retry IPC connect an absurd number of times ( #3355 )
2018-11-19 16:23:59 -08:00
Eric Liang
e4bb5d8d16
Fix logging when ray cluster utils is used
2018-11-18 21:49:27 -08:00
Eric Liang
61e3bbbfee
Update stale example links
2018-11-17 15:40:38 -08:00
Robert Nishihara
5cbc597494
Suppress duplicate pre-emptive object pushes. ( #3276 )
...
* Suppress duplicate pre-emptive object pushes.
* Add test.
* Fix linting
* Remove timer and inline recent_pushes_ into local_objects_.
* Improve test.
* Fix
* Fix linting
* Enable retrying pull from same object manager. Randomize object manager.
* Speed up test
* Linting
* Add test.
* Minor
* Lengthen pull timeout and reissue pull every time a new object becomes available.
* Increase pull timeout in test.
* Wait for nodes to start in object manager test.
* Wait longer for nodes to start up in test.
* Small fixes.
* _submit -> _remote
* Change assert to warning.
2018-11-16 23:02:45 -08:00