Eric Liang
07d8cbf414
[rllib] Support batch norm layers ( #3369 )
* batch norm
* lint
* fix dqn/ddpg update ops
* bn model
* Update tf_policy_graph.py
* Update multi_gpu_impl.py
* Apply suggestions from code review
Co-Authored-By: ericl <ekhliang@gmail.com>
2018-11-29 13:33:39 -08:00
Devin Petersohn
4d2010a852
Ship Modin with Ray. ( #3109 )
2018-11-29 20:05:24 +01:00
Stephanie Wang
48a5935224
Fault tolerance for actor creation ( #3422 )
* Add regression test
* Request actor creation if no actor location found
* Comments
* Address comments
* Increase test timeout
* Trigger test
2018-11-29 10:48:35 -08:00
Chunyang Wen
fd7e494344
Remove: duplicate feed_dict constructing ( #3431 )
2018-11-29 10:21:46 -08:00
Kristian Hartikainen
7e319dbf0c
Automatically indent tune logger params ( #3399 )
2018-11-29 00:15:50 -08:00
Eric Liang
c46ea2ff4b
Click 0.7 changes the naming convention for commands; fix this
2018-11-28 14:59:58 -08:00
Tianming Xu
139fbf7884
Initialize client_id_ in ObjectManager constructor that takes user-defined ObjectDirectory ( #3403 )
2018-11-27 23:51:18 -08:00
Robert Nishihara
82863b5251
[autoscaler] Update autoscaler to use heartbeat batches. ( #3409 )
2018-11-27 23:46:27 -08:00
Eric Liang
f0df97db6f
[rllib] example and docs on how to use parametric actions with DQN / PG algorithms ( #3384 )
2018-11-27 23:35:19 -08:00
Eric Liang
c2108ca64f
Don't put entire actor registry in debug string since it's too long ( #3395 )
2018-11-27 16:48:12 -08:00
Eric Liang
0d56fc10cc
Move setproctitle to ray[debug] package ( #3415 )
2018-11-27 09:50:59 -08:00
Robert Nishihara
20b8b1d891
Add script for running stress tests. ( #3378 )
* Add script for running stress tests.
* Add an actor tree test where actors die with some probability
* Improve test.
* Small fix
* Update tests.
* Minor change
2018-11-27 04:28:02 -08:00
Eric Liang
e3c088fa1e
[rllib] PPO doesn't work with fractional num gpus ( #3396 )
* frac ppo
* gpu test
2018-11-27 01:14:10 -08:00
Eric Liang
aa94d3dd50
[autoscaler] Allow more than 5s from node creation to first heartbeat ( #3385 )
2018-11-26 17:25:05 -08:00
Robert Nishihara
0f0099fb90
UI changes, fix the task timeline and add the object transfer timeline to UI. ( #3397 )
* Saving
* Fix cmake and remove object/task search boxes.
* Add comment
2018-11-25 10:16:49 -08:00
Eric Liang
b85e7b43f3
[rllib] Refactor the sampler ( #3387 )
* refactor
* fix test
* add perf test
* Update sampler.py
2018-11-24 18:16:54 -08:00
Robert Nishihara
3856533065
Fix incompatibility with most recent version of Redis. ( #3379 )
* Fix incompatibility with most recent version of Redis.
* Fix
* Fixes.
2018-11-24 16:36:38 -08:00
Eric Liang
18a8dbfcfb
[rllib] Clip DDPG ou-noise to avoid exceeding action bounds ( #3386 )
Closes #2965
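The commit title describes the idea but not the code. As a minimal sketch (not the RLlib implementation), clipping Ornstein-Uhlenbeck exploration noise could look like the following. The function names and the `theta`, `sigma`, and `mu` defaults are assumptions chosen for illustration.

```python
import random


def ou_noise_step(prev_noise, theta=0.15, sigma=0.2, mu=0.0):
    """One Ornstein-Uhlenbeck step: pull toward mu, then add a Gaussian kick.

    theta, sigma and mu are illustrative defaults, not values from the commit.
    """
    return prev_noise + theta * (mu - prev_noise) + sigma * random.gauss(0.0, 1.0)


def noisy_action(action, noise, low, high):
    """Add per-dimension noise, then clip each dimension back into [low, high]."""
    return [
        min(max(a + n, lo), hi)
        for a, n, lo, hi in zip(action, noise, low, high)
    ]


# Without the clip, 0.9 + 0.5 would leave the [-1, 1] action range.
clipped = noisy_action([0.9, -0.9], [0.5, -0.5], [-1.0, -1.0], [1.0, 1.0])
```

Clipping after the noise is added keeps the executed action inside the bounds no matter how large the noise sample is.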
2018-11-24 00:56:50 -08:00
Eric Liang
55fca828ce
[rllib] Fix use_lstm option when using custom model with dict space ( #3368 )
## What do these changes do?
This passes in the right obs space to the lstm model wrapper, so that it doesn't attempt to un-flatten the already processed dict observation.
## Related issue number
Closes https://github.com/ray-project/ray/issues/3367
2018-11-23 22:51:08 -08:00
Eric Liang
8b76bab25c
[rllib] docs for td3 ( #3381 )
* td3 doc
* Update rllib-env.rst
2018-11-22 13:36:47 -08:00
Eric Liang
41b6b50d09
fix py3 ( #3382 )
2018-11-22 11:43:52 -08:00
GiliR4t1qbit
b9ae5edf74
When getting a role/profile, catch only the exception that indicates the role/profile already exists; allow others to be raised ( #3383 )
2018-11-22 09:42:58 -08:00
Jones Wong
24bfe8ab76
Enable Twin Delayed DDPG for RLlib DDPG agent ( #3353 )
2018-11-21 20:03:20 -08:00
Stephanie Wang
6b3236349c
Fix memory leak in lineage cache ( #3366 )
* Move children_ map inside Lineage
* Update lineage_cache.cc
* Test and fixes
* Remove unused
2018-11-21 16:18:39 -08:00
Richard Liaw
784a6399b0
[tune] Node Fault Tolerance ( #3238 )
This PR introduces single-node fault tolerance for Tune.
## Previous behavior:
- Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.
## New behavior:
- RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available).
- If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
- During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.
Remaining questions:
- Should `last_result` be consistent during restore?
Yes; but not for earlier trials (trials that are yet to be checkpointed).
- Waiting for some PRs to merge first (#3239 )
Closes #2851 .
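The new behavior described above can be sketched as a small state transition. This is an illustration only, not Tune's code: the trial dictionaries, the `free_slots` map, and the function name are all hypothetical.

```python
RUNNING, PENDING = "RUNNING", "PENDING"


def recover_trials(trials, failed_node, free_slots):
    """Move RUNNING trials off a failed node, or queue them as PENDING.

    trials: list of dicts with "status" and "node" keys (hypothetical shape).
    free_slots: dict mapping a healthy node to how many more trials it can take.
    """
    for trial in trials:
        if trial["status"] != RUNNING or trial["node"] != failed_node:
            continue  # trial was not affected by the failure
        target = next((n for n, k in free_slots.items() if k > 0), None)
        if target is None:
            # Cluster is saturated: queue the trial instead of restarting it.
            trial["status"], trial["node"] = PENDING, None
        else:
            # Best effort: resume on another node that still has room.
            free_slots[target] -= 1
            trial["node"] = target
    return trials
```

The point of the check is the one the commit message makes: a trial is only restarted when resources are known to be available, rather than unconditionally.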
2018-11-21 12:38:16 -08:00
Stephanie Wang
3e33f6f71b
Fix failure handling for actor death ( #3359 )
* Broadcast actor death, clean up dummy objects
* Reduce logging and clean up state when failing a task
* lint
* Make actor failure test nicer, reduce node timeout
2018-11-21 12:26:22 -08:00
Philipp Moritz
1a926c9b7c
Fix $MACOSX_DEPLOYMENT_TARGET ( #3337 )
2018-11-21 10:56:17 -08:00
Eric Liang
686cf20951
Remove uses of std::list::size ( #3358 )
* worker pool and client conn
* Fix linting
* unordered set
* move
2018-11-20 14:47:55 -08:00
Richard Liaw
c24d87b4d1
[autoscaler] Submit command ( #3312 )
2018-11-20 14:03:34 -08:00
Philipp Moritz
d3697ce4e1
Ready queue refactor to make Dispatching tasks more efficient ( #3324 )
* put queues outside
* working version, still needs to be optimized
* implement round robin
* proper round robin
* fix spillback
* update
* fix
* cleanup
* more cleanups
* fix
* fix
* add documentation
* explanation for hash combiner
* speed it up
* cleanup and linting
* linting
* comments
* Update scheduling_queue.h
* temp commit
* fixes
* update
* fix
* cleanup
* cleanup
* lint
* more prints
* more prints
* increase sleep
* documentation
* sleep
* fix
* fix
* sleep longer
* update
* fix
* fix
* fix
* Add ordered_set container.
* Fix
* Linting
* Constructors
* Remove O(n) call to list.size().
* fixes
* use ordered set
* Fix.
* Add documentation.
* Add iterators to ordered_set container implementation.
* iterator_type -> iterator
* Make typedefs private
* Add const_iterator
* fix
* fix test
* linting
* lint
* update
* add documentation
* linting
2018-11-20 13:14:12 -08:00
Ujval Misra
b0bfd104f2
Batch heartbeats from node manager together in the monitor. ( #3011 )
2018-11-20 09:52:27 -08:00
Eric Liang
abdc3b592e
[rllib] Update multi-gpu impala numbers ( #3327 )
2018-11-19 20:55:27 -08:00
Eric Liang
5972c29d28
[rllib] Set ape-x local exploration to 0, also load explorations before training steps ( #3349 )
## What do these changes do?
This should fix high explorations being used after restore / for rollouts.
## Related issue number
(dev list issue)
2018-11-19 20:36:25 -08:00
Eric Liang
afc48d7b77
Don't setpgid() on actors ( #3347 )
2018-11-19 17:35:26 -08:00
Robert Nishihara
f2b5500642
Add ordered_set container. ( #3352 )
* Add ordered_set container.
* Fix
* Linting
* Constructors
* Remove O(n) call to list.size().
* Fix.
* Add documentation.
* Add iterators to ordered_set container implementation.
* iterator_type -> iterator
* Make typedefs private
* Add const_iterator
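For readers unfamiliar with the idea, here is a rough Python analogue of an ordered set: constant-time membership, removal, and size, with iteration in insertion order. It is not a port of the C++ container added in this commit, whose internals are not shown here; the class and method names are illustrative.

```python
class OrderedSet:
    """Set that remembers insertion order and reports its size in O(1)."""

    def __init__(self):
        # Python dicts preserve insertion order (guaranteed since 3.7).
        self._items = {}

    def add(self, key):
        """Insert key at the end unless it is already present."""
        self._items.setdefault(key, None)

    def discard(self, key):
        """Remove key if present; do nothing otherwise."""
        self._items.pop(key, None)

    def __contains__(self, key):
        return key in self._items

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        return iter(self._items)


queue = OrderedSet()
for task_id in ["t1", "t2", "t3", "t2"]:
    queue.add(task_id)
queue.discard("t1")
```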
2018-11-19 17:01:18 -08:00
Eric Liang
d4dbd27e0d
Don't retry IPC connect an absurd number of times ( #3355 )
2018-11-19 16:23:59 -08:00
Eric Liang
e4bb5d8d16
Fix logging when ray cluster utils is used
2018-11-18 21:49:27 -08:00
Eric Liang
61e3bbbfee
Update stale example links
2018-11-17 15:40:38 -08:00
Robert Nishihara
5cbc597494
Suppress duplicate pre-emptive object pushes. ( #3276 )
* Suppress duplicate pre-emptive object pushes.
* Add test.
* Fix linting
* Remove timer and inline recent_pushes_ into local_objects_.
* Improve test.
* Fix
* Fix linting
* Enable retrying pull from same object manager. Randomize object manager.
* Speed up test
* Linting
* Add test.
* Minor
* Lengthen pull timeout and reissue pull every time a new object becomes available.
* Increase pull timeout in test.
* Wait for nodes to start in object manager test.
* Wait longer for nodes to start up in test.
* Small fixes.
* _submit -> _remote
* Change assert to warning.
2018-11-16 23:02:45 -08:00
Wenting Shen
ab1e0f5c2f
support home path and relative path for temp-dir ( #3329 )
2018-11-16 17:41:10 -08:00
Robert Nishihara
60b22d9a72
Don't unsubscribe dependencies for infeasible tasks. ( #3338 )
* Make scheduling queues RemoveTasks return task states as well.
* Add test
* Don't unsubscribe for infeasible tasks when spilling over.
* Linting
* Address comments.
2018-11-16 11:33:00 -08:00
Eric Liang
e0bf9d7305
Add debug string to raylet ( #3317 )
* initial debug string
* format
* wip debug string
* fix compile
* fix
* update
* finished
* to file
* logs dir
* use temp root
* fix
* override
2018-11-15 21:47:50 -08:00
Robert Nishihara
d10cb570ab
Rename _submit -> _remote. ( #3321 )
2018-11-15 15:30:18 -08:00
Robert Nishihara
98edf752a9
Note requirement cython==0.27.3 in installation instructions. ( #3322 )
2018-11-15 15:27:19 -08:00
Philipp Moritz
1be1455d86
Fix redis crash when duplicate messages are appended to log. ( #3316 )
2018-11-15 15:09:39 -08:00
Eric Liang
5723291db6
Raise exception if the node is nearly out of memory ( #3323 )
* wip
* add
* comment
* escape hatch
* update
* object store too
* .2
2018-11-15 12:55:25 -08:00
Philipp Moritz
b6a12d1f97
Fix socket retry message ( #3325 )
2018-11-15 12:14:19 -08:00
Lewis Belcher
5319fd044c
Update redis version in setup.py ( #3333 )
* `redis` has released a new version (https://github.com/andymccurdy/redis-py/releases/tag/3.0.0 )
* `ray` is not compatible with this version
* This PR adds the "compatible release" operator for `redis` version 2.10.6.
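The "compatible release" operator is `~=` from PEP 440, so the requirement described above would read `redis~=2.10.6`, which is equivalent to `>=2.10.6, ==2.10.*`. The helper below is a rough sketch of that rule for plain dotted numeric versions only; the function names are hypothetical, and real tools handle many more version forms.

```python
def parse(version):
    """Split a plain dotted version such as '2.10.6' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_compatible_release(candidate, base):
    """Approximate `candidate ~= base` for plain numeric versions.

    The candidate must be at least `base` and must share every component
    of `base` except the last one.
    """
    cand, ref = parse(candidate), parse(base)
    return cand >= ref and cand[: len(ref) - 1] == ref[:-1]


accepts_patch = is_compatible_release("2.10.7", "2.10.6")
rejects_major = is_compatible_release("3.0.0", "2.10.6")
```

Under this rule, later 2.10.x patch releases are still accepted, while the incompatible 3.0.0 release is excluded.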
2018-11-15 10:40:08 -08:00
Eric Liang
706dc1d473
[rllib] Add test for multi-agent support and fix IMPALA multi-agent ( #3289 )
IMPALA support for multi-agent was broken because IMPALA requires batches to be of a fixed length, whereas multi-agent envs can create variable-length batches.
Fix this by adding zero-padding as needed (similar to the RNN case).
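A minimal sketch of the zero-padding idea, assuming a flat list of values and a known target length. The function name, the mask, and the list-based representation are illustrative assumptions, not the RLlib code.

```python
def pad_batch(values, target_len, pad_value=0):
    """Right-pad a variable-length batch to target_len.

    Returns the padded list plus a mask with 1 for real entries and 0 for
    padding, so later steps can ignore the padded positions.
    """
    if len(values) > target_len:
        raise ValueError("batch is longer than the target length")
    n_pad = target_len - len(values)
    padded = list(values) + [pad_value] * n_pad
    mask = [1] * len(values) + [0] * n_pad
    return padded, mask


padded, mask = pad_batch([5, 7, 9], target_len=5)
```

Returning a mask alongside the padded data is a common companion to padding; whether the actual fix uses one is not stated in the commit message.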
2018-11-14 14:14:07 -08:00
andrewztan
57c7b4238e
KL Divergence Metrics ( #3300 )
* added KL divergence metrics
* fix
2018-11-13 23:12:35 -08:00