2244 commits

Author SHA1 Message Date
Kristian Hartikainen
be6567e6fd Tweak/exec attach info (#3447)
* Add custom cluster name to exec info

* Update submit info to match exec info
2018-12-03 21:39:43 -08:00
Eric Liang
d8205976e8
[rllib] Auto clip actions to Box space range; deprecate squash_to_range (#3426)
* fix clip

* tweak wording

* remove squash entirely

* Update rllib-models.rst

* fix argument order

* Apply suggestions from code review

Co-Authored-By: ericl <ekhliang@gmail.com>
2018-12-03 19:55:25 -08:00
Eric Liang
7abfbfd2f7
[rllib] Better error message for unsupported non-atari image observation sizes (#3444) 2018-12-03 01:24:36 -08:00
Stephanie Wang
4abafd7e62 Fix bug in ray.wait (#3445)
ray.wait depends on callbacks from the GCS to decide when an object has appeared in the cluster. The raylet crashes if a callback is received for a wait request that has already completed, but this actually can happen, depending on the order of calls. More precisely:

1. Objects A and B are put in the cluster.
2. Client calls ray.wait([A, B], num_returns=1).
3. Client subscribes to locations for A and B. Locations are cached for both, so callbacks are posted for each.
4. Callback for A fires. The wait completes and the request is removed.
5. Callback for B fires. The wait request no longer exists and the raylet crashes.
2018-12-01 19:40:33 -08:00
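The ordering described in 4abafd7e62 can be reproduced with a toy model. This is a minimal sketch in plain Python, not the raylet code: `WaitTable`, `wait`, and `on_object_located` are hypothetical names. Ignoring a late callback is only one plausible way to tolerate step 5, since the log does not show the actual change made in #3445.

```python
class WaitTable:
    """Toy model of in-flight wait requests (hypothetical, not Ray's raylet code)."""

    def __init__(self):
        self.active = {}     # wait_id -> request state
        self.completed = {}  # wait_id -> object IDs returned to the client

    def wait(self, wait_id, object_ids, num_returns):
        # Register the request; one location callback is expected per object.
        self.active[wait_id] = {
            "pending": set(object_ids),
            "found": [],
            "num_returns": num_returns,
        }

    def on_object_located(self, wait_id, object_id):
        request = self.active.get(wait_id)
        if request is None:
            # Step 5: the wait already completed and was removed.
            # Ignore the late callback instead of treating it as fatal.
            return False
        if object_id not in request["pending"]:
            return False
        request["pending"].discard(object_id)
        request["found"].append(object_id)
        if len(request["found"]) >= request["num_returns"]:
            # Step 4: enough objects are available, so complete and remove.
            self.completed[wait_id] = request["found"]
            del self.active[wait_id]
        return True


table = WaitTable()
table.wait("w1", ["A", "B"], num_returns=1)
handled_a = table.on_object_located("w1", "A")  # completes the wait
handled_b = table.on_object_located("w1", "B")  # late callback, ignored
```

The key point is that the lookup for the request must handle a missing entry, because both callbacks were posted before either one ran.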
Eric Liang
13c8ce4d84 Update README.rst with 0.6.0 version number. (#3453) 2018-12-01 19:16:45 -08:00
Philipp Moritz
c5b5cdae33 Upgrade Arrow to include Plasma TensorFlow Op release fix (#3448)
This includes a fix so the TensorFlow op releases memory properly (https://github.com/apache/arrow/pull/3061) and the possibility to store arrow data structures in plasma (https://github.com/apache/arrow/pull/2832).

https://github.com/ray-project/ray/issues/3404
2018-12-01 16:15:09 -08:00
Hao Chen
abd37df41e Add stress test for Java worker (#3424) 2018-12-01 16:11:09 -08:00
Robert Nishihara
0603e0b73a Bump version from 0.5.3 to 0.6.0. (#3420) 2018-12-01 11:39:36 -08:00
Devin Petersohn
57512616e1 Update readme to contain logo (#3443)
* Adding logo to readme

* Updating link

* Add badge

* Addressing comments

* Moving logo

* Change align

* Move image
2018-11-30 18:28:35 -08:00
GiliR4t1qbit
454d3aa07d [docs] Snippet did not have a code-block tag above it (#3442) 2018-11-30 16:39:40 -08:00
Stephanie Wang
447604a9fe Use actor ID for the dummy object (#3437) 2018-11-29 22:31:04 -08:00
Eric Liang
07d8cbf414
[rllib] Support batch norm layers (#3369)
* batch norm

* lint

* fix dqn/ddpg update ops

* bn model

* Update tf_policy_graph.py

* Update multi_gpu_impl.py

* Apply suggestions from code review

Co-Authored-By: ericl <ekhliang@gmail.com>
2018-11-29 13:33:39 -08:00
Devin Petersohn
4d2010a852 Ship Modin with Ray. (#3109) 2018-11-29 20:05:24 +01:00
Stephanie Wang
48a5935224 Fault tolerance for actor creation (#3422)
* Add regression test

* Request actor creation if no actor location found

* Comments

* Address comments

* Increase test timeout

* Trigger test
2018-11-29 10:48:35 -08:00
Chunyang Wen
fd7e494344 Remove: duplicate feed_dict constructing (#3431) 2018-11-29 10:21:46 -08:00
Kristian Hartikainen
7e319dbf0c Automatically indent tune logger params (#3399) 2018-11-29 00:15:50 -08:00
Eric Liang
c46ea2ff4b
Click 0.7 changes the naming convention for commands; fix this 2018-11-28 14:59:58 -08:00
Tianming Xu
139fbf7884 Initialize client_id_ in ObjectManager constructor that takes user-defined ObjectDirectory (#3403) 2018-11-27 23:51:18 -08:00
Robert Nishihara
82863b5251
[autoscaler] Update autoscaler to use heartbeat batches. (#3409) 2018-11-27 23:46:27 -08:00
Eric Liang
f0df97db6f
[rllib] example and docs on how to use parametric actions with DQN / PG algorithms (#3384) 2018-11-27 23:35:19 -08:00
Eric Liang
c2108ca64f Don't put entire actor registry in debug string since it's too long (#3395) 2018-11-27 16:48:12 -08:00
Eric Liang
0d56fc10cc Move setproctitle to ray[debug] package (#3415) 2018-11-27 09:50:59 -08:00
Robert Nishihara
20b8b1d891 Add script for running stress tests. (#3378)
* Add script for running stress tests.

* Add an actor tree test where actors die with some probability

* Improve test.

* Small fix

* Update tests.

* Minor change
2018-11-27 04:28:02 -08:00
Eric Liang
e3c088fa1e
[rllib] PPO doesn't work with fractional num gpus (#3396)
* frac ppo

* gpu test
2018-11-27 01:14:10 -08:00
Eric Liang
aa94d3dd50 [autoscaler] Allow more than 5s from node creation to first heartbeat (#3385) 2018-11-26 17:25:05 -08:00
Robert Nishihara
0f0099fb90 UI changes, fix the task timeline and add the object transfer timeline to UI. (#3397)
* Saving

* Fix cmake and remove object/task search boxes.

* Add comment
2018-11-25 10:16:49 -08:00
Eric Liang
b85e7b43f3
[rllib] Refactor the sampler (#3387)
* refactor

* fix test

* add perf test

* Update sampler.py
2018-11-24 18:16:54 -08:00
Robert Nishihara
3856533065 Fix incompatibility with most recent version of Redis. (#3379)
* Fix incompatibility with most recent version of Redis.

* Fix

* Fixes.
2018-11-24 16:36:38 -08:00
Eric Liang
18a8dbfcfb [rllib] Clip DDPG ou-noise to avoid exceeding action bounds (#3386)
Closes #2965
2018-11-24 00:56:50 -08:00
Eric Liang
55fca828ce [rllib] Fix use_lstm option when using custom model with dict space (#3368)
## What do these changes do?

This passes in the right obs space to the lstm model wrapper, so that it doesn't attempt to un-flatten the already processed dict observation.

## Related issue number

Closes https://github.com/ray-project/ray/issues/3367
2018-11-23 22:51:08 -08:00
Eric Liang
8b76bab25c
[rllib] docs for td3 (#3381)
* td3 doc

* Update rllib-env.rst
2018-11-22 13:36:47 -08:00
Eric Liang
41b6b50d09 fix py3 (#3382) 2018-11-22 11:43:52 -08:00
GiliR4t1qbit
b9ae5edf74 When getting a role/profile, catch only exception that indicates the role/profile already exists, allow others to be raised (#3383) 2018-11-22 09:42:58 -08:00
Jones Wong
24bfe8ab76 Enable Twin Delayed DDPG for RLlib DDPG agent (#3353) 2018-11-21 20:03:20 -08:00
Stephanie Wang
6b3236349c
Fix memory leak in lineage cache (#3366)
* Move children_ map inside Lineage

* Update lineage_cache.cc

* Test and fixes

* Remove unused
2018-11-21 16:18:39 -08:00
Richard Liaw
784a6399b0
[tune] Node Fault Tolerance (#3238)
This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
 - Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.

## New behavior:
 - RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available).
 - If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
 - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.


Remaining questions:
 - Should `last_result` be consistent during restore? Yes, but not for earlier trials (trials that are yet to be checkpointed).
 - Waiting for some PRs to merge first (#3239)

Closes #2851.
2018-11-21 12:38:16 -08:00
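The new behavior described in 784a6399b0 can be illustrated with a small state-transition sketch. It is written under stated assumptions and does not use Tune's API: the `recover_trials` function, the dict layout of a trial, and the `free_slots` map are all hypothetical. It models only the RUNNING to RUNNING-elsewhere or RUNNING to PENDING decision.

```python
def recover_trials(trials, failed_node, free_slots):
    """Reassign or requeue trials that were RUNNING on failed_node.

    trials: list of dicts with "name", "status", "node" (hypothetical layout).
    free_slots: dict mapping node name -> number of trials it can still host.
    """
    for trial in trials:
        if trial["status"] != "RUNNING" or trial["node"] != failed_node:
            continue  # unaffected by the failure
        # Best effort: pick any surviving node that still has capacity.
        target = next(
            (node for node, slots in free_slots.items()
             if node != failed_node and slots > 0),
            None,
        )
        if target is None:
            # Cluster is saturated: queue the trial instead of restarting it.
            trial["status"] = "PENDING"
            trial["node"] = None
        else:
            free_slots[target] -= 1
            trial["node"] = target
    return trials


trials = [
    {"name": "t1", "status": "RUNNING", "node": "node-a"},
    {"name": "t2", "status": "RUNNING", "node": "node-a"},
    {"name": "t3", "status": "RUNNING", "node": "node-b"},
]
result = recover_trials(
    trials, failed_node="node-a", free_slots={"node-a": 2, "node-b": 1}
)
```

With one free slot on `node-b`, the first affected trial moves there and the second is queued as PENDING, which matches the saturated-cluster case in the commit message.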
Stephanie Wang
3e33f6f71b
Fix failure handling for actor death (#3359)
* Broadcast actor death, clean up dummy objects

* Reduce logging and clean up state when failing a task

* lint

* Make actor failure test nicer, reduce node timeout
2018-11-21 12:26:22 -08:00
Philipp Moritz
1a926c9b7c Fix $MACOSX_DEPLOYMENT_TARGET (#3337) 2018-11-21 10:56:17 -08:00
Eric Liang
686cf20951 Remove uses of std::list::size (#3358)
* worker pool and client conn

* Fix linting

* unordered set

* move
2018-11-20 14:47:55 -08:00
Richard Liaw
c24d87b4d1 [autoscaler] Submit command (#3312) 2018-11-20 14:03:34 -08:00
Philipp Moritz
d3697ce4e1
Ready queue refactor to make Dispatching tasks more efficient (#3324)
* put queues outside

* working version, still needs to be optimized

* implement round robin

* proper round robin

* fix spillback

* update

* fix

* cleanup

* more cleanups

* fix

* fix

* add documentation

* explanation for hash combiner

* speed it up

* cleanup and linting

* linting

* comments

* Update scheduling_queue.h

* temp commit

* fixes

* update

* fix

* cleanup

* cleanup

* lint

* more prints

* more prints

* increase sleep

* documentation

* sleep

* fix

* fix

* sleep longer

* update

* fix

* fix

* fix

* Add ordered_set container.

* Fix

* Linting

* Constructors

* Remove O(n) call to list.size().

* fixes

* use ordered set

* Fix.

* Add documentation.

* Add iterators to ordered_set container implementation.

* iterator_type -> iterator

* Make typedefs private

* Add const_iterator

* fix

* fix test

* linting

* lint

* update

* add documentation

* linting
2018-11-20 13:14:12 -08:00
Ujval Misra
b0bfd104f2 Batch heartbeats from node manager together in the monitor. (#3011) 2018-11-20 09:52:27 -08:00
Eric Liang
abdc3b592e
[rllib] Update multi-gpu impala numbers (#3327) 2018-11-19 20:55:27 -08:00
Eric Liang
5972c29d28 [rllib] Set ape-x local exploration to 0, also load explorations before training steps (#3349)
## What do these changes do?

This should fix high explorations being used after restore / for rollouts.

## Related issue number

(dev list issue)
2018-11-19 20:36:25 -08:00
Eric Liang
afc48d7b77 Don't setpgid() on actors (#3347) 2018-11-19 17:35:26 -08:00
Robert Nishihara
f2b5500642 Add ordered_set container. (#3352)
* Add ordered_set container.

* Fix

* Linting

* Constructors

* Remove O(n) call to list.size().

* Fix.

* Add documentation.

* Add iterators to ordered_set container implementation.

* iterator_type -> iterator

* Make typedefs private

* Add const_iterator
2018-11-19 17:01:18 -08:00
Eric Liang
d4dbd27e0d Don't retry IPC connect an absurd number of times (#3355) 2018-11-19 16:23:59 -08:00
Eric Liang
e4bb5d8d16
Fix logging when ray cluster utils is used 2018-11-18 21:49:27 -08:00
Eric Liang
61e3bbbfee
Update stale example links 2018-11-17 15:40:38 -08:00
Robert Nishihara
5cbc597494 Suppress duplicate pre-emptive object pushes. (#3276)
* Suppress duplicate pre-emptive object pushes.

* Add test.

* Fix linting

* Remove timer and inline recent_pushes_ into local_objects_.

* Improve test.

* Fix

* Fix linting

* Enable retrying pull from same object manager. Randomize object manager.

* Speed up test

* Linting

* Add test.

* Minor

* Lengthen pull timeout and reissue pull every time a new object becomes available.

* Increase pull timeout in test.

* Wait for nodes to start in object manager test.

* Wait longer for nodes to start up in test.

* Small fixes.

* _submit -> _remote

* Change assert to warning.
2018-11-16 23:02:45 -08:00