2322 commits

Author SHA1 Message Date
Robert Nishihara
20b8b1d891 Add script for running stress tests. (#3378)
* Add script for running stress tests.

* Add an actor tree test where actors die with some probability

* Improve test.

* Small fix

* Update tests.

* Minor change
2018-11-27 04:28:02 -08:00
Eric Liang
e3c088fa1e [rllib] PPO doesn't work with fractional num gpus (#3396)
* frac ppo

* gpu test
2018-11-27 01:14:10 -08:00
Eric Liang
aa94d3dd50 [autoscaler] Allow more than 5s from node creation to first heartbeat (#3385) 2018-11-26 17:25:05 -08:00
Robert Nishihara
0f0099fb90 UI changes, fix the task timeline and add the object transfer timeline to UI. (#3397)
* Saving

* Fix cmake and remove object/task search boxes.

* Add comment
2018-11-25 10:16:49 -08:00
Eric Liang
b85e7b43f3 [rllib] Refactor the sampler (#3387)
* refactor

* fix test

* add perf test

* Update sampler.py
2018-11-24 18:16:54 -08:00
Robert Nishihara
3856533065 Fix incompatibility with most recent version of Redis. (#3379)
* Fix incompatibility with most recent version of Redis.

* Fix

* Fixes.
2018-11-24 16:36:38 -08:00
Eric Liang
18a8dbfcfb [rllib] Clip DDPG ou-noise to avoid exceeding action bounds (#3386)
Closes #2965
2018-11-24 00:56:50 -08:00
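Note on 18a8dbfcfb: the commit title does not say how the clipping is applied. The following is a minimal sketch of one plausible reading, in which Ornstein-Uhlenbeck noise is added to the policy's action and the sum is clipped to the action bounds. The function names and the `theta`/`sigma` defaults are illustrative assumptions, not RLlib's code.

```python
import random


def ou_step(prev_noise, theta=0.15, sigma=0.2, mu=0.0):
    """One Ornstein-Uhlenbeck update: drift toward mu plus Gaussian noise.

    theta, sigma and mu are assumed defaults chosen for illustration.
    """
    return prev_noise + theta * (mu - prev_noise) + sigma * random.gauss(0.0, 1.0)


def noisy_action(action, noise, low, high):
    """Add exploration noise per dimension, then clip to [low, high]."""
    return [
        min(max(a + n, lo), hi)
        for a, n, lo, hi in zip(action, noise, low, high)
    ]


# Example: a 2-D action near the bounds with large noise stays in range.
clipped = noisy_action([0.9, -0.9], [0.5, -0.5], [-1.0, -1.0], [1.0, 1.0])
```

Without the clip, the first component above would be 1.4, which lies outside the assumed bounds of [-1, 1].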
Eric Liang
55fca828ce [rllib] Fix use_lstm option when using custom model with dict space (#3368)
## What do these changes do?

This passes in the right obs space to the lstm model wrapper, so that it doesn't attempt to un-flatten the already processed dict observation.

## Related issue number

Closes https://github.com/ray-project/ray/issues/3367
2018-11-23 22:51:08 -08:00
Eric Liang
8b76bab25c [rllib] docs for td3 (#3381)
* td3 doc

* Update rllib-env.rst
2018-11-22 13:36:47 -08:00
Eric Liang
41b6b50d09 fix py3 (#3382) 2018-11-22 11:43:52 -08:00
GiliR4t1qbit
b9ae5edf74 When getting a role/profile, catch only the exception that indicates the role/profile already exists; allow others to be raised (#3383) 2018-11-22 09:42:58 -08:00
Jones Wong
24bfe8ab76 Enable Twin Delayed DDPG for RLlib DDPG agent (#3353) 2018-11-21 20:03:20 -08:00
Stephanie Wang
6b3236349c Fix memory leak in lineage cache (#3366)
* Move children_ map inside Lineage

* Update lineage_cache.cc

* Test and fixes

* Remove unused
2018-11-21 16:18:39 -08:00
Richard Liaw
784a6399b0 [tune] Node Fault Tolerance (#3238)
This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
 - Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.

## New behavior:
 - RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available).
 - If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
 - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.


Remaining questions:
 -  Should `last_result` be consistent during restore?
Yes; but not for earlier trials (trials that are yet to be checkpointed).

 - Waiting for some PRs to merge first (#3239)

Closes #2851.
2018-11-21 12:38:16 -08:00
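Note on 784a6399b0: the new behavior described above (resume on another node if resources allow, otherwise fall back to PENDING) can be sketched roughly as follows. This is a simplified illustration, not Tune's implementation: `Trial`, `handle_node_failure`, and the `free_cpus` bookkeeping are hypothetical names, and notifying schedulers and search algorithms is left out.

```python
PENDING, RUNNING = "PENDING", "RUNNING"


class Trial:
    """Hypothetical stand-in for a trial: a name, a node, and a CPU demand."""

    def __init__(self, name, node, cpus):
        self.name, self.node, self.cpus = name, node, cpus
        self.status = RUNNING


def handle_node_failure(trials, failed_node, free_cpus):
    """Re-place RUNNING trials from failed_node, or queue them as PENDING.

    free_cpus maps each surviving node to its free CPU count and is
    updated in place as trials are resumed.
    """
    for trial in trials:
        if trial.status != RUNNING or trial.node != failed_node:
            continue  # trial is unaffected by this failure
        # Best effort: take the first surviving node with enough free CPUs.
        target = next(
            (node for node, free in free_cpus.items() if free >= trial.cpus),
            None,
        )
        if target is None:
            # Cluster is saturated: queue the trial instead of restarting it.
            trial.status, trial.node = PENDING, None
        else:
            free_cpus[target] -= trial.cpus
            trial.node = target
    return trials
```

With two 2-CPU trials on the failed node and only 2 free CPUs elsewhere, the first trial is resumed and the second is queued as PENDING.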
Stephanie Wang
3e33f6f71b Fix failure handling for actor death (#3359)
* Broadcast actor death, clean up dummy objects

* Reduce logging and clean up state when failing a task

* lint

* Make actor failure test nicer, reduce node timeout
2018-11-21 12:26:22 -08:00
Philipp Moritz
1a926c9b7c Fix $MACOSX_DEPLOYMENT_TARGET (#3337) 2018-11-21 10:56:17 -08:00
Eric Liang
686cf20951 Remove uses of std::list::size (#3358)
* worker pool and client conn

* Fix linting

* unordered set

* move
2018-11-20 14:47:55 -08:00
Richard Liaw
c24d87b4d1 [autoscaler] Submit command (#3312) 2018-11-20 14:03:34 -08:00
Philipp Moritz
d3697ce4e1 Ready queue refactor to make dispatching tasks more efficient (#3324)
* put queues outside

* working version, still needs to be optimized

* implement round robin

* proper round robin

* fix spillback

* update

* fix

* cleanup

* more cleanups

* fix

* fix

* add documentation

* explanation for hash combiner

* speed it up

* cleanup and linting

* linting

* comments

* Update scheduling_queue.h

* temp commit

* fixes

* update

* fix

* cleanup

* cleanup

* lint

* more prints

* more prints

* increase sleep

* documentation

* sleep

* fix

* fix

* sleep longer

* update

* fix

* fix

* fix

* Add ordered_set container.

* Fix

* Linting

* Constructors

* Remove O(n) call to list.size().

* fixes

* use ordered set

* Fix.

* Add documentation.

* Add iterators to ordered_set container implementation.

* iterator_type -> iterator

* Make typedefs private

* Add const_iterator

* fix

* fix test

* linting

* lint

* update

* add documentation

* linting
2018-11-20 13:14:12 -08:00
Ujval Misra
b0bfd104f2 Batch heartbeats from node manager together in the monitor. (#3011) 2018-11-20 09:52:27 -08:00
Eric Liang
abdc3b592e [rllib] Update multi-gpu impala numbers (#3327) 2018-11-19 20:55:27 -08:00
Eric Liang
5972c29d28 [rllib] Set ape-x local exploration to 0, also load explorations before training steps (#3349)
## What do these changes do?

This should fix high explorations being used after restore / for rollouts.

## Related issue number

(dev list issue)
2018-11-19 20:36:25 -08:00
Eric Liang
afc48d7b77 Don't setpgid() on actors (#3347) 2018-11-19 17:35:26 -08:00
Robert Nishihara
f2b5500642 Add ordered_set container. (#3352)
* Add ordered_set container.

* Fix

* Linting

* Constructors

* Remove O(n) call to list.size().

* Fix.

* Add documentation.

* Add iterators to ordered_set container implementation.

* iterator_type -> iterator

* Make typedefs private

* Add const_iterator
2018-11-19 17:01:18 -08:00
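Note on f2b5500642: the commit adds a C++ `ordered_set` container and removes an O(n) `list.size()` call. The idea of such a container (insertion-ordered iteration with constant-time membership, removal, and size) can be sketched in Python as below. The class and method names are assumptions for illustration and do not mirror the C++ interface.

```python
class OrderedSet:
    """Set that iterates in insertion order, with O(1) add/remove/len."""

    def __init__(self):
        # A dict keeps insertion order (Python 3.7+) and hashes its keys,
        # so it plays the role of both the list and the lookup table.
        self._items = {}

    def add(self, value):
        # Re-adding an existing value keeps its original position.
        self._items.setdefault(value, None)

    def remove(self, value):
        del self._items[value]  # raises KeyError if value is absent

    def __contains__(self, value):
        return value in self._items

    def __len__(self):
        return len(self._items)  # constant time, no traversal

    def __iter__(self):
        return iter(self._items)


queue = OrderedSet()
for task_id in (3, 1, 2):
    queue.add(task_id)
queue.remove(1)
```

After these calls, iterating over `queue` yields 3 and then 2, and `len(queue)` is 2.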
Eric Liang
d4dbd27e0d Don't retry IPC connect an absurd number of times (#3355) 2018-11-19 16:23:59 -08:00
Eric Liang
e4bb5d8d16 Fix logging when ray cluster utils is used 2018-11-18 21:49:27 -08:00
Eric Liang
61e3bbbfee Update stale example links 2018-11-17 15:40:38 -08:00
Robert Nishihara
5cbc597494 Suppress duplicate pre-emptive object pushes. (#3276)
* Suppress duplicate pre-emptive object pushes.

* Add test.

* Fix linting

* Remove timer and inline recent_pushes_ into local_objects_.

* Improve test.

* Fix

* Fix linting

* Enable retrying pull from same object manager. Randomize object manager.

* Speed up test

* Linting

* Add test.

* Minor

* Lengthen pull timeout and reissue pull every time a new object becomes available.

* Increase pull timeout in test.

* Wait for nodes to start in object manager test.

* Wait longer for nodes to start up in test.

* Small fixes.

* _submit -> _remote

* Change assert to warning.
2018-11-16 23:02:45 -08:00
Wenting Shen
ab1e0f5c2f support home path and relative path for temp-dir (#3329) 2018-11-16 17:41:10 -08:00
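Note on ab1e0f5c2f: supporting home-relative and relative paths usually comes down to normalizing the user-supplied value into an absolute path. A minimal sketch using only the standard library follows. `resolve_temp_dir` is a hypothetical helper name, and whether Ray resolves the path this way is an assumption.

```python
import os


def resolve_temp_dir(path):
    """Expand a leading '~' and turn a relative path into an absolute one."""
    return os.path.abspath(os.path.expanduser(path))


# A relative path is resolved against the current working directory.
relative = resolve_temp_dir("tmp/ray_tmp")
# A leading '~' is replaced by the user's home directory when it is known.
home_based = resolve_temp_dir("~/ray_tmp")
```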
Robert Nishihara
60b22d9a72 Don't unsubscribe dependencies for infeasible tasks. (#3338)
* Make scheduling queues RemoveTasks return task states as well.

* Add test

* Don't unsubscribe for infeasible tasks when spilling over.

* Linting

* Address comments.
2018-11-16 11:33:00 -08:00
Eric Liang
e0bf9d7305 Add debug string to raylet (#3317)
* initial debug string

* format

* wip debug string

* fix compile

* fix

* update

* finished

* to file

* logs dir

* use temp root

* fix

* override
2018-11-15 21:47:50 -08:00
Robert Nishihara
d10cb570ab Rename _submit -> _remote. (#3321) 2018-11-15 15:30:18 -08:00
Robert Nishihara
98edf752a9 Note requirement cython==0.27.3 in installation instructions. (#3322) 2018-11-15 15:27:19 -08:00
Philipp Moritz
1be1455d86 Fix redis crash when duplicate messages are appended to log. (#3316) 2018-11-15 15:09:39 -08:00
Eric Liang
5723291db6 Raise exception if the node is nearly out of memory (#3323)
* wip

* add

* comment

* escape hatch

* update

* object store too

* .2
2018-11-15 12:55:25 -08:00
Philipp Moritz
b6a12d1f97 Fix socket retry message (#3325) 2018-11-15 12:14:19 -08:00
Lewis Belcher
5319fd044c Update redis version in setup.py (#3333)
* `redis` has released a new version (https://github.com/andymccurdy/redis-py/releases/tag/3.0.0)
* `ray` is not compatible with this version
* This PR adds the "compatible release" operator for `redis` version 2.10.6.
2018-11-15 10:40:08 -08:00
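Note on 5319fd044c: the "compatible release" operator is `~=` (PEP 440), so the requirement presumably reads `redis~=2.10.6`, which is equivalent to `>= 2.10.6, == 2.10.*`. The sketch below spells out that rule for plain dotted-integer versions only. `compatible_with` is an illustrative helper, not part of pip or setuptools, and it ignores pre-releases and other PEP 440 forms.

```python
def compatible_with(version, base="2.10.6"):
    """Return True if `version` satisfies `~=base`.

    Handles only plain dotted integers such as "2.10.6".
    """
    v = tuple(int(part) for part in version.split("."))
    b = tuple(int(part) for part in base.split("."))
    # ~=X.Y.Z means: at least X.Y.Z, and the same X.Y prefix.
    return v >= b and v[: len(b) - 1] == b[:-1]


accepts_patch = compatible_with("2.10.9")  # later patch of the 2.10 series
rejects_major = compatible_with("3.0.0")   # the incompatible new release
```

Under this rule a later 2.10.x patch release is accepted, while 2.11.0 and 3.0.0 are both rejected.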
Eric Liang
706dc1d473 [rllib] Add test for multi-agent support and fix IMPALA multi-agent (#3289)
IMPALA support for multi-agent was broken because IMPALA requires batch sizes to be of a certain length, whereas multi-agent envs can create variable-length batches.

Fix this by adding zero-padding as needed (similar to the RNN case).
2018-11-14 14:14:07 -08:00
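Note on 706dc1d473: the zero-padding fix can be illustrated with a small sketch. It assumes "a certain length" means a multiple of a fixed fragment length, which the commit message does not state. `pad_to_multiple` is a hypothetical helper, not RLlib's function, and a real implementation would likely also need to track which entries are padding.

```python
def pad_to_multiple(batch, length, pad_value=0):
    """Right-pad `batch` so its size is a multiple of `length`."""
    remainder = len(batch) % length
    if remainder == 0:
        return list(batch)  # already aligned, return a copy unchanged
    return list(batch) + [pad_value] * (length - remainder)


# A 5-step batch from one agent, padded for an assumed fragment length of 4.
padded = pad_to_multiple([1, 2, 3, 4, 5], 4)
```

Here the 5-element batch grows to 8 elements, with three trailing zeros.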
andrewztan
57c7b4238e KL Divergence Metrics (#3300)
* added KL divergence metrics

* fix
2018-11-13 23:12:35 -08:00
Eric Liang
1660c9d627 Kill actor child processes on shutdown (#3297)
* example

* add env

* test pg

* change to test

* add atexit test

* Update rllib-env.rst

* comment

* revert unnecessary file

* fix title when actor is idle

* Update python/ray/actor.py

Co-Authored-By: ericl <ekhliang@gmail.com>
2018-11-13 19:16:42 -08:00
Stephanie Wang
577c1dda74 Release sender connections as soon as WriteMessageAsync completes (#3313) 2018-11-13 21:32:24 -05:00
Wang Qing
9d4847ad2d [hot-fix] Fix error when calling Ray.init() twice. (#3314) 2018-11-13 21:21:54 -05:00
Eric Liang
65c27c70cf [rllib] Clean up agent resource configurations (#3296)
Closes #3284
2018-11-13 18:00:03 -08:00
Philipp Moritz
d4fad222e1 Update profiling instructions for raylet (#3311) 2018-11-13 17:48:33 -05:00
Richard Liaw
97f423781b Clean up Ray processes after cluster util exits (#3278) 2018-11-13 13:18:12 -08:00
Richard Liaw
c3a2c7ebed [tune] Doc: Autofilled, StatusReporter (#3294)
* autofill and revise doc page for things

* lint

* comments
2018-11-13 13:15:56 -08:00
Eric Liang
6ee7a3b571 [rllib] Raise worker TF intra_op threads to 2, lower driver intra_op threads to 8 (#3299) 2018-11-13 11:41:58 -08:00
Richard Liaw
c0423db05c [core] Add Global State Test for multi-node setting (#3239)
* add test for adding node

* multinode test fixes

* First pass at allowing updatable values

* Fix compilation issues

* Add config file parsing

* Full initialization

* Wrote a good test

* configuration parsing and stuff

* docs

* write some tests, make it good

* fixed init

* Add all config options and bring back stress tests.

* Update python/ray/worker.py

* Update python/ray/worker.py

* Fix internalization

* some last changes

* Linting and Java fix

* add docstring

* Fix test, add assertions

* pytest ext

* lint

* lint
2018-11-13 10:35:24 -08:00
Eric Liang
d90f365394 [rllib] Add self-supervised loss to model (#3291)
# What do these changes do?

Allow self-supervised losses to be easily defined in custom models. Add this to the reference policy graphs.
2018-11-12 18:55:24 -08:00
Philipp Moritz
ce6e01b988 enable incremental builds (#3292) 2018-11-12 21:49:09 -05:00