528 commits

Author SHA1 Message Date
Eric Liang
303883a3b6 [rllib] [rfc] add contrib module and guideline for merging (#3565)
This adds guidelines for merging code into `rllib/contrib` vs `rllib/agents`. It also cleans up the agent import code to make registration easier.
2018-12-20 10:44:34 -08:00
Eric Liang
ffa6ee3ec8
[rllib] streaming minibatching for IMPALA (#3402)
* mb impala

* fix

* paropt

* update

* cpu warn

* on cpu

* fix mb

* doc

* docs

* comment

* larger num

* early release

* remove grad clip

* only check loader count in multi gpu mode

* revert bad multigpu changes

* num sgd iter

* comment

* reuse optimizer

* add test

* par load test

* loosen test

* Update run_multi_node_tests.sh

* fix local mode

* Update agent.py
2018-12-19 02:23:29 -08:00
Yuhong Guo
75ddf7cca4 Fix 2 small bugs (#3573) 2018-12-18 14:52:21 -05:00
Eric Liang
db0dee573e
[rllib] Q-Mix implementation (Q-Mix, VDN, IQN, and Ape-X variants) (#3548) 2018-12-18 10:40:01 -08:00
Philipp Moritz
b3bf608608 Update arrow to reduce plasma IPCs. (#3497) 2018-12-14 23:49:37 -05:00
Stephanie Wang
fcc37021b2
Throw exception for ray.get of an evicted actor object (#3490)
* Add a flag for whether an object has been created before

* Add regression test

* doc

* Share object directory between object and node managers

* Treat evicted actor tasks as failed

* minor

* Check return value

* Fix bug where object locations weren't getting updated on client death

* Fix mac build

* Use RayTaskError
2018-12-14 11:41:27 -08:00
Yuhong Guo
a4abe6c0fe Add test to test raylet client connection when raylet crashes. (#3518) 2018-12-13 23:40:50 -08:00
Hao Chen
e7b51cbd1b [xray] Implement Actor Reconstruction (#3332)
* Implement Actor Reconstruction

* fix

* fix actor handle __del__

* fix lint

* add comment

* Remove actorCreationDummyObjectId

* address comments

* fix

* address comments

* avoid copy

* change log to debug

* fix error name
2018-12-13 21:28:58 -08:00
Si-Yuan
84fae57ab5 Convert the raylet client (the code in local_scheduler_client.cc) to proper C++. (#3511)
* refactoring

* fix bugs

* create client class

* create client class for java; bug fix

* remove legacy code

* improve code by using std::string and std::unique_ptr, renaming private fields, and removing legacy code

* rename class

* improve naming

* fix

* rename files

* fix names

* change name

* change return types

* make a mutex private field

* fix comments

* fix bugs

* lint

* bug fix

* bug fix

* move too short functions into the header file

* Loosen crash conditions for some APIs.

* Apply suggestions from code review

Co-Authored-By: suquark <suquark@gmail.com>

* format

* update

* rename python APIs

* fix java

* more fixes

* change types of cpython interface

* more fixes

* improve error processing

* improve error processing for java wrapper

* lint

* fix java

* make fields const

* use pointers for [out] parameters

* fix java & error msg

* fix resource leak, etc.
2018-12-13 13:39:10 -08:00
Eric Liang
0e00533ed4
Different approach to removing RayGetError (#3471) 2018-12-12 20:30:51 -08:00
Eric Liang
32473cf22e
[rllib] Basic Offline Data IO API (#3473) 2018-12-12 13:57:48 -08:00
Eric Liang
59f4743f20
[rllib] Run simple regressions tests for all algs in jenkins (#3498) 2018-12-11 17:21:53 -08:00
Richard Liaw
e0fbb68e47
[tune] Custom Logging, Trial Name (#3465)
Adds support for custom loggers, custom trial strings, and custom sync commands. Closes #3034, #2985, and #3390.
2018-12-11 13:41:59 -08:00
Yuhong Guo
abd781d607 Make stress test time shorter. (#3506) 2018-12-10 14:46:40 -05:00
Eric Liang
ce388a45cf
[rllib] Learner should not see clipped actions (#3496) 2018-12-09 21:57:11 -08:00
Eric Liang
cffe8f9806 Add option to evict keys LRU from the sharded redis tables (#3499)
* wip

* wip

* format

* wip

* note

* lint

* fix

* flag

* typo

* raise timeout

* fix

* optional get

* fix flag

* increase timeout in test

* update docs

* format
2018-12-09 05:48:52 -08:00
Yuhong Guo
b9e1977fae Fix failure of test_free_objects_multi_node (#3481)
`test_free_objects_multi_node` can fail intermittently. If we run this test 20 times, we may find at least one failure.

The cause is that the test is based on function tasks. One raylet may create more than one worker to execute the tasks, so the flush operations may be spread across several workers and may not clean up all the worker objects held by the plasma client.

This PR changes the function tasks to actor tasks, which guarantees that all the tasks are executed in one worker of a raylet.
2018-12-06 15:55:49 -05:00
shane
7a79b7f62c increase container memory and shm to 20G (#3475)
* increase container memory and shm to 20G

* variables are POWERFUL
2018-12-05 14:59:07 -08:00
Si-Yuan
2e6f9bedf2 Add the extra fallback for serialization (#3468)
* Add the extra fallback for serialization.

* Better comments & warnings. quotes.

* Update test/runtest.py

Co-Authored-By: suquark <suquark@gmail.com>

* Update test/runtest.py

Co-Authored-By: suquark <suquark@gmail.com>

* linting

* Don't hijack too many errors.

* simplify the test

* Update runtest.py

* simplify
2018-12-05 13:09:08 -08:00
Philipp Moritz
06f6431765 Make test_actor_multiple_gpus_from_multiple_tasks less stressful in travis 2018-12-04 17:44:33 -08:00
Eric Liang
13c8ce4d84 Update README.rst with 0.6.0 version number. (#3453) 2018-12-01 19:16:45 -08:00
Eric Liang
07d8cbf414
[rllib] Support batch norm layers (#3369)
* batch norm

* lint

* fix dqn/ddpg update ops

* bn model

* Update tf_policy_graph.py

* Update multi_gpu_impl.py

* Apply suggestions from code review

Co-Authored-By: ericl <ekhliang@gmail.com>
2018-11-29 13:33:39 -08:00
Stephanie Wang
48a5935224 Fault tolerance for actor creation (#3422)
* Add regression test

* Request actor creation if no actor location found

* Comments

* Address comments

* Increase test timeout

* Trigger test
2018-11-29 10:48:35 -08:00
Robert Nishihara
82863b5251
[autoscaler] Update autoscaler to use heartbeat batches. (#3409) 2018-11-27 23:46:27 -08:00
Eric Liang
f0df97db6f
[rllib] example and docs on how to use parametric actions with DQN / PG algorithms (#3384) 2018-11-27 23:35:19 -08:00
Robert Nishihara
20b8b1d891 Add script for running stress tests. (#3378)
* Add script for running stress tests.

* Add an actor tree test where actors die with some probability

* Improve test.

* Small fix

* Update tests.

* Minor change
2018-11-27 04:28:02 -08:00
Eric Liang
e3c088fa1e
[rllib] PPO doesn't work with fractional num gpus (#3396)
* frac ppo

* gpu test
2018-11-27 01:14:10 -08:00
Robert Nishihara
3856533065 Fix incompatibility with most recent version of Redis. (#3379)
* Fix incompatibility with most recent version of Redis.

* Fix

* Fixes.
2018-11-24 16:36:38 -08:00
Eric Liang
55fca828ce [rllib] Fix use_lstm option when using custom model with dict space (#3368)
## What do these changes do?

This passes in the right obs space to the lstm model wrapper, so that it doesn't attempt to un-flatten the already processed dict observation.

## Related issue number

Closes https://github.com/ray-project/ray/issues/3367
2018-11-23 22:51:08 -08:00
Stephanie Wang
6b3236349c
Fix memory leak in lineage cache (#3366)
* Move children_ map inside Lineage

* Update lineage_cache.cc

* Test and fixes

* Remove unused
2018-11-21 16:18:39 -08:00
Richard Liaw
784a6399b0
[tune] Node Fault Tolerance (#3238)
This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
 - Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.

## New behavior:
 - RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available).
 - If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
 - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.


Remaining questions:
 -  Should `last_result` be consistent during restore?
Yes; but not for earlier trials (trials that are yet to be checkpointed).

 - Waiting for some PRs to merge first (#3239)

Closes #2851.
2018-11-21 12:38:16 -08:00
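The recovery rule described in the Tune fault-tolerance commit above (#3238) can be sketched roughly as follows. This is a minimal, self-contained illustration and not Tune's actual code: the `Trial` dataclass, the `recover_trials` function, and the CPU-only resource model are all hypothetical names and simplifications introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

RUNNING = "RUNNING"
PENDING = "PENDING"


@dataclass
class Trial:
    # Hypothetical stand-in for a Tune trial; only tracks CPU needs.
    name: str
    cpus: int
    node: Optional[str]
    status: str = RUNNING


def recover_trials(
    trials: List[Trial], failed_node: str, free_cpus: Dict[str, int]
) -> List[Trial]:
    """Reassign trials that were RUNNING on `failed_node`.

    `free_cpus` maps each surviving node to its free CPU count and is
    updated in place. A trial moves to the first node with enough free
    CPUs (best effort); otherwise it becomes PENDING and waits in the queue.
    """
    for trial in trials:
        if trial.node != failed_node or trial.status != RUNNING:
            continue  # Unaffected by the failure.
        target = next(
            (node for node, free in free_cpus.items() if free >= trial.cpus),
            None,
        )
        if target is None:
            # Cluster is saturated: queue the trial instead of restarting it.
            trial.status = PENDING
            trial.node = None
        else:
            free_cpus[target] -= trial.cpus
            trial.node = target
    return trials
```

The point of the sketch is the resource check before resuming: unlike the previous behavior, a trial is only restarted when a surviving node can actually hold it.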
Stephanie Wang
3e33f6f71b
Fix failure handling for actor death (#3359)
* Broadcast actor death, clean up dummy objects

* Reduce logging and clean up state when failing a task

* lint

* Make actor failure test nicer, reduce node timeout
2018-11-21 12:26:22 -08:00
Philipp Moritz
d3697ce4e1
Ready queue refactor to make Dispatching tasks more efficient (#3324)
* put queues outside

* working version, still needs to be optimized

* implement round robin

* proper round robin

* fix spillback

* update

* fix

* cleanup

* more cleanups

* fix

* fix

* add documentation

* explanation for hash combiner

* speed it up

* cleanup and linting

* linting

* comments

* Update scheduling_queue.h

* temp commit

* fixes

* update

* fix

* cleanup

* cleanup

* lint

* more prints

* more prints

* increase sleep

* documentation

* sleep

* fix

* fix

* sleep longer

* update

* fix

* fix

* fix

* Add ordered_set container.

* Fix

* Linting

* Constructors

* Remove O(n) call to list.size().

* fixes

* use ordered set

* Fix.

* Add documentation.

* Add iterators to ordered_set container implementation.

* iterator_type -> iterator

* Make typedefs private

* Add const_iterator

* fix

* fix test

* linting

* lint

* update

* add documentation

* linting
2018-11-20 13:14:12 -08:00
Ujval Misra
b0bfd104f2 Batch heartbeats from node manager together in the monitor. (#3011) 2018-11-20 09:52:27 -08:00
Robert Nishihara
5cbc597494 Suppress duplicate pre-emptive object pushes. (#3276)
* Suppress duplicate pre-emptive object pushes.

* Add test.

* Fix linting

* Remove timer and inline recent_pushes_ into local_objects_.

* Improve test.

* Fix

* Fix linting

* Enable retrying pull from same object manager. Randomize object manager.

* Speed up test

* Linting

* Add test.

* Minor

* Lengthen pull timeout and reissue pull every time a new object becomes available.

* Increase pull timeout in test.

* Wait for nodes to start in object manager test.

* Wait longer for nodes to start up in test.

* Small fixes.

* _submit -> _remote

* Change assert to warning.
2018-11-16 23:02:45 -08:00
Robert Nishihara
60b22d9a72 Don't unsubscribe dependencies for infeasible tasks. (#3338)
* Make scheduling queues RemoveTasks return task states as well.

* Add test

* Don't unsubscribe for infeasible tasks when spilling over.

* Linting

* Address comments.
2018-11-16 11:33:00 -08:00
Robert Nishihara
d10cb570ab Rename _submit -> _remote. (#3321) 2018-11-15 15:30:18 -08:00
Philipp Moritz
1be1455d86 Fix redis crash when duplicate messages are appended to log. (#3316) 2018-11-15 15:09:39 -08:00
Eric Liang
706dc1d473
[rllib] Add test for multi-agent support and fix IMPALA multi-agent (#3289)
IMPALA's multi-agent support was broken because IMPALA requires batches to be of a certain length, whereas multi-agent envs can create variable-length batches.

Fix this by adding zero-padding as needed (similar to the RNN case).
2018-11-14 14:14:07 -08:00
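The zero-padding idea from the IMPALA multi-agent commit above (#3289) can be illustrated with a short sketch. It assumes that the "certain length" means a multiple of a fixed unroll length, which the commit message does not spell out. The function name `pad_batch_to_multiple` and its parameters are hypothetical and do not come from RLlib.

```python
from typing import List, Sequence


def pad_batch_to_multiple(
    rows: Sequence[Sequence[float]], unroll_length: int
) -> List[List[float]]:
    """Zero-pad `rows` so that its length is a multiple of `unroll_length`.

    Each row is one timestep of features. Padding rows are all zeros and
    have the same width as the first real row.
    """
    if unroll_length <= 0:
        raise ValueError("unroll_length must be positive")
    if not rows:
        return []
    width = len(rows[0])
    # Number of zero rows needed to reach the next multiple (0 if aligned).
    num_padding = (unroll_length - len(rows) % unroll_length) % unroll_length
    padded = [list(row) for row in rows]
    padded.extend([0.0] * width for _ in range(num_padding))
    return padded
```

For instance, a three-step batch padded with `unroll_length=4` gains one all-zero row, while a batch that is already aligned is returned unchanged. A real implementation would likely also carry a mask so the padded steps do not contribute to the loss.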
Eric Liang
1660c9d627
Kill actor child processes on shutdown (#3297)
* example

* add env

* test pg

* change to test

* add atexit test

* Update rllib-env.rst

* comment

* revert unnecessary file

* fix title when actor is idle

* Update python/ray/actor.py

Co-Authored-By: ericl <ekhliang@gmail.com>
2018-11-13 19:16:42 -08:00
Eric Liang
65c27c70cf [rllib] Clean up agent resource configurations (#3296)
Closes #3284
2018-11-13 18:00:03 -08:00
Richard Liaw
97f423781b Clean up Ray processes after cluster util exits (#3278) 2018-11-13 13:18:12 -08:00
Eric Liang
bd0dbde149
[rllib] Rename ServingEnv => ExternalEnv (#3302) 2018-11-12 16:31:27 -08:00
Eric Liang
53489d2f85
[sgd] Document and add simple MNIST example (#3236) 2018-11-10 21:52:20 -08:00
Stephanie Wang
d950e92f63
Allow multiple threads to call ray.get and ray.wait (#3244)
* Handle multiple threads calling ray.get

* Multithreaded ray.wait

* Pass in current task ID in java backend

* Add multithreaded actor to tests, add warning messages to worker for multithreaded ray.get

* Fix test

* Some cleanups

* Improve error message

* Add assertion

* Cleanup, throw error in HandleTaskUnblocked if task not actually blocked

* lint

* Fix python worker reset

* Fix references to reconstruct_objects

* Linting

* java lint

* Fix java

* Fix iterator
2018-11-07 22:39:28 -08:00
Richard Liaw
0bab8ed95c
Expose internal config parameters for starting Ray (#3246)
## What do these changes do?

This PR exposes a command-line option for passing in internal config parameters. This is important for letting certain tests (e.g., fault-tolerance tests that remove nodes) run quickly.

Note that this is bad practice and should be replaced with GFLAGS or some equivalent as soon as possible.

#3239 depends on this.

TODO:
 - [x] Add documentation to method arguments before merging.
 - [x] Add test to verify this works?

## Related issue number
2018-11-07 21:46:02 -08:00
Robert Nishihara
1dd5d92789 Enable timeline visualizations of object transfers. (#3255)
* Plot object transfers.

* Linting
2018-11-07 12:45:59 -08:00
Eric Liang
725df3a485 Set the process title in workers and actors (#3219) 2018-11-06 14:59:22 -08:00
Stephanie Wang
bf88aa5013
Increase timeout before reconstruction is triggered (#3217)
* Increase timeout to 10s

* Skip eviction reconstruction tests

* Add stress test for many actors to one

* Fix test by shortening it.

* lower number of processes in stress test

* Skip slow test
2018-11-05 18:03:50 -08:00
Eric Liang
813f51769f [rllib] Fix rllib rollouts script and add test (#3211)
## What do these changes do?

Clean up the checkpointing to handle the new checkpoint dirs. Add a test for `rollout.py`.

## Related issue number

https://github.com/ray-project/ray/issues/3206
https://github.com/ray-project/ray/issues/3204
2018-11-05 00:33:25 -08:00