435 commits

Author SHA1 Message Date
Robert Nishihara
eda6ebb87d Convert some unittests to pytest. (#2779)
* Convert multi_node_test.py to pytest.

* Convert array_test.py to pytest.

* Convert failure_test.py to pytest.

* Convert microbenchmarks to pytest.

* Convert component_failures_test.py to pytest and some minor quotes changes.

* Convert tensorflow_test.py to pytest.

* Convert actor_test.py to pytest.

* Fix.

* Fix
2018-08-31 11:24:15 -07:00
Richard Liaw
0347e6418b
[tune] Add PyTorch MNIST Example + Misc. Tweaks (#2708) 2018-08-30 16:18:56 -07:00
Robert Nishihara
32f7d6fcf5 Add back some tests for xray. (#2772) 2018-08-30 11:07:23 -07:00
Robert Nishihara
132f133214 Limit number of concurrent workers started by hardware concurrency. (#2753)
* Limit number of concurrent workers started by hardware concurrency.

* Check if std::thread::hardware_concurrency() returns 0.

* Pass in max concurrency from Python.

* Fix Java call to startRaylet.

* Fix typo

* Remove unnecessary cast.

* Fix linting.

* Cleanups on Java side.

* Comment back in actor test.

* Require maximum_startup_concurrency to be at least 1.

* Fix linting and test.

* Improve documentation.

* Fix typo.
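
The bullets above describe a cap on how many workers may start at once, with a fallback when hardware concurrency is reported as 0 and a lower bound of 1. A minimal Python sketch of that rule follows. It is an illustration only: `resolve_startup_concurrency` is a hypothetical helper, `os.cpu_count()` stands in for `std::thread::hardware_concurrency()`, and letting an explicitly passed value win over the detected one is an assumption, not something the commit message states.

```python
import os


def resolve_startup_concurrency(requested=None, detected=None):
    """Return how many workers may be starting at the same time.

    Hypothetical helper for illustration; not Ray's actual code.
    """
    if detected is None:
        # os.cpu_count() may return None when the count is unknown.
        detected = os.cpu_count()
    if not detected:
        # Treat "unknown" (None or 0) as a single core so startup can proceed.
        detected = 1
    if requested is None:
        return detected
    if requested < 1:
        raise ValueError("maximum_startup_concurrency must be at least 1")
    # Assumption: a value passed in explicitly is used as the cap.
    return requested
```

For example, `resolve_startup_concurrency(None, detected=0)` falls back to 1 rather than allowing zero workers to start.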
2018-08-29 14:53:40 +08:00
Robert Nishihara
b7722897b4 Deprecate 'driver_mode' argument. (#2758)
* Deprecate 'driver_mode' argument.

* Fix

* Fix
2018-08-28 16:45:49 -07:00
Alexey Tumanov
de047daea7 [xray] raylet scheduling mechanism with a simple spillback policy (#2749)
## What do these changes do?
* distribute load and resource information on a heartbeat
* for each raylet, maintain total and available resource capacity as well as measure of current load
* this PR introduces a new notion of load, defined as a sum of all resource demand induced by queued ready tasks on the local raylet. This provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load.
* modify the scheduling policy to perform *capacity-based*, *load-aware*, *optimistically concurrent* resource allocation
* perform task spillover to the heartbeating node in response to a heartbeat, implementing heterogeneity-aware late-binding/work-stealing.
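
The load definition and the spillover step above can be sketched in a few lines of Python. This is a simplified illustration, not the raylet's C++ implementation: tasks are modeled as plain dicts of resource demands, and `queue_load`, `fits`, and `pick_spillback_tasks` are hypothetical names. The optimistic part is that the sender subtracts from its own copy of the remote node's advertised availability without coordinating with other senders.

```python
def queue_load(ready_tasks):
    """Sum the resource demand of all queued ready tasks, per resource."""
    load = {}
    for demand in ready_tasks:
        for resource, amount in demand.items():
            load[resource] = load.get(resource, 0.0) + amount
    return load


def fits(demand, capacity):
    """True if every requested resource is available in `capacity`."""
    return all(capacity.get(r, 0.0) >= amt for r, amt in demand.items())


def pick_spillback_tasks(ready_tasks, remote_available):
    """Choose queued tasks to forward to the node that just sent a heartbeat.

    Works on a local copy of the remote node's advertised availability.
    """
    remaining = dict(remote_available)
    spilled, kept = [], []
    for demand in ready_tasks:
        if fits(demand, remaining):
            for r, amt in demand.items():
                remaining[r] = remaining.get(r, 0.0) - amt
            spilled.append(demand)
        else:
            kept.append(demand)
    return spilled, kept
```

With a queue of `[{"CPU": 1}, {"CPU": 2, "GPU": 1}, {"CPU": 1}]` and a remote node advertising `{"CPU": 2}`, the two single-CPU tasks are spilled and the GPU task stays local, which is the heterogeneity-aware behavior that a plain task count cannot express.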
2018-08-28 00:03:34 -07:00
Richard Liaw
dbba7f2a53
[autoscaler] Cleanup Logging (#2709)
Moves Autoscaler onto Python `logging` module.
2018-08-25 17:08:45 -07:00
Eric Liang
fbe6c59f72
[rllib] Misc fixes, A2C (#2679)
A bunch of minor rllib fixes:

* pull in latest baselines atari wrapper changes (and use deepmind wrapper by default)
* move reward clipping to policy evaluator
* add a2c variant of a3c
* reduce vision network fc layer size to 256 units
* switch to 84x84 images
* doc tweaks
* print timesteps in tune status
2018-08-20 15:28:03 -07:00
Robert Nishihara
aaf5456b3d Add test that tasks sent to actor on dead node raise exceptions. (#2626)
* Add actor failure test.

* Minor change.

* Make test harder.

* Change numbers a bit.

* Skip test for non xray.
2018-08-16 22:48:31 -07:00
Eric Liang
6670880f03
[rllib] Workaround actor creation hang edge case for ape-X (#2661)
* apex hang

* fix

* move pyt to end
2018-08-16 18:03:50 -07:00
Yuhong Guo
eeb15771ba Add ray.internal.free (#2542) 2018-08-14 22:01:23 -07:00
Stephanie Wang
806fdf2f05 [xray] Object manager retries Pull requests (#2630)
* Move all ObjectManager members to bottom of class def

* Better Pull requests
- suppress duplicate Pulls
- retry the Pull at the next client after a timeout
- cancel a Pull if the object no longer appears on any clients

* increase object manager Pull timeout

* Make the component failure test harder.

* note

* Notify SubscribeObjectLocations caller of empty list

* Address melih's comments

* Fix wait...

* Make component failure test easier for legacy ray

* lint
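
The three Pull behaviors listed in this commit (suppress duplicates, retry at the next client after a timeout, cancel when no client holds the object) can be illustrated with a small bookkeeping class. This is a hedged Python sketch, not the object manager's C++ code; `PullTracker` and its method names are hypothetical, and timers are left to the caller.

```python
class PullTracker:
    """Hypothetical bookkeeping for outstanding Pull requests."""

    def __init__(self):
        # object_id -> {"clients": [client ids], "next": index of next client}
        self._pulls = {}

    def pull(self, object_id, clients):
        """Start a Pull. Returns the client to ask, or None if the Pull is
        a duplicate or no client currently holds the object."""
        if object_id in self._pulls or not clients:
            return None
        self._pulls[object_id] = {"clients": list(clients), "next": 0}
        return self._advance(object_id)

    def _advance(self, object_id):
        state = self._pulls[object_id]
        client = state["clients"][state["next"] % len(state["clients"])]
        state["next"] += 1
        return client

    def on_timeout(self, object_id):
        """Retry at the next client, or return None if the Pull was canceled."""
        if object_id not in self._pulls:
            return None
        return self._advance(object_id)

    def on_locations_changed(self, object_id, clients):
        """Cancel the Pull if the object no longer appears on any client."""
        if object_id not in self._pulls:
            return
        if not clients:
            del self._pulls[object_id]
        else:
            self._pulls[object_id]["clients"] = list(clients)
```

The caller is expected to invoke `on_timeout` from whatever timer it uses; the sketch only decides which client to try next.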
2018-08-13 19:15:55 -07:00
Stephanie Wang
4a7be6f46d [xray] Make sure raylet does not crash if remote raylet dies (#2619)
* Log a warning on remote object manager failures

* Mark a task that failed to be forwarded as pending

* Raylet component failure test and make it harder

* Turn on component failure test for xray

* Remove return status from ReleaseSender

* lint
2018-08-09 20:36:30 -07:00
Stephanie Wang
d49b4bef0a [xray] Basic task reconstruction mechanism (#2526)
## What do these changes do?

This implements basic task reconstruction in raylet. There are two parts to this PR:
1. Reconstruction suppression through the `TaskReconstructionLog`. This prevents two raylets from reconstructing the same task if they decide simultaneously (via the logic in #2497) that reconstruction is necessary.
2. Task resubmission once a raylet becomes responsible for reconstructing a task.

Reconstruction is quite slow in this PR, especially for long chains of dependent tasks. This is mainly due to the lease table mechanism, where nodes may wait too long before trying to reconstruct a task. There are two ways to improve this:
1. Expire entries in the lease table using Redis `PEXPIRE`. This is a WIP and I may include it in this PR.
2. Introduce a "fast path" for reconstructing dependencies of a re-executed task. Normally, we wait for an initial timeout before checking whether a task requires reconstruction. However, if a task requires reconstruction, then it's likely that its dependencies also require reconstruction. In this case, we could skip the initial timeout before checking the GCS to see whether reconstruction is necessary (e.g., if the object has been evicted).

Since handling failures of other raylets is probably not yet complete in master, this only turns back on Python tests for reconstructing evicted objects.
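
One way to picture the suppression step in part 1 is a compare-and-append on a shared per-task log: only the first node to append at the expected position goes on to resubmit. The Python sketch below is an assumption-laden illustration. The commit message does not describe the interface of `TaskReconstructionLog`, so `ReconstructionLog`, `append_at`, and `maybe_reconstruct` are hypothetical, and an in-memory dict stands in for the GCS.

```python
class ReconstructionLog:
    """In-memory stand-in for a shared per-task reconstruction log."""

    def __init__(self):
        self._entries = {}  # task_id -> list of node ids that reconstructed it

    def append_at(self, task_id, index, node_id):
        """Append only if the log currently holds exactly `index` entries."""
        entries = self._entries.setdefault(task_id, [])
        if len(entries) != index:
            return False  # another node already claimed this attempt
        entries.append(node_id)
        return True


def maybe_reconstruct(log, task_id, attempts_seen, node_id, resubmit):
    """Resubmit the task only if this node wins the append."""
    if log.append_at(task_id, attempts_seen, node_id):
        resubmit(task_id)
        return True
    return False
```

If two nodes both observe zero prior attempts and race, only one append succeeds, so the task is resubmitted once.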
2018-08-09 07:24:37 -07:00
Melih Elibol
8ae82180b4 [xray] Adds a driver table. (#2289)
This PR adds a driver table for the new GCS, which enables cleanup functionality associated with monitoring driver death.

Some testing in `monitor_test.py` is restored, but redis sharding for xray is needed to enable remaining tests.
2018-08-08 23:41:40 -07:00
Stephanie Wang
6ab01a2cad [xray] Fix bug when counting a task's lineage size (#2600) 2018-08-08 00:00:17 -07:00
Yuhong Guo
9825da7233 Change training tasks to xray for Jenkins tests (#2567) 2018-08-06 13:35:26 -07:00
Yuhong Guo
d2ebe4d9a3 Fix frequent failure of Jenkins CI. (#2490) 2018-08-02 10:28:28 -07:00
Philipp Moritz
d8ba667175 Convert asserts in unittest to pytest (#2529) 2018-08-01 22:32:10 -07:00
Eric Liang
9ea57c2a93
[rllib] Basic IMPALA implementation (using deepmind's reference vtrace.py) (#2504)
* Rename AsyncSamplesOptimizer -> AsyncReplayOptimizer
* Add AsyncSamplesOptimizer that implements the IMPALA architecture
* integrate V-trace with a3c policy graph
* audit V-trace integration
* benchmark compare vs A3C and with V-trace on/off
PongNoFrameskip-v4 on IMPALA scaling from 16 to 128 workers, solving Pong in <10 min. For reference, solving this env takes ~40 minutes for Ape-X and several hours for A3C.
2018-08-01 20:53:53 -07:00
Robert Nishihara
909d7172b1 Introduce constant for ID_SIZE in python code. (#2517) 2018-07-31 12:40:53 -07:00
Eric Liang
38d00986a5
[rllib] Cleanups: deep merge configs properly; enforce min iter time on APEX (#2500)
The dict merge prevents crashes when tune is trying to get resource requests for agents and you override a config subkey. The min iter time prevents iterations from getting too small, incurring high overhead. This is easy to run into on Ape-X since throughput can get very high.
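
A minimal sketch of a deep merge, to show why overriding a single config subkey is safe with it. The function name `deep_merge` and the example keys are illustrative assumptions and are not taken from the RLlib source.

```python
def deep_merge(base, override):
    """Return a new dict where nested dicts are merged key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so sibling subkeys in `base` survive the override.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical config keys, for illustration only.
defaults = {"optimizer": {"lr": 0.1, "momentum": 0.9}, "num_workers": 2}
config = deep_merge(defaults, {"optimizer": {"lr": 0.01}})
```

A shallow `dict.update` would replace the whole `optimizer` entry and drop `momentum`; the deep merge keeps it while changing only `lr`.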
2018-07-30 13:25:35 -07:00
Philipp Moritz
696a229ece Fix text verbosity in python 2.7 by running tests with pytest (#2470) 2018-07-30 11:04:06 -07:00
Robert Nishihara
2be1ccbd8f Raise application-level exceptions for some failure scenarios. (#2429)
* Raise application level exception for actor methods that can't be executed and failed tasks.

* Retry task forwarding for actor tasks.

* Small cleanups

* Move constant to ray_config.

* Create ForwardTaskOrResubmit method.

* Minor

* Clean up queued tasks for dead actors.

* Some cleanups.

* Linting

* Notify task_dependency_manager_ about failed tasks.

* Manage timer lifetime better.

* Use smart pointers to deallocate the timer.

* Fix

* add comment
2018-07-27 19:53:30 -04:00
Shuo
29451cca82 Add test: running a driver twice. (#2464) 2018-07-27 00:57:52 -07:00
Eric Liang
68660453e4
[rllib] Better support and add two-trainer example for multiagent (#2443)
This adds a simple DQN+PPO example for multi-agent. We don't do anything fancy here, just syncing weights between two separate trainers. This potentially is wasting some compute, but is very simple to set up.

It might be nice to share experience collection between the top-level trainers in the future.
2018-07-22 05:09:25 -07:00
Hao Chen
05f485e274 Allow Ray API to be used from multiple threads (#2422) 2018-07-20 15:39:01 -07:00
Eric Liang
807f309b3a
[test] Fix broken rllib test (#2446)
This fixes the broken build.
2018-07-20 13:47:41 -07:00
Eric Liang
8e75d150f7
[rllib] Apex crash when compress_observations: False (#2426)
We shouldn't try to decompress uncompressed data.

Also, fix resource requests for ddpg + GPU.
2018-07-19 15:58:09 -07:00
Richard Liaw
8e8c733696
[tune] Fix Categorical Space + Add Keras Example (#2401)
Previously did not properly resolve categorical variables for HyperOpt.
2018-07-17 23:52:52 +02:00
Eric Liang
0cecf6b79c
[rllib] Cleanup RNN support and make it work with multi-GPU optimizer (#2394)
Cleanup: TFPolicyGraph now automatically adds loss input entries for state_in_*, so that graph sub-classes don't need to worry about it.

Multi-GPU support:

Allow setting up model tower replicas with existing state input tensors

Truncate the per-device minibatch slices so that they are always a multiple of max_seq_len.
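
The truncation rule can be shown as plain arithmetic. The sketch below assumes an even split of rows across devices before truncation; `per_device_slice_size` is a hypothetical helper and the numbers are made up for illustration.

```python
def per_device_slice_size(total_rows, num_devices, max_seq_len):
    """Rows each device receives, rounded down to a multiple of max_seq_len."""
    per_device = total_rows // num_devices
    return (per_device // max_seq_len) * max_seq_len


# 1000 rows over 3 devices gives 333 rows each, truncated to 320 (16 * 20).
slice_size = per_device_slice_size(1000, 3, 20)
```

Keeping each slice a whole number of sequences means no sequence is split across a slice boundary.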
2018-07-17 06:55:46 +02:00
Robert Nishihara
515da7721a Change ray.worker.cleanup -> ray.shutdown and improve API documentation. (#2374)
* Change ray.worker.cleanup -> ray.shutdown and improve API documentation.

* Deprecate ray.worker.cleanup() gracefully.

* Fix linting
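
A common pattern for deprecating a function gracefully is to keep the old name as a thin wrapper that warns and delegates. The sketch below uses stand-in functions named `shutdown` and `cleanup`; it is not Ray's code, and the commit message does not say how the deprecation was actually implemented.

```python
import warnings


def shutdown():
    """Stand-in for the new entry point."""
    return "shutdown complete"


def cleanup():
    """Deprecated alias that warns and forwards to shutdown()."""
    warnings.warn(
        "cleanup() is deprecated; use shutdown() instead.",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not this wrapper
    )
    return shutdown()
```

Existing callers keep working while seeing a warning that names the replacement.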
2018-07-12 12:00:00 -07:00
Eric Liang
b316afeb43 [rllib] Add debug info back to PPO and fix optimizer compatibility (#2366) 2018-07-12 19:22:46 +02:00
Richard Liaw
0048e77093
[rllib] RLlib CLI (#2375) 2018-07-12 19:12:04 +02:00
Robert Nishihara
54487b1d7f Pin the number of CPUs in failing actor test. (#2368)
* Pin the number of CPUs in failing actor test.

* Pin number of CPUs in multi_node_test.py.

* Fix linting.
2018-07-11 18:34:19 -07:00
Richard Liaw
4d7da9f668
[rllib] Remove "Common", cleanup some code (#2348) 2018-07-08 13:03:53 -07:00
Robert Nishihara
e3534c46df [xray] Re-enable some stress tests and convert stress_tests to pytest. (#2285)
* Fix one of the stress tests, fix ray.global_state.client_table when called early on.

* Re-enable testWait.

* Convert stress_tests.py to pytest.

* Fix
2018-07-06 23:21:00 -07:00
Robert Nishihara
b90e551b41 [xray] Implement timeline and profiling API. (#2306)
* Add profile table and store profiling information there.

* Code for dumping timeline.

* Improve color scheme.

* Push timeline events on driver only for raylet.

* Improvements to profiling and timeline visualization

* Some linting

* Small fix.

* Linting

* Propagate node IP address through profiling events.

* Fix test.

* object_id.hex() should return byte string in python 2.

* Include gcs.fbs in node_manager.fbs.

* Remove flatbuffer definition duplication.

* Decode to unicode in Python 3 and bytes in Python 2.

* Minor

* Submit profile events in a batch. Revert some CMake changes.

* Fix

* Workaround test failure.

* Fix linting

* Linting

* Don't return anything from chrome_tracing_dump when filename is provided.

* Remove some redundancy from profile table.

* Linting

* Move TODOs out of docstring.

* Minor
2018-07-04 23:23:48 -07:00
Richard Liaw
f0ed1c1674
[rllib] Add more regression tests and autogenerate (#2324) 2018-07-02 08:20:53 -07:00
Eric Liang
8aa56c12e6
[rllib] Document "v2" APIs (#2316)
* re

* wip

* wip

* a3c working

* torch support

* pg works

* lint

* rm v2

* consumer id

* clean up pg

* clean up more

* fix python 2.7

* tf session management

* docs

* dqn wip

* fix compile

* dqn

* apex runs

* up

* imports

* ddpg

* quotes

* fix tests

* fix last r

* fix tests

* lint

* pass checkpoint restore

* kwar

* nits

* policy graph

* fix yapf

* com

* class

* pyt

* vectorization

* update

* test cpe

* unit test

* fix ddpg2

* changes

* wip

* args

* faster test

* common

* fix

* add alg option

* batch mode and policy serving

* multi serving test

* todo

* wip

* serving test

* doc async env

* num envs

* comments

* thread

* remove init hook

* update

* fix ppo

* comments1

* fix

* updates

* add jenkins tests

* fix

* fix pytorch

* fix

* fixes

* fix a3c policy

* fix squeeze

* fix trunc on apex

* fix squeezing for real

* update

* remove horizon test for now

* multiagent wip

* update

* fix race condition

* fix ma

* t

* doc

* st

* wip

* example

* wip

* working

* cartpole

* wip

* batch wip

* fix bug

* make other_batches None default

* working

* debug

* nit

* warn

* comments

* fix ppo

* fix obs filter

* update

* wip

* tf

* update

* fix

* cleanup

* cleanup

* spacing

* model

* fix

* dqn

* fix ddpg

* doc

* keep names

* update

* fix

* com

* docs

* clarify model outputs

* Update torch_policy_graph.py

* fix obs filter

* pass thru worker index

* fix

* rename

* vlad torch comments

* fix log action

* debug name

* fix lstm

* remove unused ddpg net

* remove conv net

* revert lstm

* wip

* wip

* cast

* wip

* works

* fix a3c

* works

* lstm util test

* doc

* clean up

* update

* fix lstm check

* move to end

* fix sphinx

* fix cmd

* remove bad doc

* envs

* vec

* doc prep

* models

* rl

* alg

* up

* clarify

* copy

* async sa

* fix

* comments

* fix a3c conf

* tune lstm

* fix reshape

* fix

* back to 16

* tuned a3c update

* update

* tuned

* optional

* merge

* wip

* fix up

* move pg class

* rename env

* wip

* update

* tip

* alg

* readme

* fix catalog

* readme

* doc

* context

* remove prep

* comma

* add env

* link to paper

* paper

* update

* rnn

* update

* wip

* clean up ev creation

* fix

* fix

* fix

* fix lint

* up

* no comma

* ma

* Update run_multi_node_tests.sh

* fix

* sphinx is stupid

* sphinx is stupid

* clarify torch graph

* no horizon

* fix config

* sb

* Update test_optimizers.py
2018-07-01 00:05:08 -07:00
Philipp Moritz
762bdf646e [xray] Put GCS data into the redis data shard (#2298) 2018-06-30 15:42:10 -10:00
Richard Liaw
3cc27d2840
[rllib][asv] Support ASV for RLlib (#2304) 2018-06-28 17:20:09 -07:00
Adam Gleave
89460b8d11 autoscaler: count head node, don't kill below target (fixes #2317) (#2320)
Specifically, subtracts 1 from the target number of workers, taking into
account that the head node has some computational resources.

Do not kill an idle node if it would drop us below the target number of
nodes (in which case we just immediately relaunch).
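
The two rules in this commit message reduce to a little arithmetic, sketched here in Python. The function names are hypothetical and the real autoscaler tracks much more state; this only shows the head-node adjustment and the termination guard.

```python
def target_worker_count(target_nodes):
    """Workers to run once the head node is counted as one of the nodes."""
    return max(0, target_nodes - 1)


def idle_nodes_to_terminate(idle_nodes, num_workers, target_workers):
    """Pick idle nodes to kill without dropping below the target."""
    removable = max(0, num_workers - target_workers)
    return idle_nodes[:removable]
```

With 3 running workers and a target of 3, an idle worker is left alone, which avoids killing it only to relaunch a replacement immediately.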
2018-06-28 15:33:51 -07:00
Eric Liang
b197c0c404
[rllib] General RNN support (#2299)
* wip

* cls

* re

* wip

* wip

* a3c working

* torch support

* pg works

* lint

* rm v2

* consumer id

* clean up pg

* clean up more

* fix python 2.7

* tf session management

* docs

* dqn wip

* fix compile

* dqn

* apex runs

* up

* imports

* ddpg

* quotes

* fix tests

* fix last r

* fix tests

* lint

* pass checkpoint restore

* kwar

* nits

* policy graph

* fix yapf

* com

* class

* pyt

* vectorization

* update

* test cpe

* unit test

* fix ddpg2

* changes

* wip

* args

* faster test

* common

* fix

* add alg option

* batch mode and policy serving

* multi serving test

* todo

* wip

* serving test

* doc async env

* num envs

* comments

* thread

* remove init hook

* update

* fix ppo

* comments1

* fix

* updates

* add jenkins tests

* fix

* fix pytorch

* fix

* fixes

* fix a3c policy

* fix squeeze

* fix trunc on apex

* fix squeezing for real

* update

* remove horizon test for now

* multiagent wip

* update

* fix race condition

* fix ma

* t

* doc

* st

* wip

* example

* wip

* working

* cartpole

* wip

* batch wip

* fix bug

* make other_batches None default

* working

* debug

* nit

* warn

* comments

* fix ppo

* fix obs filter

* update

* wip

* tf

* update

* fix

* cleanup

* cleanup

* spacing

* model

* fix

* dqn

* fix ddpg

* doc

* keep names

* update

* fix

* com

* docs

* clarify model outputs

* Update torch_policy_graph.py

* fix obs filter

* pass thru worker index

* fix

* rename

* vlad torch comments

* fix log action

* debug name

* fix lstm

* remove unused ddpg net

* remove conv net

* revert lstm

* wip

* wip

* cast

* wip

* works

* fix a3c

* works

* lstm util test

* doc

* clean up

* update

* fix lstm check

* move to end

* fix sphinx

* fix cmd

* remove bad doc

* clarify

* copy

* async sa

* fix

* comments

* fix a3c conf

* tune lstm

* fix reshape

* fix

* back to 16

* tuned a3c update

* update

* tuned

* optional

* fix catalog

* remove prep
2018-06-27 22:51:04 -07:00
Eric Liang
1251abf0d1
[rllib] Modularize Torch and TF policy graphs (#2294)
* wip

* cls

* re

* wip

* wip

* a3c working

* torch support

* pg works

* lint

* rm v2

* consumer id

* clean up pg

* clean up more

* fix python 2.7

* tf session management

* docs

* dqn wip

* fix compile

* dqn

* apex runs

* up

* imports

* ddpg

* quotes

* fix tests

* fix last r

* fix tests

* lint

* pass checkpoint restore

* kwar

* nits

* policy graph

* fix yapf

* com

* class

* pyt

* vectorization

* update

* test cpe

* unit test

* fix ddpg2

* changes

* wip

* args

* faster test

* common

* fix

* add alg option

* batch mode and policy serving

* multi serving test

* todo

* wip

* serving test

* doc async env

* num envs

* comments

* thread

* remove init hook

* update

* fix ppo

* comments1

* fix

* updates

* add jenkins tests

* fix

* fix pytorch

* fix

* fixes

* fix a3c policy

* fix squeeze

* fix trunc on apex

* fix squeezing for real

* update

* remove horizon test for now

* multiagent wip

* update

* fix race condition

* fix ma

* t

* doc

* st

* wip

* example

* wip

* working

* cartpole

* wip

* batch wip

* fix bug

* make other_batches None default

* working

* debug

* nit

* warn

* comments

* fix ppo

* fix obs filter

* update

* wip

* tf

* update

* fix

* cleanup

* cleanup

* spacing

* model

* fix

* dqn

* fix ddpg

* doc

* keep names

* update

* fix

* com

* docs

* clarify model outputs

* Update torch_policy_graph.py

* fix obs filter

* pass thru worker index

* fix

* rename

* vlad torch comments

* fix log action

* debug name

* fix lstm

* remove unused ddpg net

* remove conv net

* revert lstm

* cast

* clean up

* fix lstm check

* move to end

* fix sphinx

* fix cmd

* remove bad doc

* clarify

* copy

* async sa

* fix
2018-06-26 13:17:15 -07:00
Eric Liang
a9a26b7560
[rllib] Part 2 of multiagent support (#2286)
* wip

* cls

* re

* wip

* wip

* a3c working

* torch support

* pg works

* lint

* rm v2

* consumer id

* clean up pg

* clean up more

* fix python 2.7

* tf session management

* docs

* dqn wip

* fix compile

* dqn

* apex runs

* up

* imports

* ddpg

* quotes

* fix tests

* fix last r

* fix tests

* lint

* pass checkpoint restore

* kwar

* nits

* policy graph

* fix yapf

* com

* class

* pyt

* vectorization

* update

* test cpe

* unit test

* fix ddpg2

* changes

* wip

* args

* faster test

* common

* fix

* add alg option

* batch mode and policy serving

* multi serving test

* todo

* wip

* serving test

* doc async env

* num envs

* comments

* thread

* remove init hook

* update

* fix ppo

* comments1

* fix

* updates

* add jenkins tests

* fix

* fix pytorch

* fix

* fixes

* fix a3c policy

* fix squeeze

* fix trunc on apex

* fix squeezing for real

* update

* remove horizon test for now

* multiagent wip

* update

* fix race condition

* fix ma

* t

* doc

* st

* wip

* example

* wip

* working

* cartpole

* wip

* batch wip

* fix bug

* make other_batches None default

* working

* debug

* nit

* warn

* comments

* fix ppo

* fix obs filter

* update

* fix obs filter

* pass thru worker index

* fix

* fix log action

* debug name

* fix sphinx
2018-06-25 22:33:57 -07:00
Robert Nishihara
800f7cc77d Make actor handles work in Python mode. (#2283)
* Make actor handles work in local mode.

* Add test for actor handles in local mode.
2018-06-20 23:02:41 -07:00
Robert Nishihara
ff2217251f [xray] Add error table and push error messages to driver through node manager. (#2256)
* Fix documentation indentation.

* Add error table to GCS and push error messages through node manager.

* Add type to error data.

* Linting

* Fix failure_test bug.

* Linting.

* Enable one more test.

* Attempt to fix doc building.

* Restructuring

* Fixes

* More fixes.

* Move current_time_ms function into util.h.
2018-06-20 21:29:28 -07:00
Robert Nishihara
18ee044f03 Re-enable some actor tests. (#2276) 2018-06-20 14:42:35 -07:00
Zongheng Yang
8190ff1fd0 Experimental: enable automatic GCS flushing with configurable policy. (#2266)
* build_credis.sh: use an up-to-date credis commit.

* build_credis.sh: leveldb is updated, so update build cmds for it

* WIP: make monitor.py issue flush; switch gcs client to use credis

* Experimental: enable automatic GCS flushing with configurable policy.

* Fix linux compilation error

* Fix leveldb build

* Use optimized build for credis

* Address comments

* Attempt to fix tests
2018-06-20 14:40:57 -07:00