* add marvil policy graph
* fix typo
* add offline optimizer and enable running marwil
* fix loss function
* add maintaining the moving average of advantage norm
* use sync replay optimizer for unifying
* remove offline optimizer and use sync replay optimizer
* format by yapf
* add imitation learning objective
* fix according to eric's review
* format by yapf
* revise
* add test data
* marwil
## What do these changes do?
1. Fix the Jenkins test failure by add driver id to Actor GCS Key.
2. Move `object_manager_test.py` from Jenkins to Travis.
* mb impala
* fix
* paropt
* update
* cpu warn
* on cpu
* fix mb
* doc
* docs
* comment
* larger num
* early release
* remove grad clip
* only check loader count in multi gpu mode
* revert bad multigpu changes
* num sgd iter
* comment
* reuse optimizer
* add test
* par load test
* loosen test
* Update run_multi_node_tests.sh
* fix local mode
* Update agent.py
## What do these changes do?
This passes in the right obs space to the lstm model wrapper, so that it doesn't attempt to un-flatten the already processed dict observation.
## Related issue number
Closes https://github.com/ray-project/ray/issues/3367
This PR introduces single-node fault tolerance for Tune.
## Previous behavior:
- Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.
## New behavior:
- RUNNING trials will be resumed on another node on a best effort basis (meaning they will run if resources available).
- If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
- During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.
Remaining questions:
- Should `last_result` be consistent during restore?
Yes; but not for earlier trials (trials that are yet to be checkpointed).
- Waiting for some PRs to merge first (#3239)
Closes#2851.
* Suppress duplicate pre-emptive object pushes.
* Add test.
* Fix linting
* Remove timer and inline recent_pushes_ into local_objects_.
* Improve test.
* Fix
* Fix linting
* Enable retrying pull from same object manager. Randomize object manager.
* Speed up test
* Linting
* Add test.
* Minor
* Lengthen pull timeout and reissue pull every time a new object becomes available.
* Increase pull timeout in test.
* Wait for nodes to start in object manager test.
* Wait longer for nodes to start up in test.
* Small fixes.
* _submit -> _remote
* Change assert to warning.
IMPALA support for multiagent was broken since IMPALA has a requirement that batch sizes be of a certain length. However multi-agent envs can create variable-length batches.
Fix this by adding zero-padding as needed (similar to the RNN case).
* example
* add env
* test pg
* change to test
* add atexit test
* Update rllib-env.rst
* comment
* revert unnecessary file
* fix title when actor is idle
* Update python/ray/actor.py
Co-Authored-By: ericl <ekhliang@gmail.com>
This includes most of the TF code used for the OSDI experiment. Perf sanity check on p3.16xl instances: Overall scaling looks ok, with the multi-node results within 5% of OSDI final numbers. This seems reasonable given that hugepages are not enabled here, and the param server shards are placed randomly.
$ RAY_USE_XRAY=1 ./test_sgd.py --gpu --batch-size=64 --num-workers=N \
--devices-per-worker=M --strategy=<simple|ps> \
--warmup --object-store-memory=10000000000
Images per second total
gpus total | simple | ps
========================================
1 | 218
2 (1 worker) | 388
4 (1 worker) | 759
4 (2 workers) | 176 | 623
8 (1 worker) | 985
8 (2 workers) | 349 | 1031
16 (2 nodes, 2 workers) | 600 | 1661
16 (2 nodes, 4 workers) | 468 | 1712 <--- OSDI perf was 1817
Add new search algorithm (genetic) along with the base framework of the searcher (which performs some basic jobs such as logging, recording and organizing in our project).
Note that this is the initial commit. In the following days, we will add example, UT, and other refinements.
It's possible to configure PPO in a way that ends up discarding most of the samples (they are treated as "stragglers"). Add a warning when this happens, and raise an exception if the waste is particularly egregious.
A bunch of minor rllib fixes:
pull in latest baselines atari wrapper changes (and use deepmind wrapper by default)
move reward clipping to policy evaluator
add a2c variant of a3c
reduce vision network fc layer size to 256 units
switch to 84x84 images
doc tweaks
print timesteps in tune status
Rename AsyncSamplesOptimizer -> AsyncReplayOptimizer
Add AsyncSamplesOptimizer that implements the IMPALA architecture
integrate V-trace with a3c policy graph
audit V-trace integration
benchmark compare vs A3C and with V-trace on/off
PongNoFrameskip-v4 on IMPALA scaling from 16 to 128 workers, solving Pong in <10 min. For reference, solving this env takes ~40 minutes for Ape-X and several hours for A3C.
The dict merge prevents crashes when tune is trying to get resource requests for agents and you override a config subkey. The min iter time prevents iterations from getting too small, incurring high overhead. This is easy to run into on Ape-X since throughput can get very high.
This adds a simple DQN+PPO example for multi-agent. We don't do anything fancy here, just syncing weights between two separate trainers. This potentially is wasting some compute, but is very simple to set up.
It might be nice to share experience collection between the top-level trainers in the future.