This includes most of the TF code used for the OSDI experiment. Perf sanity check on p3.16xl instances: overall scaling looks OK, with the multi-node results within 5% of the final OSDI numbers. This seems reasonable given that hugepages are not enabled here and the parameter server shards are placed randomly.
```
$ RAY_USE_XRAY=1 ./test_sgd.py --gpu --batch-size=64 --num-workers=N \
    --devices-per-worker=M --strategy=<simple|ps> \
    --warmup --object-store-memory=10000000000

Images per second (total)

gpus total              | simple |   ps
========================================
 1                      |    218 |
 2 (1 worker)           |    388 |
 4 (1 worker)           |    759 |
 4 (2 workers)          |    176 |  623
 8 (1 worker)           |    985 |
 8 (2 workers)          |    349 | 1031
16 (2 nodes, 2 workers) |    600 | 1661
16 (2 nodes, 4 workers) |    468 | 1712   <--- OSDI perf was 1817
```
Add a new search algorithm (genetic) along with the base framework of the searcher, which handles basic tasks such as logging, recording, and organizing results in our project.
Note that this is the initial commit; examples, unit tests, and other refinements will follow in the coming days.
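For context, genetic search follows the usual mutate-and-select loop over hyperparameter configurations. A minimal, self-contained sketch of that loop (the helper names `sample_config` and `evaluate` are placeholders, not this project's interface):

```
import random

def genetic_search(sample_config, evaluate, population_size=20, generations=10,
                   mutation_rate=0.1):
    """Toy genetic search over hyperparameter dicts (illustration only)."""
    population = [sample_config() for _ in range(population_size)]
    for _ in range(generations):
        # keep the better half of the population as survivors
        survivors = sorted(population, key=evaluate, reverse=True)[:population_size // 2]
        children = []
        while len(survivors) + len(children) < population_size:
            parent_a, parent_b = random.sample(survivors, 2)
            # crossover: take each key from a randomly chosen parent
            child = {k: random.choice([parent_a[k], parent_b[k]]) for k in parent_a}
            # mutation: occasionally re-sample one key
            if random.random() < mutation_rate:
                key = random.choice(list(child))
                child[key] = sample_config()[key]
            children.append(child)
        population = survivors + children
    return max(population, key=evaluate)
```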
It's possible to configure PPO in a way that ends up discarding most of the samples (they are treated as "stragglers"). Add a warning when this happens, and raise an exception if the waste is particularly egregious.
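A minimal sketch of the kind of check this adds (the function name and thresholds below are illustrative, not RLlib's actual values):

```
import logging

logger = logging.getLogger(__name__)

def check_sample_waste(collected_steps, used_steps,
                       warn_fraction=0.1, error_fraction=0.5):
    """Warn if a noticeable share of collected samples is discarded as
    stragglers; raise if most of them are (thresholds made up for this sketch)."""
    wasted = collected_steps - used_steps
    waste_fraction = wasted / float(collected_steps)
    if waste_fraction > error_fraction:
        raise ValueError(
            "{:.0%} of collected samples were discarded as stragglers; "
            "check your sample/train batch size settings.".format(waste_fraction))
    if waste_fraction > warn_fraction:
        logger.warning(
            "{:.0%} of collected samples were discarded as stragglers.".format(
                waste_fraction))
```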
A bunch of minor RLlib fixes:
* pull in the latest baselines Atari wrapper changes (and use the DeepMind wrapper by default)
* move reward clipping into the policy evaluator
* add an A2C variant of A3C
* reduce the vision network FC layer size to 256 units
* switch to 84x84 images
* doc tweaks
* print timesteps in the Tune status output
* Rename AsyncSamplesOptimizer -> AsyncReplayOptimizer
* Add an AsyncSamplesOptimizer that implements the IMPALA architecture
* Integrate V-trace with the A3C policy graph
* Audit the V-trace integration
* Benchmark against A3C, with V-trace on and off (a reference sketch of the V-trace target computation follows this list)
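For reference, the V-trace value targets from the IMPALA paper can be computed with a single backward recursion. A standalone NumPy sketch of that math (an illustration, not RLlib's TF implementation):

```
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, behaviour_logp, target_logp,
                   dones, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace targets v_s per the IMPALA paper (NumPy sketch)."""
    rhos = np.exp(target_logp - behaviour_logp)   # importance ratios pi/mu
    clipped_rhos = np.minimum(rho_bar, rhos)
    cs = np.minimum(c_bar, rhos)
    discounts = gamma * (1.0 - dones)
    values_tp1 = np.append(values[1:], bootstrap_value)
    deltas = clipped_rhos * (rewards + discounts * values_tp1 - values)
    # v_s - V(x_s) = delta_s + gamma*c_s*(v_{s+1} - V(x_{s+1})), computed backwards
    vs_minus_v = np.zeros_like(values)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + discounts[t] * cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v
```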
IMPALA on PongNoFrameskip-v4 scales from 16 to 128 workers, solving Pong in under 10 minutes. For reference, solving this env takes ~40 minutes with Ape-X and several hours with A3C.
The dict merge prevents crashes when Tune tries to get resource requests for agents and you override a config subkey. The min iteration time prevents iterations from getting too short, which would incur high overhead; this is easy to run into with Ape-X since throughput can get very high.
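The merge in question is a recursive (deep) merge of the user's overrides into the default config, so that overriding one nested key keeps its sibling defaults. A minimal sketch of the idea (not the exact Tune helper):

```
def deep_merge(defaults, overrides):
    """Recursively merge `overrides` into a copy of `defaults` so that
    overriding one nested key keeps the other default subkeys intact."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Overriding only optimizer.shards keeps the other optimizer defaults.
defaults = {"lr": 1e-4, "optimizer": {"shards": 1, "debug": False}}
overrides = {"optimizer": {"shards": 4}}
assert deep_merge(defaults, overrides) == {
    "lr": 1e-4, "optimizer": {"shards": 4, "debug": False}}
```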
This adds a simple multi-agent DQN+PPO example. We don't do anything fancy here, just sync weights between two separate trainers. This potentially wastes some compute, but is very simple to set up.
It might be nice to share experience collection between the top-level trainers in the future.
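The syncing amounts to copying each trainer's policy weights into the other trainer at the end of every iteration. A rough sketch of that loop, assuming `get_weights()`/`set_weights()` style accessors and made-up policy names:

```
def train_alternating(dqn_trainer, ppo_trainer, num_iters=100):
    """Alternate training of two trainers that share the same multi-agent env,
    copying policy weights across each iteration (API names assumed)."""
    for i in range(num_iters):
        print("== iteration", i, "==")
        print("-- DQN --", dqn_trainer.train())
        print("-- PPO --", ppo_trainer.train())
        # naive sync: each trainer overwrites the other's copy of the
        # policy it just improved ("dqn_policy"/"ppo_policy" are placeholders)
        ppo_trainer.set_weights(dqn_trainer.get_weights(["dqn_policy"]))
        dqn_trainer.set_weights(ppo_trainer.get_weights(["ppo_policy"]))
```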
Cleanup: TFPolicyGraph now automatically adds loss input entries for state_in_*, so that graph sub-classes don't need to worry about it.
Multi-GPU support:
* Allow setting up model tower replicas with existing state input tensors.
* Truncate the per-device minibatch slices so that they are always a multiple of max_seq_len (see the sketch below).
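The truncation simply rounds each slice length down to the nearest multiple of `max_seq_len`, so no RNN sequence is split across a device boundary. For illustration:

```
def truncate_slice(start, stop, max_seq_len):
    """Round a per-device minibatch slice down to a multiple of max_seq_len
    so that no recurrent sequence is cut at a device boundary (sketch)."""
    length = stop - start
    length -= length % max_seq_len
    return start, start + length

# e.g. a 130-timestep slice with max_seq_len=20 becomes 120 timesteps
assert truncate_slice(0, 130, 20) == (0, 120)
```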
## What do these changes do?
**Vectorized envs**: Users can either implement `VectorEnv`, or alternatively set `num_envs=N` to auto-vectorize gym envs (this vectorizes just the action computation part).
```
# CartPole-v0 on single core with 64x64 MLP:
# vector_width=1:
Actions per second 2720.1284458322966
# vector_width=8:
Actions per second 13773.035334888269
# vector_width=64:
Actions per second 37903.20472563333
```
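Auto-vectorization just wraps `num_envs` copies of the gym env so that the policy can compute actions for all of them in one batched forward pass, while the env stepping itself stays sequential. A toy sketch of the idea (not RLlib's actual `VectorEnv` class):

```
import gym

class SimpleVectorEnv(object):
    """Steps N gym env copies sequentially but exposes batched observations,
    so the policy can compute actions for all of them in one forward pass."""

    def __init__(self, make_env, num_envs):
        self.envs = [make_env() for _ in range(num_envs)]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        results = [env.step(a) for env, a in zip(self.envs, actions)]
        obs, rewards, dones, infos = zip(*results)
        return list(obs), list(rewards), list(dones), list(infos)

# usage: vec = SimpleVectorEnv(lambda: gym.make("CartPole-v0"), 8)
#        obs_batch = vec.reset(); actions = policy.compute_actions(obs_batch)
```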
**Async envs**: The more general form of `VectorEnv` is `AsyncVectorEnv`, which allows agents to execute out of lockstep. We use this as an adapter to support `ServingEnv`. Since we can convert any other form of env to `AsyncVectorEnv`, utils.sampler has been rewritten to run against this interface.
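Conceptually, the async interface replaces the single blocking `step()` call with one call that fetches whichever observations are ready and another that sends actions back. A toy adapter over a vector env like the one sketched above (method names and return shapes are simplified here, not the actual `AsyncVectorEnv` API):

```
class VectorToAsyncAdapter(object):
    """Wraps a vector env behind poll()/send_actions() calls so a sampler
    written against the async interface can also drive ordinary vector envs
    (simplified illustration)."""

    def __init__(self, vector_env):
        self.vector_env = vector_env
        obs = vector_env.reset()
        self.ready = (dict(enumerate(obs)), {}, {})  # obs, rewards, dones

    def poll(self):
        # a synchronous vector env always has every sub-env ready at once;
        # a serving env would instead return only the envs awaiting an action
        return self.ready

    def send_actions(self, actions_by_env):
        actions = [actions_by_env[i] for i in sorted(actions_by_env)]
        obs, rewards, dones, _ = self.vector_env.step(actions)
        self.ready = (dict(enumerate(obs)), dict(enumerate(rewards)),
                      dict(enumerate(dones)))
```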
**Policy serving**: This provides an env that is not stepped by RLlib. Rather, the env executes in its own thread, querying the policy for actions via `self.get_action(obs)` and reporting results via `self.log_returns(rewards)`. We also support logging of off-policy actions via `self.log_action(obs, action)`. This is a more convenient API for some use cases, and it also provides parallelizable support for policy serving (for example, if you start an HTTP server in the env) and for ingesting offline logs (if the env reads from serving logs).
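In this model the environment loop is user code running in its own thread. A minimal sketch of what such an env might look like; only `get_action`, `log_returns`, and `log_action` are the calls described above, while the `run()` structure and the CartPole example are illustrative (the `ServingEnv` import is omitted):

```
import gym

# assumes ServingEnv has been imported from RLlib (module path omitted here)

class CartPoleServing(ServingEnv):
    """Serves a policy to an external CartPole simulation: the env thread
    pulls actions from the policy and reports rewards back (illustration)."""

    def run(self):
        env = gym.make("CartPole-v0")
        while True:
            obs = env.reset()
            done = False
            while not done:
                action = self.get_action(obs)           # query current policy
                obs, reward, done, _ = env.step(action)
                self.log_returns(reward)                # report on-policy reward
            # offline logs could instead be replayed via:
            # self.log_action(logged_obs, logged_action)
```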
Any of these types of envs can be passed to RLlib agents. RLlib handles conversions internally in CommonPolicyEvaluator, for example:
```
gym.Env => rllib.VectorEnv => rllib.AsyncVectorEnv
rllib.ServingEnv => rllib.AsyncVectorEnv
```
* Use F.softmax instead of a pointless network layer
Stateless functions should not be network layers.
* Use correct pytorch functions
* Rename argument to out_size
Matches in_size and makes more sense.
* Fix shapes of tensors
Advantages and rewards both should be scalars, and therefore a list of them
should be 1D.
* Fmt
* replace deprecated function
* rm unnecessary Variable wrapper
* rm all use of torch Variables
Torch does this for us now.
* Ensure that values are flat list
* Fix shape error in conv nets
* fmt
* Fix shape errors
Reshaping the action before stepping in the env fixes a few errors.
* Add TODO
* Use correct filter size
Works when `self.config['model']['channel_major'] = True`.
* Add missing channel major
* Revert reshape of action
This should be handled by the agent or at least in a cleaner way that doesn't
break existing envs.
* Squeeze action
* Squeeze actions along first dimension
This should deal with some cases such as cartpole where actions are scalars
while leaving alone cases where actions are arrays (some robotics tasks).
* try adding pytorch tests
* typo
* fixup docker messages
* Fix A3C for some envs
Pendulum doesn't work since it's an edge case (expects singleton arrays, which
`.squeeze()` collapses to scalars).
* fmt
* nit flake
* small lint
* fix
* add ex
* patch up pbt
* add pbt
* clean up test
* review
* try out a ppo example
* some tweaks to ppo example
* add postprocess hook
* clean up custom explore fn
* improve tune doc
* concepts
* update humanoid
* fix example
* show error file
* working multi action distribution and multiagent model
* currently working but the splits aren't done in the right place
* added shared models
* added categorical support and mountain car example
* now compatible with generalized advantage estimation
* working multiagent code with discrete and continuous example
* moved reshaper to utils
* code review changes made, ppo action placeholder moved to model catalog, all multiagent code moved out of fcnet
* added examples in
* added PEP8 compliance
* examples are mostly pep8 compliant
* removed all flake errors
* added examples to jenkins tests
* fixed custom options bug
* added lines to let docker file find multiagent tests
* shortened example run length
* corrected nits
* fixed flake errors