Rename AsyncSamplesOptimizer -> AsyncReplayOptimizer
Add AsyncSamplesOptimizer that implements the IMPALA architecture
Integrate V-trace with the A3C policy graph
Audit the V-trace integration
Benchmark: compare against A3C, and with V-trace on vs. off
IMPALA on PongNoFrameskip-v4, scaling from 16 to 128 workers, solves Pong in under 10 minutes. For reference, solving this env takes ~40 minutes with Ape-X and several hours with A3C.
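As a rough illustration, a run like this could be launched through Tune roughly as follows (the `num_workers` value and stopping criterion below are illustrative assumptions, not the actual tuned benchmark config):

```python
import ray
from ray.tune import run_experiments

ray.init()
run_experiments({
    "impala-pong": {
        "run": "IMPALA",
        "config": {
            "env": "PongNoFrameskip-v4",
            "num_workers": 128,  # the benchmark scaled this from 16 up to 128
        },
        "stop": {"episode_reward_mean": 18},  # assumed "solved" threshold
    },
})
```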
This also removes the async resetting code in VectorEnv. While that improves benchmark performance slightly, it substantially complicates env configuration and probably isn't worth it for most envs.
This makes it easy to efficiently support setups like Joint PPO: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/retro-contest/gotta_learn_fast_report.pdf
For example, for 188 envs, you could do something like num_workers: 10, num_envs_per_worker: 19.
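To illustrate (assuming those two config keys), the per-worker vectorization multiplies out like this:

```python
# Illustrative config sketch: 10 rollout workers, each stepping 19 envs in
# lockstep, i.e. roughly 190 environments being sampled in parallel.
config = {
    "num_workers": 10,
    "num_envs_per_worker": 19,
}
total_envs = config["num_workers"] * config["num_envs_per_worker"]  # 190
```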
The dict merge prevents crashes when Tune tries to get resource requests for agents and you override a config subkey. The min iter time prevents iterations from becoming too short, which incurs high per-iteration overhead; this is easy to run into with Ape-X since its throughput can get very high.
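A hypothetical illustration of the kind of recursive merge meant here (the helper name and config keys are made up for the example):

```python
def deep_merge(defaults, overrides):
    """Recursively merge `overrides` into a copy of `defaults`, so that
    overriding one nested subkey keeps the rest of that section intact."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"num_workers": 32, "optimizer": {"debug": False, "max_weight_sync_delay": 400}}
overrides = {"optimizer": {"debug": True}}
merged = deep_merge(defaults, overrides)
# The untouched optimizer defaults survive the override, so resource requests
# computed from the merged config don't hit missing keys.
assert merged["optimizer"]["max_weight_sync_delay"] == 400
```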
We should use episode ids instead of the timestep to determine when sequences should be cut, since when batches are concatenated, increasing t does not guarantee we are part of the same episode.
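A toy sketch of cutting on episode id changes (not the actual RLlib helper):

```python
def cut_on_episode_change(episode_ids):
    """Return RNN sequence lengths, starting a new sequence whenever the
    episode id changes, regardless of what the timestep column looks like."""
    seq_lens, length, prev = [], 0, None
    for eps_id in episode_ids:
        if prev is not None and eps_id != prev:
            seq_lens.append(length)
            length = 0
        length += 1
        prev = eps_id
    if length:
        seq_lens.append(length)
    return seq_lens

# Two concatenated episodes with monotonically increasing t still get split:
assert cut_on_episode_change(["ep1", "ep1", "ep1", "ep2", "ep2"]) == [3, 2]
```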
* Prevent hasher from running out of memory on large files (see the sketch after this list)
* dump out keys
* only print if failed
* remove debugging
* Fix lint error; revert the added newline.
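A minimal sketch of the chunked-hashing idea (block size and hash algorithm are arbitrary here):

```python
import hashlib

def hash_file(path, block_size=2 ** 20):
    """Hash a file incrementally so memory use stays bounded even for large files."""
    hasher = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            hasher.update(block)
    return hasher.hexdigest()
```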
Using the actual batch size reduces the risk of mis-accounting. Previously we under-counted samples, since in truncate_episodes mode the batch size was accidentally doubled in policy_evaluator.
This adds a simple DQN+PPO example for multi-agent. We don't do anything fancy here, just syncing weights between two separate trainers. This potentially is wasting some compute, but is very simple to set up.
It might be nice to share experience collection between the top-level trainers in the future.
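A loose sketch of that weight-syncing loop (class names, policy ids, and the env name are illustrative; policy definitions and env registration are omitted):

```python
import ray
from ray.rllib.agents.dqn import DQNAgent
from ray.rllib.agents.ppo import PPOAgent

ray.init()

# Assumed: "my_multiagent_env" is registered elsewhere, and each trainer's
# "multiagent" config defines both a "ppo_policy" and a "dqn_policy" but only
# trains its own.
ppo_trainer = PPOAgent(env="my_multiagent_env",
                       config={"multiagent": {"policies_to_train": ["ppo_policy"]}})
dqn_trainer = DQNAgent(env="my_multiagent_env",
                       config={"multiagent": {"policies_to_train": ["dqn_policy"]}})

for i in range(20):
    print(dqn_trainer.train())
    print(ppo_trainer.train())
    # Nothing fancy: each trainer simply copies the other's latest weights.
    ppo_trainer.set_weights(dqn_trainer.get_weights(["dqn_policy"]))
    dqn_trainer.set_weights(ppo_trainer.get_weights(["ppo_policy"]))
```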
Cleanup: TFPolicyGraph now automatically adds loss input entries for state_in_*, so that graph sub-classes don't need to worry about it.
Multi-GPU support:
Allow setting up model tower replicas with existing state input tensors
Truncate the per-device minibatch slices so that they are always a multiple of max_seq_len.
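For the truncation, a toy illustration of the rounding involved (not the actual multi-GPU optimizer code):

```python
def truncate_to_seq_multiple(slice_size, max_seq_len):
    """Round a per-device slice size down to an exact multiple of max_seq_len,
    so padded RNN sequences are never split across device boundaries."""
    return slice_size - (slice_size % max_seq_len)

# e.g. a 130-sample device slice with max_seq_len=20 is trimmed to 120 samples
assert truncate_to_seq_multiple(130, 20) == 120
```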
* move import_thread to a separate file
* sort imports
* group imports regardless of `from`
* re-organize imports based on Google style
* Update import_thread.py
* fix event_type names in profile statement
* unify duplicate code