A bunch of minor rllib fixes:
- pull in latest baselines atari wrapper changes (and use the deepmind wrapper by default)
- move reward clipping to the policy evaluator
- add an a2c variant of a3c
- reduce the vision network fc layer size to 256 units
- switch to 84x84 images
- doc tweaks
- print timesteps in the tune status
* Move all ObjectManager members to bottom of class def
* Better Pull requests (retry policy sketched after this list)
- suppress duplicate Pulls
- retry the Pull at the next client after a timeout
- cancel a Pull if the object no longer appears on any clients
* increase object manager Pull timeout
* Make the component failure test harder.
* note
* Notify SubscribeObjectLocations caller of empty list
* Address Melih's comments
* Fix wait...
* Make component failure test easier for legacy ray
* lint
* Log a warning on remote object manager failures
* Mark a task that failed to be forwarded as pending
* Raylet component failure test and make it harder
* Turn on component failure test for xray
* Remove return status from ReleaseSender
* lint
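A minimal sketch of the Pull policy above (illustrative Python only; the real logic lives in the C++ ObjectManager, and all names here are hypothetical):
```
import random

class PullState:
    """Toy model of the Pull behavior: one in-flight Pull per object,
    retry at another client on timeout, cancel when no client has it."""

    def __init__(self):
        self.in_flight = {}  # object_id -> client currently being tried

    def pull(self, object_id, clients):
        if object_id in self.in_flight:
            return  # duplicate Pull suppressed
        if not clients:
            self.cancel(object_id)  # object is gone from all clients
            return
        self.in_flight[object_id] = random.choice(clients)
        # ... issue the transfer request; a timer calls on_timeout() ...

    def on_timeout(self, object_id, clients):
        # Retry the Pull at a different client after the timeout.
        failed = self.in_flight.pop(object_id, None)
        remaining = [c for c in clients if c != failed]
        self.pull(object_id, remaining or clients)

    def cancel(self, object_id):
        self.in_flight.pop(object_id, None)
```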
## What do these changes do?
This implements basic task reconstruction in raylet. There are two parts to this PR:
1. Reconstruction suppression through the `TaskReconstructionLog`. This prevents two raylets from reconstructing the same task if they decide simultaneously (via the logic in #2497) that reconstruction is necessary.
2. Task resubmission once a raylet becomes responsible for reconstructing a task.
Reconstruction is quite slow in this PR, especially for long chains of dependent tasks. This is mainly due to the lease table mechanism, where nodes may wait too long before trying to reconstruct a task. There are two ways to improve this:
1. Expire entries in the lease table using Redis `PEXPIRE`. This is a WIP and I may include it in this PR.
2. Introduce a "fast path" for reconstructing dependencies of a re-executed task. Normally, we wait for an initial timeout before checking whether a task requires reconstruction. However, if a task requires reconstruction, then it's likely that its dependencies also require reconstruction. In this case, we could skip the initial timeout before checking the GCS to see whether reconstruction is necessary (e.g., if the object has been evicted).
Since handling failures of other raylets is probably not yet complete in master, this PR only re-enables the Python tests for reconstructing evicted objects.
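To make the suppression idea concrete, here is a minimal sketch (hypothetical names; the real `TaskReconstructionLog` is an append-only GCS log): whichever raylet appends first at a given reconstruction index wins, and everyone else backs off.
```
class TaskReconstructionLog:
    """Toy append-only log: an append at (task_id, index) succeeds
    only for the first node to claim that index."""

    def __init__(self):
        self.entries = {}  # (task_id, reconstruction_index) -> node_id

    def append(self, task_id, index, node_id):
        key = (task_id, index)
        if key in self.entries:
            return False  # another raylet already claimed this attempt
        self.entries[key] = node_id
        return True

log = TaskReconstructionLog()
# Two raylets decide simultaneously that task "t1" needs reconstruction:
assert log.append("t1", 0, "raylet_a")      # wins: resubmits the task
assert not log.append("t1", 0, "raylet_b")  # suppressed: backs off
```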
This PR adds a driver table for the new GCS, which enables cleanup functionality associated with monitoring driver death.
Some testing in `monitor_test.py` is restored, but Redis sharding for xray is needed to enable the remaining tests.
* Rename AsyncSamplesOptimizer -> AsyncReplayOptimizer
* Add an AsyncSamplesOptimizer that implements the IMPALA architecture
* Integrate V-trace with the A3C policy graph (targets sketched below)
* Audit the V-trace integration
* Benchmark against A3C, with V-trace on and off
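For reference, a simplified numpy sketch of the V-trace targets from the IMPALA paper (Espeholt et al., 2018); this is illustrative only, not the RLlib implementation, and it ignores episode boundaries:
```
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """rewards, values, rhos: length-T arrays for one trajectory.
    rhos are the importance ratios pi(a|x) / mu(a|x)."""
    rewards, values, rhos = map(np.asarray, (rewards, values, rhos))
    clipped_rhos = np.minimum(rho_bar, rhos)
    cs = np.minimum(c_bar, rhos)
    values_tp1 = np.append(values[1:], bootstrap_value)
    deltas = clipped_rhos * (rewards + gamma * values_tp1 - values)
    # Backward recursion from the paper:
    #   vs_t - V(x_t) = delta_t + gamma * c_t * (vs_{t+1} - V(x_{t+1}))
    acc, vs_minus_v = 0.0, np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v
```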
Benchmark: PongNoFrameskip-v4 on IMPALA scales from 16 to 128 workers, solving Pong in <10 min. For reference, solving this env takes ~40 minutes for Ape-X and several hours for A3C.
The dict merge prevents crashes when tune tries to get resource requests for agents and you override a config subkey. The min iter time prevents iterations from becoming too short and incurring high overhead, which is easy to run into on Ape-X since throughput can get very high.
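The merge semantics, sketched with a hypothetical helper (not the exact Tune code): overriding a nested subkey should keep the sibling defaults rather than replacing the whole sub-dict.
```
def deep_merge(defaults, overrides):
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"optimizer": {"buffer_size": 10000, "prioritized_replay": True}}
overrides = {"optimizer": {"buffer_size": 2000}}  # override one subkey
assert deep_merge(defaults, overrides) == {
    "optimizer": {"buffer_size": 2000, "prioritized_replay": True}}
```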
* Raise an application-level exception for actor methods that can't be executed and for failed tasks (behavior sketched after this list).
* Retry task forwarding for actor tasks.
* Small cleanups
* Move constant to ray_config.
* Create ForwardTaskOrResubmit method.
* Minor
* Clean up queued tasks for dead actors.
* Some cleanups.
* Linting
* Notify task_dependency_manager_ about failed tasks.
* Manage timer lifetime better.
* Use smart pointers to deallocate the timer.
* Fix
* add comment
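A sketch of the driver-visible behavior this enables (the exact exception class is an assumption): a method call on a dead actor now errors at `ray.get` instead of hanging.
```
import os
import ray

ray.init()

@ray.remote
class Worker:
    def die(self):
        os._exit(0)  # abruptly kill the actor process

    def work(self):
        return "ok"

w = Worker.remote()
w.die.remote()
try:
    ray.get(w.work.remote())  # this task can never execute
except Exception as e:  # surfaced as an application-level error
    print("actor task failed:", e)
```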
This adds a simple DQN+PPO example for multi-agent training. We don't do anything fancy here, just syncing weights between two separate trainers. This potentially wastes some compute, but is very simple to set up.
It might be nice to share experience collection between the top-level trainers in the future.
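A minimal sketch of the syncing loop (hypothetical trainer and policy names; the real example lives in the PR):
```
def train_alternating(dqn_trainer, ppo_trainer, num_iterations=100):
    """Hypothetical driver loop: each trainer collects its own
    experience, then they exchange weights for the shared policies."""
    for _ in range(num_iterations):
        print(dqn_trainer.train())
        print(ppo_trainer.train())
        # Swap weights so each trainer optimizes against the other's
        # latest policy in the shared multi-agent env.
        dqn_trainer.set_weights(ppo_trainer.get_weights(["ppo_policy"]))
        ppo_trainer.set_weights(dqn_trainer.get_weights(["dqn_policy"]))
```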
Cleanup: TFPolicyGraph now automatically adds loss input entries for `state_in_*`, so that graph subclasses don't need to worry about it.
Multi-GPU support:
- Allow setting up model tower replicas with existing state input tensors
- Truncate the per-device minibatch slices so that they are always a multiple of `max_seq_len` (sketched below)
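The truncation rule in isolation (illustrative helper, not the actual multi-GPU code):
```
def truncate_slice(batch, max_seq_len):
    # Cut the slice so RNN sequences are never split across devices.
    usable = len(batch) - len(batch) % max_seq_len
    return batch[:usable]

assert len(truncate_slice(list(range(70)), max_seq_len=20)) == 60
```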
* Fix one of the stress tests, fix ray.global_state.client_table when called early on.
* Re-enable testWait.
* Convert stress_tests.py to pytest.
* Fix
* Add profile table and store profiling information there.
* Code for dumping the timeline (usage sketched after this list).
* Improve color scheme.
* Push timeline events on driver only for raylet.
* Improvements to profiling and timeline visualization
* Some linting
* Small fix.
* Linting
* Propagate node IP address through profiling events.
* Fix test.
* object_id.hex() should return a byte string in Python 2.
* Include gcs.fbs in node_manager.fbs.
* Remove flatbuffer definition duplication.
* Decode to unicode in Python 3 and bytes in Python 2.
* Minor
* Submit profile events in a batch. Revert some CMake changes.
* Fix
* Workaround test failure.
* Fix linting
* Linting
* Don't return anything from chrome_tracing_dump when filename is provided.
* Remove some redundancy from profile table.
* Linting
* Move TODOs out of docstring.
* Minor
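Usage sketch for the timeline dump (the `ray.global_state` namespace is an assumption, mirroring the `client_table` call mentioned elsewhere in this doc; the list above only names `chrome_tracing_dump`):
```
import ray

ray.init()

@ray.remote
def f():
    return 1

ray.get([f.remote() for _ in range(10)])
# Write a Chrome-tracing JSON file; open it at chrome://tracing.
ray.global_state.chrome_tracing_dump(filename="/tmp/timeline.json")
```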
Specifically, this subtracts 1 from the target number of workers, taking into account that the head node has some computational resources (arithmetic sketched below).
Do not kill an idle node if doing so would drop us below the target number of nodes (in which case we would just immediately relaunch it).
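The arithmetic, sketched with hypothetical names:
```
def target_num_workers(requested_workers):
    # The head node contributes resources, so launch one fewer worker.
    return max(0, requested_workers - 1)

def can_terminate_idle_node(num_workers, requested_workers):
    # Keep the node if killing it would drop us below target;
    # we would only relaunch a replacement immediately anyway.
    return num_workers - 1 >= target_num_workers(requested_workers)
```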
* Fix documentation indentation.
* Add error table to GCS and push error messages through node manager.
* Add type to error data.
* Linting
* Fix failure_test bug.
* Linting.
* Enable one more test.
* Attempt to fix doc building.
* Restructuring
* Fixes
* More fixes.
* Move current_time_ms function into util.h.
* build_credis.sh: use an up-to-date credis commit.
* build_credis.sh: leveldb has been updated, so update its build commands
* WIP: make monitor.py issue flushes; switch the GCS client to use credis
* Experimental: enable automatic GCS flushing with configurable policy.
* Fix linux compilation error
* Fix leveldb build
* Use optimized build for credis
* Address comments
* Attempt to fix tests
## What do these changes do?
**Vectorized envs**: Users can either implement `VectorEnv` (sketched after the benchmark below), or alternatively set `num_envs=N` to auto-vectorize gym envs (this vectorizes just the action computation part).
```
# CartPole-v0 on single core with 64x64 MLP:
# vector_width=1:
Actions per second 2720.1284458322966
# vector_width=8:
Actions per second 13773.035334888269
# vector_width=64:
Actions per second 37903.20472563333
```
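A sketch of the `VectorEnv` route (the method names here are assumptions; only the class name appears above): N independent env copies stepped as a batch.
```
import gym

class MyVectorEnv:
    def __init__(self, make_env, num_envs):
        self.envs = [make_env() for _ in range(num_envs)]

    def vector_reset(self):
        return [env.reset() for env in self.envs]

    def vector_step(self, actions):
        results = [env.step(a) for env, a in zip(self.envs, actions)]
        obs, rewards, dones, infos = map(list, zip(*results))
        return obs, rewards, dones, infos

venv = MyVectorEnv(lambda: gym.make("CartPole-v0"), num_envs=8)
batch_obs = venv.vector_reset()  # one observation per sub-env
```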
**Async envs**: The more general form of `VectorEnv` is `AsyncVectorEnv`, which allows agents to execute out of lockstep. We use this as an adapter to support `ServingEnv`. Since we can convert any other form of env to `AsyncVectorEnv`, `utils.sampler` has been rewritten to run against this interface.
**Policy serving**: This provides an env which is not stepped. Rather, the env executes in its own thread, querying the policy for actions via `self.get_action(obs)`, and reporting results via `self.log_returns(rewards)`. We also support logging of off-policy actions via `self.log_action(obs, action)`. This is a more convenient API for some use cases, and also provides parallelizable support for policy serving (for example, if you start a HTTP server in the env) and ingest of offline logs (if the env reads from serving logs).
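A sketch of the serving loop using the API named above (the base-class import path, constructor signature, and the external system are all assumptions):
```
from ray.rllib.env.serving_env import ServingEnv  # import path assumed

class MyServingEnv(ServingEnv):
    """The env runs in its own thread; nothing ever calls step() on it."""

    def __init__(self, external_system, observation_space, action_space):
        super().__init__(observation_space, action_space)  # signature assumed
        self.external_system = external_system  # e.g. an HTTP frontend

    def run(self):
        obs = self.external_system.reset()
        while True:
            action = self.get_action(obs)  # query the policy
            obs, reward = self.external_system.step(action)
            self.log_returns(reward)       # report rewards as they arrive
```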
Any of these types of envs can be passed to RLlib agents. RLlib handles conversions internally in CommonPolicyEvaluator, for example:
```
gym.Env => rllib.VectorEnv => rllib.AsyncVectorEnv
rllib.ServingEnv => rllib.AsyncVectorEnv
```
* Print a warning when defining a very large remote function or actor (example after this list).
* Add weak test.
* Check that warnings appear in test.
* Make wait_for_errors actually fail in failure_test.py.
* Use constants for error types.
* Fix
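An example of the pattern the warning targets (the size threshold is internal to Ray; the array here is just to make the closure large):
```
import numpy as np
import ray

ray.init()
big_array = np.zeros(10**7)  # ~80 MB, captured in the closure below

@ray.remote
def f():
    # big_array is serialized along with the function definition,
    # which is what the new warning flags.
    return big_array.sum()
```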