* Log a warning on remote object manager failures
* Mark a task that failed to be forwarded as pending
* Raylet component failure test and make it harder
* Turn on component failure test for xray
* Remove return status from ReleaseSender
* lint
to support TF version < 1.5
to support rmsprop optimizer in Impala
Before TF 1.5, tf.reduce_sum() and tf.reduce_max() took an argument keep_dims, which was renamed to keepdims in later versions.
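A minimal sketch of how the keyword could be chosen from the installed version. The helper names `reduce_keepdims_kwarg` and `reduce_kwargs` are hypothetical and not part of this change; the 1.5 cutoff is taken from the description above.

```python
def reduce_keepdims_kwarg(tf_version):
    """Return the keyword name this TF version expects for reductions."""
    # Compare only the numeric major/minor components, e.g. "1.4.1" -> (1, 4).
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    return "keepdims" if (major, minor) >= (1, 5) else "keep_dims"


def reduce_kwargs(tf_version, keep=True):
    """Build the kwargs dict to pass to tf.reduce_sum / tf.reduce_max."""
    return {reduce_keepdims_kwarg(tf_version): keep}
```

A call site would then look like `tf.reduce_sum(x, axis=1, **reduce_kwargs(tf.__version__))`.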
The original Impala paper uses the rmsprop algorithm to optimize the model. We should support it as well so that users can reproduce those experiments. Without any tuning, i.e., using the same hyper-parameters as for AdamOptimizer, it reaches "episode_reward_mean": 19.083333333333332 in Pong after consuming 3,610,350 samples.
* use different random number generator to be compatible with older valgrind versions
* seed from time
* style
* fix
* remove more random devices
* also remove random_device from global scheduler
* rename mutex
* linting
## What do these changes do?
This implements basic task reconstruction in raylet. There are two parts to this PR:
1. Reconstruction suppression through the `TaskReconstructionLog`. This prevents two raylets from reconstructing the same task if they decide simultaneously (via the logic in #2497) that reconstruction is necessary.
2. Task resubmission once a raylet becomes responsible for reconstructing a task.
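The suppression idea can be sketched with a toy append-only log: a node only resubmits the task if its append for a given attempt number succeeds. This is an in-memory illustration, not the `TaskReconstructionLog` API; `ToyReconstructionLog`, `append_at`, and `try_reconstruct` are hypothetical names.

```python
from collections import defaultdict


class ToyReconstructionLog:
    """In-memory stand-in for an append-only, per-task log (hypothetical API)."""

    def __init__(self):
        self._entries = defaultdict(list)

    def append_at(self, task_id, index, node_id):
        """Append node_id only if the log for task_id has exactly `index` entries."""
        entries = self._entries[task_id]
        if len(entries) != index:
            # Another node already claimed this reconstruction attempt.
            return False
        entries.append(node_id)
        return True

    def entries(self, task_id):
        return list(self._entries[task_id])


def try_reconstruct(log, task_id, attempt, node_id, resubmit):
    """Resubmit the task only if this node wins the append for `attempt`."""
    if log.append_at(task_id, attempt, node_id):
        resubmit(task_id)
        return True
    return False
```

If two nodes race on the same attempt number, only the first append succeeds, so the task is resubmitted once.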
Reconstruction is quite slow in this PR, especially for long chains of dependent tasks. This is mainly due to the lease table mechanism, where nodes may wait too long before trying to reconstruct a task. There are two ways to improve this:
1. Expire entries in the lease table using Redis `PEXPIRE`. This is a WIP and I may include it in this PR.
2. Introduce a "fast path" for reconstructing dependencies of a re-executed task. Normally, we wait for an initial timeout before checking whether a task requires reconstruction. However, if a task requires reconstruction, then it's likely that its dependencies also require reconstruction. In this case, we could skip the initial timeout before checking the GCS to see whether reconstruction is necessary (e.g., if the object has been evicted).
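Both improvements can be illustrated with a small in-memory model. This is only a sketch under assumptions: `ToyLeaseTable` imitates lease expiry with an injected clock rather than Redis `PEXPIRE`, and `check_delay_ms` is a hypothetical helper for the fast path.

```python
class ToyLeaseTable:
    """In-memory lease table whose entries expire after ttl_ms (hypothetical)."""

    def __init__(self, now_ms):
        self._now_ms = now_ms  # callable returning the current time in ms
        self._expiry = {}      # task_id -> expiry timestamp in ms

    def acquire(self, task_id, ttl_ms):
        """Take the lease if it is free or expired; return True on success."""
        now = self._now_ms()
        if self._expiry.get(task_id, 0) > now:
            # The lease is still held by some node.
            return False
        self._expiry[task_id] = now + ttl_ms
        return True


def check_delay_ms(initial_timeout_ms, parent_is_reexecuted):
    """Fast path: skip the initial wait for dependencies of a re-executed task."""
    return 0 if parent_is_reexecuted else initial_timeout_ms
```

With expiry, a node that finds the lease taken can retry as soon as the TTL passes instead of waiting for a longer fixed interval.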
Since handling failures of other raylets is probably not yet complete in master, this PR only re-enables the Python tests for reconstructing evicted objects.
This PR adds a driver table for the new GCS, which enables cleanup functionality associated with monitoring driver death.
Some testing in `monitor_test.py` is restored, but Redis sharding for xray is needed to enable the remaining tests.
* raylet memory corruption fixes
* add util function to translate boost error to ray status
* tcp client connection now using ray status utility function
* lint
* Add set to lineage cache entry to track nodes already forwarded to.
* Uncommitted lineage function naming, documentation.
* Simple test for uncommitted lineage with a marked task.
* Rebased, changed tests to use ClientID::nil.
* Bug fix, change MergeLineageHelper function type.
* Formatting.
* Checks and test changes based on PR comments.
* GetUncommittedLineage now always returns at least the requested task ID.
* Bug fix (return at least requested task ID)
* Formatting
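A rough Python sketch of the traversal described in the items above, assuming a simplified cache layout. The dictionary structure and the function name `get_uncommitted_lineage` are illustrative only; the actual implementation is in C++ and may differ in detail.

```python
def get_uncommitted_lineage(cache, task_id, node_id):
    """Collect uncommitted ancestors of task_id not yet forwarded to node_id.

    `cache` maps task_id -> {"parents": [...], "committed": bool,
    "forwarded_to": set()}. This layout is hypothetical.
    """
    lineage = set()
    stack = [task_id]
    while stack:
        current = stack.pop()
        entry = cache.get(current)
        if entry is None or current in lineage:
            continue
        if entry["committed"] or node_id in entry["forwarded_to"]:
            # Stop here: committed tasks and tasks already forwarded to
            # node_id do not need to be sent again.
            continue
        lineage.add(current)
        stack.extend(entry["parents"])
    # The requested task is always returned, even if it was skipped above.
    lineage.add(task_id)
    return lineage
```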
Rename AsyncSamplesOptimizer -> AsyncReplayOptimizer
Add AsyncSamplesOptimizer that implements the IMPALA architecture
integrate V-trace with a3c policy graph
audit V-trace integration
benchmark compare vs A3C and with V-trace on/off
PongNoFrameskip-v4 on IMPALA scaling from 16 to 128 workers, solving Pong in <10 min. For reference, solving this env takes ~40 minutes for Ape-X and several hours for A3C.
This also removes the async resetting code in VectorEnv. While that improves benchmark performance slightly, it substantially complicates env configuration and probably isn't worth it for most envs.
This makes it easy to efficiently support setups like Joint PPO: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/retro-contest/gotta_learn_fast_report.pdf
For example, for 188 envs, you could do something like num_envs: 10, num_envs_per_worker: 19.
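The arithmetic behind that split is a ceiling division: 10 × 19 = 190, the smallest multiple of 10 that covers 188 envs. A small sketch, where `envs_per_worker` is a hypothetical helper and treating the first number as the count of parallel groups is an assumption:

```python
def envs_per_worker(total_envs, num_groups):
    """Smallest per-group env count such that all envs are covered."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_envs // num_groups)
```

For 188 envs across 10 groups this gives 19 per group, leaving 2 spare slots.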
The dict merge prevents crashes when tune is trying to get resource requests for agents and you override a config subkey. The min iter time prevents iterations from getting too small, incurring high overhead. This is easy to run into on Ape-X since throughput can get very high.
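To illustrate why a recursive merge matters here: with a shallow `dict.update`, overriding one subkey replaces the whole nested dict and drops its other defaults. The sketch below uses a hypothetical `deep_merge` helper and made-up config keys; it is not the helper added in this change.

```python
import copy


def deep_merge(base, override):
    """Return a new dict where nested dicts in `override` are merged into `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so sibling subkeys in the base config are preserved.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

Overriding only `{"optimizer": {"lr": 0.5}}` then keeps the remaining default subkeys under `optimizer` intact.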
We should use episode ids instead of the timestep to determine when sequences should be cut, since when batches are concatenated, increasing t does not guarantee we are part of the same episode.
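A small sketch of cutting on episode ids rather than on `t`. The function name `split_by_episode` and the flat list of ids are assumptions made for illustration. A sequence ends whenever the episode id changes or the maximum length is reached:

```python
def split_by_episode(eps_ids, max_seq_len):
    """Return sequence lengths, cutting on episode change or max_seq_len."""
    seq_lens = []
    prev_id = None
    count = 0
    for eps_id in eps_ids:
        if count > 0 and (eps_id != prev_id or count >= max_seq_len):
            # Close the current sequence before starting a new one.
            seq_lens.append(count)
            count = 0
        prev_id = eps_id
        count += 1
    if count:
        seq_lens.append(count)
    return seq_lens
```

A concatenated batch whose timesteps happen to keep increasing across two episodes is still split correctly, because only the ids are compared.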
* Prevent hasher from running out of memory on large files
* dump out keys
* only print if failed
* remove debugging
* Fix lint error. Revert adding newline.
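One common way to keep a hasher's memory use bounded on large files is to feed it fixed-size chunks instead of reading the whole file at once. The sketch below assumes SHA-256 and a 1 MiB chunk size; the actual hash function and chunking used in this change are not stated above, and `hash_file` is a hypothetical name.

```python
import hashlib


def hash_file(path, chunk_size=1 << 20):
    """Hash a file incrementally so memory use stays near chunk_size."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```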