Advantage Actor-Critic (A2C, A3C)

Overview

Advantage Actor-Critic comprises two distributed, model-free, on-policy RL algorithms: A3C and A2C. Both are distributed versions of the vanilla Policy Gradient (PG) algorithm and differ only in their execution pattern. The paper proposes accelerating training by scaling data collection: worker nodes carry copies of the central node's policy network and collect data from the environment in parallel. Each worker uses this data to compute gradients; the central node applies these gradients and then sends the updated weights back to the workers.
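
The loop below is a minimal, self-contained sketch of that pattern; it is not RLlib code, and the names and the toy quadratic objective are illustrative assumptions. Workers hold copies of the central weights, compute gradients from their own data, and the central node applies those gradients and broadcasts the updated weights back.

```python
import numpy as np

rng = np.random.default_rng(0)
central_weights = np.zeros(4)              # the central node's policy parameters
worker_data = rng.normal(size=(4, 4))      # one row of toy "experience" per worker

def worker_gradient(weights, data):
    # Stand-in for "run the policy on collected data and backpropagate the
    # actor-critic loss"; here it is just the gradient of 0.5 * ||weights - data||^2.
    return weights - data

for step in range(200):
    # 1. Broadcast: every worker receives a copy of the central weights.
    copies = [central_weights.copy() for _ in range(len(worker_data))]
    # 2. Each worker computes a gradient from its own data (done sequentially
    #    here; the real algorithms do this in parallel on worker nodes).
    grads = [worker_gradient(w, d) for w, d in zip(copies, worker_data)]
    # 3. The central node applies the gradients and the loop repeats.
    #    A2C would fold them into one synchronous update; A3C applies them
    #    one by one as they arrive, possibly computed from stale weights.
    for g in grads:
        central_weights -= 0.05 * g

print(central_weights)  # approaches the mean of the workers' data rows
```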

In A2C, the worker nodes collect data synchronously; the collected experience forms one large batch, from which the central node (the central policy) computes the gradient update. In A3C, by contrast, the worker nodes generate data asynchronously, compute gradients locally, and send those gradients to the central node. Because of this asynchrony, A3C workers may be slightly out of sync with the central node (i.e., they may compute gradients on stale weights), which can introduce bias into learning.
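
As a concrete usage sketch, the snippet below assumes the A2CTrainer (and the analogous A3CTrainer) exported by ray.rllib.agents.a3c in this version of RLlib; the config keys shown are common ones and may differ between releases.

```python
import ray
from ray.rllib.agents.a3c import A2CTrainer  # A3CTrainer lives in the same module

ray.init()

trainer = A2CTrainer(
    env="CartPole-v1",
    config={
        "num_workers": 2,      # parallel rollout workers (the "worker nodes" above)
        "framework": "torch",  # or "tf"
    },
)

for _ in range(3):
    result = trainer.train()                 # one training iteration
    print(result["episode_reward_mean"])

ray.shutdown()
```

Swapping the import to A3CTrainer should run the asynchronous variant with the same basic config.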

Documentation & Implementation:

  1. A2C.

    Detailed Documentation

    Implementation

  2. A3C.

    Detailed Documentation

    Implementation