![]() * Fix QMix, SAC, and MADDPA too. * Unpin gym and deprecate pendulum v0 Many tests in rllib depended on pendulum v0, however in gym 0.21, pendulum v0 was deprecated in favor of pendulum v1. This may change reward thresholds, so will have to potentially rerun all of the pendulum v1 benchmarks, or use another environment in favor. The same applies to frozen lake v0 and frozen lake v1 Lastly, all of the RLlib tests and have been moved to python 3.7 * Add gym installation based on python version. Pin python<= 3.6 to gym 0.19 due to install issues with atari roms in gym 0.20 * Reformatting * Fixing tests * Move atari-py install conditional to req.txt * migrate to new ale install method * Fix QMix, SAC, and MADDPA too. * Unpin gym and deprecate pendulum v0 Many tests in rllib depended on pendulum v0, however in gym 0.21, pendulum v0 was deprecated in favor of pendulum v1. This may change reward thresholds, so will have to potentially rerun all of the pendulum v1 benchmarks, or use another environment in favor. The same applies to frozen lake v0 and frozen lake v1 Lastly, all of the RLlib tests and have been moved to python 3.7 * Add gym installation based on python version. Pin python<= 3.6 to gym 0.19 due to install issues with atari roms in gym 0.20 Move atari-py install conditional to req.txt migrate to new ale install method Make parametric_actions_cartpole return float32 actions/obs Adding type conversions if obs/actions don't match space Add utils to make elements match gym space dtypes Co-authored-by: Jun Gong <jungong@anyscale.com> Co-authored-by: sven1977 <svenmika1977@gmail.com> |
||
---|---|---|
.. | ||
tests | ||
__init__.py | ||
apex.py | ||
ddpg.py | ||
ddpg_tf_model.py | ||
ddpg_tf_policy.py | ||
ddpg_torch_model.py | ||
ddpg_torch_policy.py | ||
noop_model.py | ||
README.md | ||
td3.py |
Deep Deterministic Policy Gradient (DDPG)
Overview
DDPG is a model-free off-policy RL algorithm that works well for environments in the continuous-action domain. DDPG employs two networks, a critic Q-network and an actor network. For stable training, DDPG also opts to use target networks to compute labels for the critic's loss function.
For the critic network, the loss function is the L2 loss between critic output and critic target values. The critic target values are usually computed with a one-step bootstrap from the critic and actor target networks. On the other hand, the actor seeks to maximize the critic Q-values in its loss function. This is done by sampling backpropragable actions (via the reparameterization trick) from the actor and evaluating the critic, with frozen weights, on the generated state-action pairs. Like most off-policy algorithms, DDPG employs a replay buffer, which it samples batches from to compute gradients for the actor and critic networks.
Documentation & Implementation:
-
Deep Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3)
-
Ape-X variant of DDPG (Prioritized Experience Replay)