ray/rllib/algorithms/ddpg
tests/
__init__.py
apex.py
ddpg.py
ddpg_tf_model.py
ddpg_tf_policy.py
ddpg_torch_model.py
ddpg_torch_policy.py
noop_model.py
README.md
td3.py

Deep Deterministic Policy Gradient (DDPG)

Overview

DDPG is a model-free, off-policy RL algorithm that works well for environments with continuous action spaces. It employs two networks: a critic Q-network and an actor network. For stable training, DDPG also maintains target copies of both networks, which are used to compute the labels for the critic's loss function.
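As an illustration of this setup, here is a minimal PyTorch sketch (with assumed layer sizes, not RLlib's actual model code) of the four networks DDPG maintains and the soft ("Polyak") update that keeps the targets slowly tracking the online weights:

```python
# Minimal PyTorch sketch (illustrative, not RLlib's implementation):
# DDPG keeps an online actor and critic plus a target copy of each.
import copy
import torch
import torch.nn as nn

obs_dim, act_dim = 3, 1  # e.g. Pendulum-v1 (assumed for illustration)

actor = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh()
)
critic = nn.Sequential(
    nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1)
)

# Target networks start as exact copies and are never trained directly;
# they are only nudged toward the online weights after each update step.
target_actor = copy.deepcopy(actor)
target_critic = copy.deepcopy(critic)

def soft_update(target: nn.Module, source: nn.Module, tau: float = 0.005) -> None:
    """Polyak-average the online weights into the target network."""
    with torch.no_grad():
        for t_param, s_param in zip(target.parameters(), source.parameters()):
            t_param.mul_(1.0 - tau).add_(tau * s_param)
```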

For the critic network, the loss function is the L2 loss between the critic's output and the critic target values, which are usually computed with a one-step bootstrap from the target critic and target actor networks. The actor, on the other hand, seeks to maximize the critic's Q-values in its loss function. This is done by sampling backpropagatable actions (via the reparameterization trick) from the actor and evaluating the critic, with frozen weights, on the generated state-action pairs. Like most off-policy algorithms, DDPG employs a replay buffer, from which it samples batches to compute gradients for the actor and critic networks.
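The two loss terms described above can be sketched as follows. This is again illustrative rather than RLlib's code; it continues the previous snippet (reusing its actor, critic, target_actor, and target_critic) and uses random tensors as a stand-in for a replay-buffer batch:

```python
import torch
import torch.nn.functional as F

gamma = 0.99
batch_size = 32

# Stand-in for a replay-buffer sample (random data for illustration only).
obs = torch.randn(batch_size, obs_dim)
act = torch.rand(batch_size, act_dim) * 2.0 - 1.0
rew = torch.randn(batch_size, 1)
next_obs = torch.randn(batch_size, obs_dim)
done = torch.zeros(batch_size, 1)

# Critic loss: L2 distance to the one-step bootstrapped target, computed
# with the *target* actor and critic (no gradients flow through the target).
with torch.no_grad():
    next_act = target_actor(next_obs)
    target_q = rew + gamma * (1.0 - done) * target_critic(
        torch.cat([next_obs, next_act], dim=-1)
    )
q = critic(torch.cat([obs, act], dim=-1))
critic_loss = F.mse_loss(q, target_q)

# Actor loss: maximize the critic's value of the actor's own (differentiable)
# actions. In practice the actor optimizer only updates actor parameters,
# so the critic's weights are effectively frozen for this term.
actor_loss = -critic(torch.cat([obs, actor(obs)], dim=-1)).mean()
```

A gradient step would then apply critic_loss through the critic's optimizer, actor_loss through the actor's optimizer, and finally soft-update both target networks.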

Documentation & Implementation:

  1. Deep Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3)

    Detailed Documentation

    Implementation

  2. Ape-X variant of DDPG (Prioritized Experience Replay)

    Detailed Documentation

    Implementation
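For completeness, a minimal usage sketch of the config-object API referenced in the file listing above (DDPGConfig). This assumes a Ray release that ships ray.rllib.algorithms.ddpg; exact config methods can differ between versions:

```python
from ray.rllib.algorithms.ddpg import DDPGConfig

# Build a DDPG algorithm on a continuous-action environment.
config = DDPGConfig().environment(env="Pendulum-v1").framework("torch")
algo = config.build()

for _ in range(3):
    result = algo.train()
    print(result["episode_reward_mean"])
```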