[RLlib] Improved Documentation for PPO, DDPG, and SAC (#12943)

Michael Luo 2020-12-24 06:31:35 -08:00 committed by GitHub
parent a2d1215200
commit 4bcd475671
3 changed files with 51 additions and 11 deletions


@@ -1 +1,21 @@
Implementation of deep deterministic policy gradients (https://arxiv.org/abs/1509.02971), including an Ape-X variant.
# Deep Deterministic Policy Gradient (DDPG)
## Overview
[DDPG](https://arxiv.org/abs/1509.02971) is a model-free off-policy RL algorithm that works well for environments in the continuous-action domain. DDPG employs two networks, a critic Q-network and an actor network. For stable training, DDPG also opts to use target networks to compute labels for the critic's loss function.
For the critic network, the loss function is the L2 loss between the critic's output and the critic target values. The critic target values are usually computed with a one-step bootstrap from the critic and actor target networks. The actor, on the other hand, seeks to maximize the critic's Q-values in its loss function. This is done by sampling backpropagable actions (via the reparameterization trick) from the actor and evaluating the critic, with frozen weights, on the generated state-action pairs. Like most off-policy algorithms, DDPG employs a replay buffer, from which it samples batches to compute gradients for the actor and critic networks.
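As a rough sketch of the two losses described above, consider the following plain-PyTorch pseudocode. Tensor and function names are illustrative, not RLlib's internal loss code.

```python
import torch
import torch.nn.functional as F

def ddpg_losses(batch, actor, critic, target_actor, target_critic, gamma=0.99):
    # Batch sampled from the replay buffer.
    obs, actions, rewards, next_obs, dones = batch

    # Critic target: one-step bootstrap computed from the *target* actor and critic.
    with torch.no_grad():
        next_actions = target_actor(next_obs)
        target_q = rewards + gamma * (1.0 - dones) * target_critic(next_obs, next_actions)

    # Critic loss: L2 distance between current Q estimates and the bootstrapped targets.
    critic_loss = F.mse_loss(critic(obs, actions), target_q)

    # Actor loss: maximize Q for the actor's own actions. The critic stays "frozen"
    # here because only the actor's optimizer is stepped on this loss.
    actor_loss = -critic(obs, actor(obs)).mean()
    return critic_loss, actor_loss
```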
## Documentation & Implementation:
1) Deep Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3)
**[Detailed Documentation](https://docs.ray.io/en/latest/rllib-algorithms.html#ddpg)**
**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/ddpg/ddpg.py)**
2) Ape-X variant of DDPG (Prioritized Experience Replay)
**[Detailed Documentation](https://docs.ray.io/en/latest/rllib-algorithms.html#apex)**
**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/ddpg/ddpg.py)**


@@ -1,7 +1,22 @@
Proximal Policy Optimization (PPO)
==================================
# Proximal Policy Optimization (PPO)
Implementations of:
## Overview
[PPO](https://arxiv.org/abs/1707.06347) is a model-free on-policy RL algorithm that works well for both discrete and continuous action space environments. PPO utilizes an actor-critic framework, where there are two networks, an actor (policy network) and a critic (value function network).
There are two formulations of PPO, both of which are implemented in RLlib. The first formulation imitates the prior paper [TRPO](https://arxiv.org/abs/1502.05477) without the complexity of second-order optimization. In this formulation, an old version of the actor network is saved at every iteration, and the agent optimizes the RL objective while staying close to the old policy. This keeps the agent from destabilizing during training. The second formulation mitigates destructively large policy updates, an issue observed with vanilla policy gradient methods, by introducing a clipped surrogate objective that bounds the action probability ratio between the current and old policy. Clipping has been shown in the paper to significantly improve training stability and speed.
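As an illustration of the second formulation, here is a minimal sketch of the clipped surrogate objective in plain PyTorch (not RLlib's internal loss code; the clip parameter value is illustrative):

```python
import torch

def ppo_clip_objective(logp_new, logp_old, advantages, clip_param=0.3):
    """Clipped surrogate: mean of min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    ratio = torch.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param)
    # The elementwise minimum caps the incentive for overly large policy updates.
    return torch.min(ratio * advantages, clipped_ratio * advantages).mean()
```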
## Distributed PPO Algorithms
PPO is a core algorithm in RLlib due to its ability to scale well with the number of nodes. In RLlib, we provide various implementations of distributed PPO, with different underlying execution plans, as shown below.
Distributed baseline PPO is a synchronous distributed RL algorithm. Data collection nodes, which represent the old policy, gather data synchronously to create a large pool of on-policy data, on which the agent performs minibatch gradient descent.
On the other hand, Asynchronous PPO (APPO) opts to imitate IMPALA as its distributed execution plan. Data collection nodes gather data asynchronously, and the data are collected in a circular replay buffer. A target network and a doubly importance-sampled surrogate objective are introduced to enforce training stability in the asynchronous data-collection setting.
Lastly, Decentralized Distributed PPO (DDPPO) removes the assumption that gradient updates must be done on a central node. Instead, gradients are computed remotely on each data collection node and all-reduced at each minibatch using torch.distributed. This allows each worker's GPU to be used both for sampling and for training.
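As a minimal usage sketch (using the Ray 1.x `agents` API; the environment name and config values below are only illustrative):

```python
import ray
from ray.rllib.agents import ppo

ray.init()

# Baseline synchronous PPO with two remote data-collection workers.
# Swap in ppo.APPOTrainer or ppo.DDPPOTrainer (the latter requires the
# torch framework) to use the other distributed execution plans.
trainer = ppo.PPOTrainer(env="CartPole-v0", config={"num_workers": 2})

for _ in range(3):
    result = trainer.train()
    print(result["episode_reward_mean"])
```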
## Documentation & Implementation:
1) Proximal Policy Optimization (PPO).
@@ -9,15 +24,14 @@ Implementations of:
**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/ppo.py)**
2) Asynchronous Proximal Policy Optimization (APPO).
2) [Asynchronous Proximal Policy Optimization (APPO)](https://arxiv.org/abs/1912.00167).
**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#appo)**
**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/appo.py)**
3) Decentralized Distributed Proximal Policy Optimization (DDPPO)
3) [Decentralized Distributed Proximal Policy Optimization (DDPPO)](https://arxiv.org/abs/1911.00357)
**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#decentralized-distributed-proximal-policy-optimization-dd-ppo)**
**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/ddppo.py)**


@@ -1,10 +1,16 @@
Soft Actor Critic (SAC)
=======================
# Soft Actor Critic (SAC)
Implementations of:
## Overview
Soft Actor-Critic Algorithm (SAC) and a discrete action extension.
[SAC](https://arxiv.org/abs/1801.01290) is a state-of-the-art model-free off-policy RL algorithm that performs remarkably well on continuous-control domains. SAC employs an actor-critic framework and combats high sample complexity and training instability by learning within a maximum-entropy framework. Unlike the standard RL objective, which aims to maximize the sum of future rewards, SAC seeks to optimize the sum of rewards plus the expected entropy of the current policy. In addition to optimizing an actor and critic with entropy-based objectives, SAC also optimizes the entropy coefficient itself.
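The following is a minimal sketch of the entropy-augmented critic target described above (PyTorch-style pseudocode; the `actor.sample` interface and tensor names are assumptions for illustration, not RLlib internals):

```python
import torch

def sac_q_target(rewards, dones, next_obs, actor, target_q1, target_q2,
                 alpha, gamma=0.99):
    """One-step SAC target: reward + gamma * (min target Q - alpha * log pi)."""
    with torch.no_grad():
        # Reparameterized action sample from the current policy and its log-probability.
        next_actions, next_logp = actor.sample(next_obs)
        # Clipped double-Q: take the minimum of the two target critics.
        next_q = torch.min(target_q1(next_obs, next_actions),
                           target_q2(next_obs, next_actions))
        # The -alpha * log pi term is the entropy bonus; alpha is the entropy
        # coefficient, which SAC also learns.
        return rewards + gamma * (1.0 - dones) * (next_q - alpha * next_logp)
```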
## Documentation & Implementation:
[Soft Actor-Critic Algorithm (SAC)](https://arxiv.org/abs/1801.01290), along with [discrete-action support](https://arxiv.org/abs/1910.07207).
**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#sac)**
**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/sac/sac.py)**