From 4bcd47567183b2fc5e618443e187d7fe7715e8f1 Mon Sep 17 00:00:00 2001
From: Michael Luo
Date: Thu, 24 Dec 2020 06:31:35 -0800
Subject: [PATCH] [RLlib] Improved Documentation for PPO, DDPG, and SAC (#12943)

---
 rllib/agents/ddpg/README.md | 22 +++++++++++++++++++++-
 rllib/agents/ppo/README.md  | 26 ++++++++++++++++++++------
 rllib/agents/sac/README.md  | 14 ++++++++++----
 3 files changed, 51 insertions(+), 11 deletions(-)

diff --git a/rllib/agents/ddpg/README.md b/rllib/agents/ddpg/README.md
index 93c32b0a2..5d4f10b80 100644
--- a/rllib/agents/ddpg/README.md
+++ b/rllib/agents/ddpg/README.md
@@ -1 +1,21 @@
-Implementation of deep deterministic policy gradients (https://arxiv.org/abs/1509.02971), including an Ape-X variant.
+# Deep Deterministic Policy Gradient (DDPG)
+
+## Overview
+
+[DDPG](https://arxiv.org/abs/1509.02971) is a model-free, off-policy RL algorithm that works well in continuous-action environments. DDPG employs two networks, a critic Q-network and an actor network. For stable training, DDPG also uses target networks to compute the labels for the critic's loss function.
+
+For the critic network, the loss function is the L2 loss between the critic's output and the critic target values. The critic target values are computed with a one-step bootstrap from the target critic and target actor networks. The actor, in turn, seeks to maximize the critic's Q-values in its loss function. This is done by sampling actions from the actor in a way that keeps them differentiable with respect to the actor's parameters (the reparameterization trick) and evaluating the critic, with frozen weights, on the resulting state-action pairs. Like most off-policy algorithms, DDPG uses a replay buffer from which it samples batches to compute gradients for the actor and critic networks.
+
+## Documentation & Implementation:
+
+1) Deep Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3)
+
+    **[Detailed Documentation](https://docs.ray.io/en/latest/rllib-algorithms.html#ddpg)**
+
+    **[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/ddpg/ddpg.py)**
+
+2) Ape-X variant of DDPG (Prioritized Experience Replay)
+
+    **[Detailed Documentation](https://docs.ray.io/en/latest/rllib-algorithms.html#apex)**
+
+    **[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/ddpg/apex.py)**
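The critic and actor losses described in the DDPG overview above can be summarized in a short, PyTorch-style sketch. This is illustrative only and not RLlib's actual `ddpg.py` implementation; the `ddpg_losses` helper, the network modules, and the batch tensors are assumed for the example:

```python
import torch
import torch.nn.functional as F


def ddpg_losses(actor, critic, target_actor, target_critic, batch, gamma=0.99):
    obs, actions, rewards, next_obs, dones = batch

    # Critic target: one-step bootstrap through the *target* actor and critic networks.
    with torch.no_grad():
        next_actions = target_actor(next_obs)
        target_q = rewards + gamma * (1.0 - dones) * target_critic(next_obs, next_actions)

    # Critic loss: L2 distance between the current Q-estimates and the bootstrapped targets.
    critic_loss = F.mse_loss(critic(obs, actions), target_q)

    # Actor loss: maximize Q under the current (deterministic) policy, i.e. minimize -Q(s, pi(s)).
    # In practice only the actor's optimizer is stepped on this loss, which is what
    # "frozen" critic weights amounts to.
    actor_loss = -critic(obs, actor(obs)).mean()

    return critic_loss, actor_loss
```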
diff --git a/rllib/agents/ppo/README.md b/rllib/agents/ppo/README.md
index 1a11124f5..4095f3550 100644
--- a/rllib/agents/ppo/README.md
+++ b/rllib/agents/ppo/README.md
@@ -1,7 +1,22 @@
-Proximal Policy Optimization (PPO)
-==================================
+# Proximal Policy Optimization (PPO)
 
-Implementations of:
+## Overview
+
+[PPO](https://arxiv.org/abs/1707.06347) is a model-free, on-policy RL algorithm that works well in both discrete and continuous action spaces. PPO uses an actor-critic framework with two networks: an actor (the policy network) and a critic (the value function).
+
+There are two formulations of PPO, both of which are implemented in RLlib. The first formulation follows its predecessor [TRPO](https://arxiv.org/abs/1502.05477) but avoids the complexity of second-order optimization: at every iteration, an old version of the actor network is saved, and the agent optimizes the RL objective while staying close to this old policy, which keeps training from destabilizing. The second formulation mitigates destructively large policy updates, an issue observed with vanilla policy gradient methods, by introducing a clipped surrogate objective that clips large action-probability ratios between the current and the old policy. Clipping was shown in the paper to significantly improve training stability and speed.
+
+## Distributed PPO Algorithms
+
+PPO is a core algorithm in RLlib because it scales well with the number of nodes. RLlib provides several implementations of distributed PPO with different underlying execution plans, described below.
+
+Distributed baseline PPO is a synchronous distributed RL algorithm: data-collection workers, which run the old policy, gather samples synchronously to build a large pool of on-policy data, and the agent then performs minibatch gradient descent on that pool.
+
+Asynchronous PPO (APPO), on the other hand, follows IMPALA's distributed execution plan: data-collection workers gather samples asynchronously and write them into a circular replay buffer. A target network and a doubly importance-sampled surrogate objective are introduced to keep training stable in this asynchronous data-collection setting.
+
+Lastly, Decentralized Distributed PPO (DDPPO) removes the assumption that gradient updates must be performed on a central node. Instead, gradients are computed remotely on each data-collection worker and all-reduced at each minibatch using torch.distributed, so each worker's GPU is used both for sampling and for training.
+
+## Documentation & Implementation:
 
 1) Proximal Policy Optimization (PPO).
 
@@ -9,15 +24,14 @@ Implementations of:
    **[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#ppo)**
 
    **[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/ppo.py)**
 
-2) Asynchronous Proximal Policy Optimization (APPO).
+2) [Asynchronous Proximal Policy Optimization (APPO)](https://arxiv.org/abs/1912.00167).
 
    **[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#appo)**
 
   **[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/appo.py)**
 
-3) Decentralized Distributed Proximal Policy Optimization (DDPPO)
+3) [Decentralized Distributed Proximal Policy Optimization (DDPPO)](https://arxiv.org/abs/1911.00357)
 
   **[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#decentralized-distributed-proximal-policy-optimization-dd-ppo)**
 
   **[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/ddppo.py)**
-
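The clipped surrogate objective described in the PPO overview above fits in a few lines. The following is an illustrative sketch in plain PyTorch, not RLlib's loss code; `logp_new`, `logp_old`, and `advantages` are assumed to be per-sample tensors computed from a rollout batch:

```python
import torch


def ppo_clip_loss(logp_new, logp_old, advantages, clip_param=0.2):
    # Probability ratio between the current policy and the frozen "old" policy.
    ratio = torch.exp(logp_new - logp_old)

    # Clipping the ratio to [1 - eps, 1 + eps] removes the incentive for
    # destructively large policy updates.
    clipped = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param)

    # Pessimistic (elementwise minimum) surrogate objective, negated because
    # optimizers minimize.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

Taking the minimum of the clipped and unclipped terms makes the objective a pessimistic bound, so the policy gains nothing from pushing the probability ratio far outside the clip range.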
diff --git a/rllib/agents/sac/README.md b/rllib/agents/sac/README.md
index 8aa0c4c45..13ce9a644 100644
--- a/rllib/agents/sac/README.md
+++ b/rllib/agents/sac/README.md
@@ -1,10 +1,16 @@
-Soft Actor Critic (SAC)
-=======================
+# Soft Actor Critic (SAC)
 
-Implementations of:
+## Overview
 
-Soft Actor-Critic Algorithm (SAC) and a discrete action extension.
+[SAC](https://arxiv.org/abs/1801.01290) is a state-of-the-art, model-free, off-policy RL algorithm that performs remarkably well in continuous-control domains. SAC uses an actor-critic framework and addresses high sample complexity and training instability by learning within a maximum-entropy framework. Unlike the standard RL objective, which maximizes only the expected sum of future rewards, SAC maximizes the sum of rewards plus the expected entropy of the current policy. In addition to training an actor and a critic with entropy-based objectives, SAC also learns the entropy coefficient itself.
+
+## Documentation & Implementation:
+
+[Soft Actor-Critic Algorithm (SAC)](https://arxiv.org/abs/1801.01290), together with [discrete-action support](https://arxiv.org/abs/1910.07207).
 
    **[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#sac)**
 
    **[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/sac/sac.py)**
+
+
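To make the maximum-entropy objective and the learned entropy coefficient from the SAC overview concrete, here is a small illustrative sketch in plain PyTorch, not RLlib's `sac.py` implementation; the policy and critic modules are assumed stand-ins, and the tanh action squashing used in practice is omitted:

```python
import torch


def sac_actor_and_alpha_losses(policy_net, critic, log_alpha, obs, target_entropy):
    # The (assumed) policy network outputs the mean and log-std of a Gaussian.
    mean, log_std = policy_net(obs)
    dist = torch.distributions.Normal(mean, log_std.exp())
    actions = dist.rsample()               # reparameterized, differentiable sample
    logp = dist.log_prob(actions).sum(-1)  # log pi(a|s), summed over action dimensions

    alpha = log_alpha.exp().detach()

    # Actor loss: maximize Q plus the entropy bonus, i.e. minimize alpha * log_pi - Q.
    # The (assumed) critic returns one Q-value per batch element.
    actor_loss = (alpha * logp - critic(obs, actions)).mean()

    # Entropy-coefficient loss: adjust alpha so the policy's entropy tracks a chosen
    # target entropy (alpha grows when the policy becomes too deterministic).
    alpha_loss = -(log_alpha * (logp + target_entropy).detach()).mean()

    return actor_loss, alpha_loss
```

A common default for `target_entropy` in continuous control is the negative of the action dimensionality.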