From 7c925fe99fc1457cc9ecd9d01be96c9ed9fd6ab1 Mon Sep 17 00:00:00 2001 From: Christy Bergman Date: Wed, 13 Jul 2022 01:46:36 -0700 Subject: [PATCH] [RLlib; docs] Re-organize algorithms so TOC matches README. (#26339) --- doc/source/rllib/rllib-algorithms.rst | 898 +++++++++++++------------- rllib/README.rst | 4 +- 2 files changed, 447 insertions(+), 455 deletions(-) diff --git a/doc/source/rllib/rllib-algorithms.rst b/doc/source/rllib/rllib-algorithms.rst index 8ccd984db..78cfb4a77 100644 --- a/doc/source/rllib/rllib-algorithms.rst +++ b/doc/source/rllib/rllib-algorithms.rst @@ -75,105 +75,98 @@ Algorithm Frameworks Discrete Actions Continuous A .. _`TS`: rllib-algorithms.html#lints .. _`LinUCB`: rllib-algorithms.html#lin-ucb -High-throughput architectures -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Offline +~~~~~~~ -.. _apex: +.. _bc: -Distributed Prioritized Experience Replay (Ape-X) -------------------------------------------------- +Behavior Cloning (BC; derived from MARWIL implementation) +--------------------------------------------------------- |pytorch| |tensorflow| -`[paper] `__ -`[implementation] `__ -Ape-X variations of DQN and DDPG (`APEX_DQN `__, `APEX_DDPG `__) use a single GPU learner and many CPU workers for experience collection. Experience collection can scale to hundreds of CPU workers due to the distributed prioritization of experience prior to storage in replay buffers. +`[paper] `__ +`[implementation] `__ -.. figure:: images/apex-arch.svg +Our behavioral cloning implementation is directly derived from our `MARWIL`_ implementation, +with the only difference being the ``beta`` parameter force-set to 0.0. This makes +BC try to match the behavior policy, which generated the offline data, disregarding any resulting rewards. +BC requires the `offline datasets API `__ to be used. - Ape-X architecture +Tuned examples: `CartPole-v0 `__ -Tuned examples: `PongNoFrameskip-v4 `__, `Pendulum-v1 `__, `MountainCarContinuous-v0 `__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 `__. +**BC-specific configs** (see also `common configs `__): -**Atari results @10M steps**: `more details `__ - -============= ================================ ======================================== - Atari env RLlib Ape-X 8-workers Mnih et al Async DQN 16-workers -============= ================================ ======================================== -BeamRider 6134 ~6000 -Breakout 123 ~50 -Qbert 15302 ~1200 -SpaceInvaders 686 ~600 -============= ================================ ======================================== - -**Scalability**: - -============= ================================ ======================================== - Atari env RLlib Ape-X 8-workers @1 hour Mnih et al Async DQN 16-workers @1 hour -============= ================================ ======================================== -BeamRider 4873 ~1000 -Breakout 77 ~10 -Qbert 4083 ~500 -SpaceInvaders 646 ~300 -============= ================================ ======================================== - -.. figure:: images/apex.png - - Ape-X using 32 workers in RLlib vs vanilla DQN (orange) and A3C (blue) on PongNoFrameskip-v4. - -**Ape-X specific configs** (see also `common configs `__): - -.. literalinclude:: ../../../rllib/algorithms/apex_dqn/apex_dqn.py +.. literalinclude:: ../../../rllib/algorithms/bc/bc.py :language: python :start-after: __sphinx_doc_begin__ :end-before: __sphinx_doc_end__ -.. _impala: +.. 
_crr: -Importance Weighted Actor-Learner Architecture (IMPALA) -------------------------------------------------------- -|pytorch| |tensorflow| -`[paper] `__ -`[implementation] `__ -In IMPALA, a central learner runs SGD in a tight loop while asynchronously pulling sample batches from many actor processes. RLlib's IMPALA implementation uses DeepMind's reference `V-trace code `__. Note that we do not provide a deep residual network out of the box, but one can be plugged in as a `custom model `__. Multiple learner GPUs and experience replay are also supported. +Critic Regularized Regression (CRR) +----------------------------------- +|pytorch| +`[paper] `__ `[implementation] `__ -.. figure:: images/impala-arch.svg +CRR is another offline RL algorithm based on Q-learning that can learn from an offline experience replay. +The challenge in applying existing Q-learning algorithms to offline RL lies in the overestimation of the Q-function, as well as, the lack of exploration beyond the observed data. +The latter becomes increasingly important during bootstrapping in the bellman equation, where the Q-function queried for the next state's Q-value(s) does not have support in the observed data. +To mitigate these issues, CRR implements a simple and yet powerful idea of "value-filtered regression". +The key idea is to use a learned critic to filter-out the non-promising transitions from the replay dataset. For more details, please refer to the paper (see link above). - IMPALA architecture +Tuned examples: `CartPole-v0 `__, `Pendulum-v1 `__ -Tuned examples: `PongNoFrameskip-v4 `__, `vectorized configuration `__, `multi-gpu configuration `__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 `__ - -**Atari results @10M steps**: `more details `__ - -============= ================================== ==================================== - Atari env RLlib IMPALA 32-workers Mnih et al A3C 16-workers -============= ================================== ==================================== -BeamRider 2071 ~3000 -Breakout 385 ~150 -Qbert 4068 ~1000 -SpaceInvaders 719 ~600 -============= ================================== ==================================== - -**Scalability:** - -============= =============================== ================================= - Atari env RLlib IMPALA 32-workers @1 hour Mnih et al A3C 16-workers @1 hour -============= =============================== ================================= -BeamRider 3181 ~1000 -Breakout 538 ~10 -Qbert 10850 ~500 -SpaceInvaders 843 ~300 -============= =============================== ================================= - -.. figure:: images/impala.png - - Multi-GPU IMPALA scales up to solve PongNoFrameskip-v4 in ~3 minutes using a pair of V100 GPUs and 128 CPU workers. - The maximum training throughput reached is ~30k transitions per second (~120k environment frames per second). - -**IMPALA-specific configs** (see also `common configs `__): - -.. literalinclude:: ../../../rllib/algorithms/impala/impala.py +.. literalinclude:: ../../../rllib/algorithms/crr/crr.py :language: python :start-after: __sphinx_doc_begin__ :end-before: __sphinx_doc_end__ + +.. _cql: + +Conservative Q-Learning (CQL) +----------------------------- +|pytorch| |tensorflow| +`[paper] `__ `[implementation] `__ + +In offline RL, the algorithm has no access to an environment, but can only sample from a fixed dataset of pre-collected state-action-reward tuples. 
+In particular, CQL (Conservative Q-Learning) is an offline RL algorithm that mitigates the overestimation of Q-values outside the dataset distribution via +conservative critic estimates. It does so by adding a simple Q regularizer loss to the standard Bellman update loss. +This ensures that the critic does not output overly-optimistic Q-values. This conservative +correction term can be added on top of any off-policy Q-learning algorithm (here, we provide this for SAC). + +RLlib's CQL is evaluated against the Behavior Cloning (BC) benchmark at 500K gradient steps over the dataset. The only difference between the BC- and CQL configs is the ``bc_iters`` parameter in CQL, indicating how many gradient steps we perform over the BC loss. CQL is evaluated on the `D4RL `__ benchmark, which has pre-collected offline datasets for many types of environments. + +Tuned examples: `HalfCheetah Random `__, `Hopper Random `__ + +**CQL-specific configs** (see also `common configs `__): + +.. literalinclude:: ../../../rllib/algorithms/cql/cql.py + :language: python + :start-after: __sphinx_doc_begin__ + :end-before: __sphinx_doc_end__ + +.. _marwil: + +Monotonic Advantage Re-Weighted Imitation Learning (MARWIL) +----------------------------------------------------------- +|pytorch| |tensorflow| +`[paper] `__ +`[implementation] `__ + +MARWIL is a hybrid imitation learning and policy gradient algorithm suitable for training on batched historical data. +When the ``beta`` hyperparameter is set to zero, the MARWIL objective reduces to vanilla imitation learning (see `BC`_). +MARWIL requires the `offline datasets API `__ to be used. + +Tuned examples: `CartPole-v0 `__ + +**MARWIL-specific configs** (see also `common configs `__): + +.. literalinclude:: ../../../rllib/algorithms/marwil/marwil.py + :language: python + :start-after: __sphinx_doc_begin__ + :end-before: __sphinx_doc_end__ + +Model-free On-policy RL +~~~~~~~~~~~~~~~~~~~~~~~ .. _appo: @@ -227,193 +220,6 @@ Tuned examples: `CartPole-v0 `__ `[implementation] `__ -A2C scales to 16-32+ worker processes depending on the environment and supports microbatching -(i.e., gradient accumulation), which can be enabled by setting the ``microbatch_size`` config. -Microbatching allows for training with a ``train_batch_size`` much larger than GPU memory. - -.. figure:: images/a2c-arch.svg - - A2C architecture - -Tuned examples: `Atari environments `__ - -.. tip:: - Consider using `IMPALA <#importance-weighted-actor-learner-architecture-impala>`__ for faster training with similar timestep efficiency. - -**Atari results @10M steps**: `more details `__ - -============= ======================== ============================== -Atari env RLlib A2C 5-workers Mnih et al A3C 16-workers -============= ======================== ============================== -BeamRider 1401 ~3000 -Breakout 374 ~150 -Qbert 3620 ~1000 -SpaceInvaders 692 ~600 -============= ======================== ============================== - -**A2C-specific configs** (see also `common configs `__): - -.. literalinclude:: ../../../rllib/algorithms/a2c/a2c.py - :language: python - :start-after: __sphinx_doc_begin__ - :end-before: __sphinx_doc_end__ - - -.. 
_a3c: - -Asynchronous Advantage Actor-Critic (A3C) ------------------------------------------ -|pytorch| |tensorflow| -`[paper] `__ `[implementation] `__ -A3C is the asynchronous version of A2C, where gradients are computed on the workers directly after trajectory rollouts, -and only then shipped to a central learner to accumulate these gradients on the central model. After the central model update, parameters are broadcast back to -all workers. -Similar to A2C, A3C scales to 16-32+ worker processes depending on the environment. - -Tuned examples: `PongDeterministic-v4 `__ - -.. tip:: - Consider using `IMPALA <#importance-weighted-actor-learner-architecture-impala>`__ for faster training with similar timestep efficiency. - -**A3C-specific configs** (see also `common configs `__): - -.. literalinclude:: ../../../rllib/algorithms/a3c/a3c.py - :language: python - :start-after: __sphinx_doc_begin__ - :end-before: __sphinx_doc_end__ - - -.. _ddpg: - -Deep Deterministic Policy Gradients (DDPG) ------------------------------------------- -|pytorch| |tensorflow| -`[paper] `__ -`[implementation] `__ -DDPG is implemented similarly to DQN (below). The algorithm can be scaled by increasing the number of workers or using Ape-X. -The improvements from `TD3 `__ are available as ``TD3``. - -.. figure:: images/dqn-arch.svg - - DDPG architecture (same as DQN) - -Tuned examples: `Pendulum-v1 `__, `MountainCarContinuous-v0 `__, `HalfCheetah-v2 `__. - -**DDPG-specific configs** (see also `common configs `__): - -.. literalinclude:: ../../../rllib/algorithms/ddpg/ddpg.py - :language: python - :start-after: __sphinx_doc_begin__ - :end-before: __sphinx_doc_end__ - - -.. _td3: - -Twin Delayed DDPG (TD3) ------------------------ -|pytorch| |tensorflow| -`[paper] `__ -`[implementation] `__ -TD3 represents an improvement over DDPG. Its implementation is available in RLlib as `TD3 `__. - -Tuned examples: `TD3 Pendulum-v1 `__, `TD3 InvertedPendulum-v2 `__, `TD3 Mujoco suite (Ant-v2, HalfCheetah-v2, Hopper-v2, Walker2d-v2) `__. - -**TD3-specific configs** (see also `common configs `__): - -.. literalinclude:: ../../../rllib/algorithms/td3/td3.py - :language: python - :start-after: __sphinx_doc_begin__ - :end-before: __sphinx_doc_end__ - - -.. _dqn: - -Deep Q Networks (DQN, Rainbow, Parametric DQN) ----------------------------------------------- -|pytorch| |tensorflow| -`[paper] `__ `[implementation] `__ -DQN can be scaled by increasing the number of workers or using Ape-X. Memory usage is reduced by compressing samples in the replay buffer with LZ4. All of the DQN improvements evaluated in `Rainbow `__ are available, though not all are enabled by default. See also how to use `parametric-actions in DQN `__. - -.. figure:: images/dqn-arch.svg - - DQN architecture - -Tuned examples: `PongDeterministic-v4 `__, `Rainbow configuration `__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 `__, `with Dueling and Double-Q `__, `with Distributional DQN `__. - -.. tip:: - Consider using `Ape-X <#distributed-prioritized-experience-replay-ape-x>`__ for faster training with similar timestep efficiency. - -.. hint:: - For a complete `rainbow `__ setup, - make the following changes to the default DQN config: - ``"n_step": [between 1 and 10], - "noisy": True, - "num_atoms": [more than 1], - "v_min": -10.0, - "v_max": 10.0`` - (set ``v_min`` and ``v_max`` according to your expected range of returns). 
- -**Atari results @10M steps**: `more details `__ - -============= ======================== ============================= ============================== =============================== - Atari env RLlib DQN RLlib Dueling DDQN RLlib Dist. DQN Hessel et al. DQN -============= ======================== ============================= ============================== =============================== -BeamRider 2869 1910 4447 ~2000 -Breakout 287 312 410 ~150 -Qbert 3921 7968 15780 ~4000 -SpaceInvaders 650 1001 1025 ~500 -============= ======================== ============================= ============================== =============================== - -**DQN-specific configs** (see also `common configs `__): - -.. literalinclude:: ../../../rllib/algorithms/dqn/dqn.py - :language: python - :start-after: __sphinx_doc_begin__ - :end-before: __sphinx_doc_end__ - - -.. _r2d2: - -Recurrent Replay Distributed DQN (R2D2) ---------------------------------------- -|pytorch| |tensorflow| -`[paper] `__ `[implementation] `__ -R2D2 can be scaled by increasing the number of workers. All of the DQN improvements evaluated in `Rainbow `__ are available, though not all are enabled by default. - -Tuned examples: `CartPole-v0 `__ - - -.. _pg: - -Policy Gradients ----------------- -|pytorch| |tensorflow| -`[paper] `__ -`[implementation] `__ -We include a vanilla policy gradients implementation as an example algorithm. - -.. figure:: images/a2c-arch.svg - - Policy gradients architecture (same as A2C) - -Tuned examples: `CartPole-v0 `__ - -**PG-specific configs** (see also `common configs `__): - -.. literalinclude:: ../../../rllib/algorithms/pg/pg.py - :language: python - :start-after: __sphinx_doc_begin__ - :end-before: __sphinx_doc_end__ - .. _ppo: Proximal Policy Optimization (PPO) @@ -474,6 +280,305 @@ HalfCheetah 9664 ~7700 :start-after: __sphinx_doc_begin__ :end-before: __sphinx_doc_end__ +.. _impala: + +Importance Weighted Actor-Learner Architecture (IMPALA) +------------------------------------------------------- +|pytorch| |tensorflow| +`[paper] `__ +`[implementation] `__ +In IMPALA, a central learner runs SGD in a tight loop while asynchronously pulling sample batches from many actor processes. RLlib's IMPALA implementation uses DeepMind's reference `V-trace code `__. Note that we do not provide a deep residual network out of the box, but one can be plugged in as a `custom model `__. Multiple learner GPUs and experience replay are also supported. + +.. 
figure:: images/impala-arch.svg + + IMPALA architecture + +Tuned examples: `PongNoFrameskip-v4 `__, `vectorized configuration `__, `multi-gpu configuration `__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 `__ + +**Atari results @10M steps**: `more details `__ + +============= ================================== ==================================== + Atari env RLlib IMPALA 32-workers Mnih et al A3C 16-workers +============= ================================== ==================================== +BeamRider 2071 ~3000 +Breakout 385 ~150 +Qbert 4068 ~1000 +SpaceInvaders 719 ~600 +============= ================================== ==================================== + +**Scalability:** + +============= =============================== ================================= + Atari env RLlib IMPALA 32-workers @1 hour Mnih et al A3C 16-workers @1 hour +============= =============================== ================================= +BeamRider 3181 ~1000 +Breakout 538 ~10 +Qbert 10850 ~500 +SpaceInvaders 843 ~300 +============= =============================== ================================= + +.. figure:: images/impala.png + + Multi-GPU IMPALA scales up to solve PongNoFrameskip-v4 in ~3 minutes using a pair of V100 GPUs and 128 CPU workers. + The maximum training throughput reached is ~30k transitions per second (~120k environment frames per second). + +**IMPALA-specific configs** (see also `common configs `__): + +.. literalinclude:: ../../../rllib/algorithms/impala/impala.py + :language: python + :start-after: __sphinx_doc_begin__ + :end-before: __sphinx_doc_end__ + +.. _a2c: + +Advantage Actor-Critic (A2C) +---------------------------- +|pytorch| |tensorflow| +`[paper] `__ `[implementation] `__ +A2C scales to 16-32+ worker processes depending on the environment and supports microbatching +(i.e., gradient accumulation), which can be enabled by setting the ``microbatch_size`` config. +Microbatching allows for training with a ``train_batch_size`` much larger than GPU memory. + +.. figure:: images/a2c-arch.svg + + A2C architecture + +Tuned examples: `Atari environments `__ + +.. tip:: + Consider using `IMPALA <#importance-weighted-actor-learner-architecture-impala>`__ for faster training with similar timestep efficiency. + +**Atari results @10M steps**: `more details `__ + +============= ======================== ============================== +Atari env RLlib A2C 5-workers Mnih et al A3C 16-workers +============= ======================== ============================== +BeamRider 1401 ~3000 +Breakout 374 ~150 +Qbert 3620 ~1000 +SpaceInvaders 692 ~600 +============= ======================== ============================== + +**A2C-specific configs** (see also `common configs `__): + +.. literalinclude:: ../../../rllib/algorithms/a2c/a2c.py + :language: python + :start-after: __sphinx_doc_begin__ + :end-before: __sphinx_doc_end__ + + +.. _a3c: + +Asynchronous Advantage Actor-Critic (A3C) +----------------------------------------- +|pytorch| |tensorflow| +`[paper] `__ `[implementation] `__ +A3C is the asynchronous version of A2C, where gradients are computed on the workers directly after trajectory rollouts, +and only then shipped to a central learner to accumulate these gradients on the central model. After the central model update, parameters are broadcast back to +all workers. +Similar to A2C, A3C scales to 16-32+ worker processes depending on the environment. + +Tuned examples: `PongDeterministic-v4 `__ + +.. 
tip:: + Consider using `IMPALA <#importance-weighted-actor-learner-architecture-impala>`__ for faster training with similar timestep efficiency. + +**A3C-specific configs** (see also `common configs `__): + +.. literalinclude:: ../../../rllib/algorithms/a3c/a3c.py + :language: python + :start-after: __sphinx_doc_begin__ + :end-before: __sphinx_doc_end__ + +.. _pg: + +Policy Gradients (PG) +--------------------- +|pytorch| |tensorflow| +`[paper] `__ +`[implementation] `__ +We include a vanilla policy gradients implementation as an example algorithm. + +.. figure:: images/a2c-arch.svg + + Policy gradients architecture (same as A2C) + +Tuned examples: `CartPole-v0 `__ + +**PG-specific configs** (see also `common configs `__): + +.. literalinclude:: ../../../rllib/algorithms/pg/pg.py + :language: python + :start-after: __sphinx_doc_begin__ + :end-before: __sphinx_doc_end__ + +.. _maml: + +Model-Agnostic Meta-Learning (MAML) +----------------------------------- +|pytorch| |tensorflow| +`[paper] `__ `[implementation] `__ + +RLlib's MAML implementation is a meta-learning method for learning and quick adaptation across different tasks for continuous control. Code here is adapted from https://github.com/jonasrothfuss, which outperforms vanilla MAML and avoids computation of the higher order gradients during the meta-update step. MAML is evaluated on custom environments that are described in greater detail `here `__. + +MAML uses additional metrics to measure performance; ``episode_reward_mean`` measures the agent's returns before adaptation, ``episode_reward_mean_adapt_N`` measures the agent's returns after N gradient steps of inner adaptation, and ``adaptation_delta`` measures the difference in performance before and after adaptation. Examples can be seen `here `__. + +Tuned examples: HalfCheetahRandDirecEnv (`Env `__, `Config `__), AntRandGoalEnv (`Env `__, `Config `__), PendulumMassEnv (`Env `__, `Config `__) + +**MAML-specific configs** (see also `common configs `__): + +.. literalinclude:: ../../../rllib/algorithms/maml/maml.py + :language: python + :start-after: __sphinx_doc_begin__ + :end-before: __sphinx_doc_end__ + +Model-free Off-policy RL +~~~~~~~~~~~~~~~~~~~~~~~~ + +.. _apex: + +Distributed Prioritized Experience Replay (Ape-X) +------------------------------------------------- +|pytorch| |tensorflow| +`[paper] `__ +`[implementation] `__ +Ape-X variations of DQN and DDPG (`APEX_DQN `__, `APEX_DDPG `__) use a single GPU learner and many CPU workers for experience collection. Experience collection can scale to hundreds of CPU workers due to the distributed prioritization of experience prior to storage in replay buffers. + +.. figure:: images/apex-arch.svg + + Ape-X architecture + +Tuned examples: `PongNoFrameskip-v4 `__, `Pendulum-v1 `__, `MountainCarContinuous-v0 `__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 `__. 
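+
+As a quick orientation, the following is a minimal, untuned sketch of how Ape-X DQN might be set up programmatically through its config class. The class and method names used here (``ApexDQNConfig``, ``rollouts()``, ``resources()``) are assumed from a recent RLlib master and may differ in your installed version; for benchmark-quality settings, use the tuned examples above.
+
+.. code-block:: python
+
+    from ray.rllib.algorithms.apex_dqn import ApexDQNConfig
+
+    config = (
+        ApexDQNConfig()
+        .environment(env="PongNoFrameskip-v4")
+        # Experience collection scales with the number of CPU rollout workers.
+        .rollouts(num_rollout_workers=8)
+        # A single GPU learner performs the SGD updates.
+        .resources(num_gpus=1)
+    )
+
+    algo = config.build()
+    for _ in range(3):
+        print(algo.train()["episode_reward_mean"])
+    algo.stop()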
+ +**Atari results @10M steps**: `more details `__ + +============= ================================ ======================================== + Atari env RLlib Ape-X 8-workers Mnih et al Async DQN 16-workers +============= ================================ ======================================== +BeamRider 6134 ~6000 +Breakout 123 ~50 +Qbert 15302 ~1200 +SpaceInvaders 686 ~600 +============= ================================ ======================================== + +**Scalability**: + +============= ================================ ======================================== + Atari env RLlib Ape-X 8-workers @1 hour Mnih et al Async DQN 16-workers @1 hour +============= ================================ ======================================== +BeamRider 4873 ~1000 +Breakout 77 ~10 +Qbert 4083 ~500 +SpaceInvaders 646 ~300 +============= ================================ ======================================== + +.. figure:: images/apex.png + + Ape-X using 32 workers in RLlib vs vanilla DQN (orange) and A3C (blue) on PongNoFrameskip-v4. + +**Ape-X specific configs** (see also `common configs `__): + +.. literalinclude:: ../../../rllib/algorithms/apex_dqn/apex_dqn.py + :language: python + :start-after: __sphinx_doc_begin__ + :end-before: __sphinx_doc_end__ + +.. _r2d2: + +Recurrent Replay Distributed DQN (R2D2) +--------------------------------------- +|pytorch| |tensorflow| +`[paper] `__ `[implementation] `__ +R2D2 can be scaled by increasing the number of workers. All of the DQN improvements evaluated in `Rainbow `__ are available, though not all are enabled by default. + +Tuned examples: `CartPole-v0 `__ + +.. _dqn: + +Deep Q Networks (DQN, Rainbow, Parametric DQN) +---------------------------------------------- +|pytorch| |tensorflow| +`[paper] `__ `[implementation] `__ +DQN can be scaled by increasing the number of workers or using Ape-X. Memory usage is reduced by compressing samples in the replay buffer with LZ4. All of the DQN improvements evaluated in `Rainbow `__ are available, though not all are enabled by default. See also how to use `parametric-actions in DQN `__. + +.. figure:: images/dqn-arch.svg + + DQN architecture + +Tuned examples: `PongDeterministic-v4 `__, `Rainbow configuration `__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 `__, `with Dueling and Double-Q `__, `with Distributional DQN `__. + +.. tip:: + Consider using `Ape-X <#distributed-prioritized-experience-replay-ape-x>`__ for faster training with similar timestep efficiency. + +.. hint:: + For a complete `rainbow `__ setup, + make the following changes to the default DQN config: + ``"n_step": [between 1 and 10], + "noisy": True, + "num_atoms": [more than 1], + "v_min": -10.0, + "v_max": 10.0`` + (set ``v_min`` and ``v_max`` according to your expected range of returns). + +**Atari results @10M steps**: `more details `__ + +============= ======================== ============================= ============================== =============================== + Atari env RLlib DQN RLlib Dueling DDQN RLlib Dist. DQN Hessel et al. DQN +============= ======================== ============================= ============================== =============================== +BeamRider 2869 1910 4447 ~2000 +Breakout 287 312 410 ~150 +Qbert 3921 7968 15780 ~4000 +SpaceInvaders 650 1001 1025 ~500 +============= ======================== ============================= ============================== =============================== + +**DQN-specific configs** (see also `common configs `__): + +.. 
literalinclude:: ../../../rllib/algorithms/dqn/dqn.py + :language: python + :start-after: __sphinx_doc_begin__ + :end-before: __sphinx_doc_end__ + +.. _ddpg: + +Deep Deterministic Policy Gradients (DDPG) +------------------------------------------ +|pytorch| |tensorflow| +`[paper] `__ +`[implementation] `__ +DDPG is implemented similarly to DQN (below). The algorithm can be scaled by increasing the number of workers or using Ape-X. +The improvements from `TD3 `__ are available as ``TD3``. + +.. figure:: images/dqn-arch.svg + + DDPG architecture (same as DQN) + +Tuned examples: `Pendulum-v1 `__, `MountainCarContinuous-v0 `__, `HalfCheetah-v2 `__. + +**DDPG-specific configs** (see also `common configs `__): + +.. literalinclude:: ../../../rllib/algorithms/ddpg/ddpg.py + :language: python + :start-after: __sphinx_doc_begin__ + :end-before: __sphinx_doc_end__ + +.. _td3: + +Twin Delayed DDPG (TD3) +----------------------- +|pytorch| |tensorflow| +`[paper] `__ +`[implementation] `__ +TD3 represents an improvement over DDPG. Its implementation is available in RLlib as `TD3 `__. + +Tuned examples: `TD3 Pendulum-v1 `__, `TD3 InvertedPendulum-v2 `__, `TD3 Mujoco suite (Ant-v2, HalfCheetah-v2, Hopper-v2, Walker2d-v2) `__. + +**TD3-specific configs** (see also `common configs `__): + +.. literalinclude:: ../../../rllib/algorithms/td3/td3.py + :language: python + :start-after: __sphinx_doc_begin__ + :end-before: __sphinx_doc_end__ + .. _sac: Soft Actor Critic (SAC) @@ -510,22 +615,34 @@ HalfCheetah 13000 ~15000 :start-after: __sphinx_doc_begin__ :end-before: __sphinx_doc_end__ -.. _maml: +Model-based RL +~~~~~~~~~~~~~~ -Model-Agnostic Meta-Learning (MAML) ------------------------------------ -|pytorch| |tensorflow| -`[paper] `__ `[implementation] `__ +.. _dreamer: -RLlib's MAML implementation is a meta-learning method for learning and quick adaptation across different tasks for continuous control. Code here is adapted from https://github.com/jonasrothfuss, which outperforms vanilla MAML and avoids computation of the higher order gradients during the meta-update step. MAML is evaluated on custom environments that are described in greater detail `here `__. +Dreamer +------- +|pytorch| +`[paper] `__ `[implementation] `__ -MAML uses additional metrics to measure performance; ``episode_reward_mean`` measures the agent's returns before adaptation, ``episode_reward_mean_adapt_N`` measures the agent's returns after N gradient steps of inner adaptation, and ``adaptation_delta`` measures the difference in performance before and after adaptation. Examples can be seen `here `__. +Dreamer is an image-only model-based RL method that learns by imagining trajectories in the future and is evaluated on the DeepMind Control Suite `environments `__. RLlib's Dreamer is adapted from the `official Google research repo `__. -Tuned examples: HalfCheetahRandDirecEnv (`Env `__, `Config `__), AntRandGoalEnv (`Env `__, `Config `__), PendulumMassEnv (`Env `__, `Config `__) +To visualize learning, RLlib Dreamer's imagined trajectories are logged as gifs in TensorBoard. Examples of such can be seen `here `__. -**MAML-specific configs** (see also `common configs `__): +Tuned examples: `Deepmind Control Environments `__ -.. 
literalinclude:: ../../../rllib/algorithms/maml/maml.py +**Deepmind Control results @1M steps:** `more details `__ + +============= ============== ====================== +DMC env RLlib Dreamer Danijar et al Dreamer +============= ============== ====================== +Walker-Walk 920 ~930 +Cheetah-Run 640 ~800 +============= ============== ====================== + +**Dreamer-specific configs** (see also `common configs `__): + +.. literalinclude:: ../../../rllib/algorithms/dreamer/dreamer.py :language: python :start-after: __sphinx_doc_begin__ :end-before: __sphinx_doc_end__ @@ -563,100 +680,7 @@ Hopper 620 ~650 :language: python :start-after: __sphinx_doc_begin__ :end-before: __sphinx_doc_end__ - -.. _dreamer: - -Dreamer -------- -|pytorch| -`[paper] `__ `[implementation] `__ - -Dreamer is an image-only model-based RL method that learns by imagining trajectories in the future and is evaluated on the DeepMind Control Suite `environments `__. RLlib's Dreamer is adapted from the `official Google research repo `__. - -To visualize learning, RLlib Dreamer's imagined trajectories are logged as gifs in TensorBoard. Examples of such can be seen `here `__. - -Tuned examples: `Deepmind Control Environments `__ - -**Deepmind Control results @1M steps:** `more details `__ - -============= ============== ====================== -DMC env RLlib Dreamer Danijar et al Dreamer -============= ============== ====================== -Walker-Walk 920 ~930 -Cheetah-Run 640 ~800 -============= ============== ====================== - -**Dreamer-specific configs** (see also `common configs `__): - -.. literalinclude:: ../../../rllib/algorithms/dreamer/dreamer.py - :language: python - :start-after: __sphinx_doc_begin__ - :end-before: __sphinx_doc_end__ - -.. _slateq: - -SlateQ -------- -|pytorch| -`[paper] `__ `[implementation] `__ - -SlateQ is a model-free RL method that builds on top of DQN and generates recommendation slates for recommender system environments. Since these types of environments come with large combinatorial action spaces, SlateQ mitigates this by decomposing the Q-value into single-item Q-values and solves the decomposed objective via mixing integer programming and deep learning optimization. SlateQ can be evaluated on Google's RecSim `environment `__. `An RLlib wrapper for RecSim can be found here < `__. - -RecSim environment wrapper: `Google RecSim `__ - -**SlateQ-specific configs** (see also `common configs `__): - -.. literalinclude:: ../../../rllib/algorithms/slateq/slateq.py - :language: python - :start-after: __sphinx_doc_begin__ - :end-before: __sphinx_doc_end__ - -.. _cql: - -Conservative Q-Learning (CQL) ------------------------------------ -|pytorch| |tensorflow| -`[paper] `__ `[implementation] `__ - -In offline RL, the algorithm has no access to an environment, but can only sample from a fixed dataset of pre-collected state-action-reward tuples. -In particular, CQL (Conservative Q-Learning) is an offline RL algorithm that mitigates the overestimation of Q-values outside the dataset distribution via -conservative critic estimates. It does so by adding a simple Q regularizer loss to the standard Bellman update loss. -This ensures that the critic does not output overly-optimistic Q-values. This conservative -correction term can be added on top of any off-policy Q-learning algorithm (here, we provide this for SAC). - -RLlib's CQL is evaluated against the Behavior Cloning (BC) benchmark at 500K gradient steps over the dataset. 
The only difference between the BC- and CQL configs is the ``bc_iters`` parameter in CQL, indicating how many gradient steps we perform over the BC loss. CQL is evaluated on the `D4RL `__ benchmark, which has pre-collected offline datasets for many types of environments. - -Tuned examples: `HalfCheetah Random `__, `Hopper Random `__ - -**CQL-specific configs** (see also `common configs `__): - -.. literalinclude:: ../../../rllib/algorithms/cql/cql.py - :language: python - :start-after: __sphinx_doc_begin__ - :end-before: __sphinx_doc_end__ - - -.. _crr: - -Critic Regularized Regression (CRR) ------------------------------------ -|pytorch| -`[paper] `__ `[implementation] `__ - -CRR is another offline RL algorithm based on Q-learning that can learn from an offline experience replay. -The challenge in applying existing Q-learning algorithms to offline RL lies in the overestimation of the Q-function, as well as, the lack of exploration beyond the observed data. -The latter becomes increasingly important during bootstrapping in the bellman equation, where the Q-function queried for the next state's Q-value(s) does not have support in the observed data. -To mitigate these issues, CRR implements a simple and yet powerful idea of "value-filtered regression". -The key idea is to use a learned critic to filter-out the non-promising transitions from the replay dataset. For more details, please refer to the paper (see link above). - -Tuned examples: `CartPole-v0 `__, `Pendulum-v1 `__ - -.. literalinclude:: ../../../rllib/algorithms/crr/crr.py - :language: python - :start-after: __sphinx_doc_begin__ - :end-before: __sphinx_doc_end__ - - + Derivative-free ~~~~~~~~~~~~~~~ @@ -679,8 +703,8 @@ Tuned examples: `CartPole-v0 `__ `[implementation] `__ Code here is adapted from https://github.com/openai/evolution-strategies-starter to execute in the distributed setting with Ray. @@ -701,56 +725,32 @@ Tuned examples: `Humanoid-v1 `__ -`[implementation] `__ +.. _slateq: -MARWIL is a hybrid imitation learning and policy gradient algorithm suitable for training on batched historical data. -When the ``beta`` hyperparameter is set to zero, the MARWIL objective reduces to vanilla imitation learning (see `BC`_). -MARWIL requires the `offline datasets API `__ to be used. +SlateQ +------- +|pytorch| +`[paper] `__ `[implementation] `__ -Tuned examples: `CartPole-v0 `__ +SlateQ is a model-free RL method that builds on top of DQN and generates recommendation slates for recommender system environments. Since these types of environments come with large combinatorial action spaces, SlateQ mitigates this by decomposing the Q-value into single-item Q-values and solves the decomposed objective via mixing integer programming and deep learning optimization. SlateQ can be evaluated on Google's RecSim `environment `__. `An RLlib wrapper for RecSim can be found here < `__. -**MARWIL-specific configs** (see also `common configs `__): +RecSim environment wrapper: `Google RecSim `__ -.. literalinclude:: ../../../rllib/algorithms/marwil/marwil.py +**SlateQ-specific configs** (see also `common configs `__): + +.. literalinclude:: ../../../rllib/algorithms/slateq/slateq.py :language: python :start-after: __sphinx_doc_begin__ :end-before: __sphinx_doc_end__ - -.. 
_bc: - -Behavior Cloning (BC; derived from MARWIL implementation) ---------------------------------------------------------- -|pytorch| |tensorflow| -`[paper] `__ -`[implementation] `__ - -Our behavioral cloning implementation is directly derived from our `MARWIL`_ implementation, -with the only difference being the ``beta`` parameter force-set to 0.0. This makes -BC try to match the behavior policy, which generated the offline data, disregarding any resulting rewards. -BC requires the `offline datasets API `__ to be used. - -Tuned examples: `CartPole-v0 `__ - -**BC-specific configs** (see also `common configs `__): - -.. literalinclude:: ../../../rllib/algorithms/bc/bc.py - :language: python - :start-after: __sphinx_doc_begin__ - :end-before: __sphinx_doc_end__ - - -.. _bandits: - Contextual Bandits ~~~~~~~~~~~~~~~~~~ +.. _bandits: + The Multi-armed bandit (MAB) problem provides a simplified RL setting that involves learning to act under one situation only, i.e. the context (observation/state) and arms (actions/items-to-select) are both fixed. Contextual bandit is an extension of the MAB problem, where at each @@ -794,7 +794,6 @@ Tuned examples: :start-after: __sphinx_doc_begin__ :end-before: __sphinx_doc_end__ - .. _lints: Linear Thompson Sampling (BanditLinTS) @@ -820,26 +819,17 @@ Tuned examples: :start-after: __sphinx_doc_begin__ :end-before: __sphinx_doc_end__ +Multi-agent +~~~~~~~~~~~ -.. _alphazero: +.. _parameter: -Single-Player Alpha Zero (AlphaZero) ------------------------------------- -|pytorch| -`[paper] `__ `[implementation] `__ AlphaZero is an RL agent originally designed for two-player games. This version adapts it to handle single player games. The code can be scaled to any number of workers. It also implements the ranked rewards `(R2) `__ strategy to enable self-play even in the one-player setting. The code is mainly purposed to be used for combinatorial optimization. +Parameter Sharing +----------------- -Tuned examples: `Sparse reward CartPole `__ +`[paper] `__, `[paper] `__ and `[instructions] `__. Parameter sharing refers to a class of methods that take a base single agent method, and use it to learn a single policy for all agents. This simple approach has been shown to achieve state of the art performance in cooperative games, and is usually how you should start trying to learn a multi-agent problem. -**AlphaZero-specific configs** (see also `common configs `__): - -.. literalinclude:: ../../../rllib/algorithms/alpha_zero/alpha_zero.py - :language: python - :start-after: __sphinx_doc_begin__ - :end-before: __sphinx_doc_end__ - - -Multi-Agent Methods -~~~~~~~~~~~~~~~~~~~ +Tuned examples: `PettingZoo `__, `waterworld `__, `rock-paper-scissors `__, `multi-agent cartpole `__ .. _qmix: @@ -873,23 +863,6 @@ Tuned examples: `Multi-Agent Particle Environment `__, `[paper] `__ and `[instructions] `__. Parameter sharing refers to a class of methods that take a base single agent method, and use it to learn a single policy for all agents. This simple approach has been shown to achieve state of the art performance in cooperative games, and is usually how you should start trying to learn a multi-agent problem. - -Tuned examples: `PettingZoo `__, `waterworld `__, `rock-paper-scissors `__, `multi-agent cartpole `__ - -.. _fil: - -Fully Independent Learning --------------------------- -`[instructions] `__ Fully independent learning involves a collection of agents learning independently of each other via single agent methods. 
This typically works, but can be less effective than dedicated multi-agent RL methods, since they do not account for the non-stationarity of the multi-agent environment. - -Tuned examples: `waterworld `__, `multiagent-cartpole `__ - .. _sc: Shared Critic Methods @@ -899,10 +872,25 @@ Shared Critic Methods Tuned examples: `TwoStepGame `__ +Others +~~~~~~ -Exploration-based plug-ins (can be combined with any algo) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +.. _alphazero: +Single-Player Alpha Zero (AlphaZero) +------------------------------------ +|pytorch| +`[paper] `__ `[implementation] `__ AlphaZero is an RL agent originally designed for two-player games. This version adapts it to handle single player games. The code can be scaled to any number of workers. It also implements the ranked rewards `(R2) `__ strategy to enable self-play even in the one-player setting. The code is mainly purposed to be used for combinatorial optimization. + +Tuned examples: `Sparse reward CartPole `__ + +**AlphaZero-specific configs** (see also `common configs `__): + +.. literalinclude:: ../../../rllib/algorithms/alpha_zero/alpha_zero.py + :language: python + :start-after: __sphinx_doc_begin__ + :end-before: __sphinx_doc_end__ + .. _curiosity: Curiosity (ICM: Intrinsic Curiosity Module) @@ -969,8 +957,6 @@ Intrinsic rewards for each env-step are calculated by taking the euclidian dista This allows the agent to explore areas of the environment, where the "forward" model still performs poorly (are not "understood" yet), whereas exploration to these areas will taper down after the agent has visited them often: The "forward" model will eventually get better at predicting these next latent vectors, which in turn will diminish the intrinsic rewards (decrease the euclidian distance between predicted and actual vectors). - - .. _re3: RE3 (Random Encoders for Efficient Exploration) @@ -1031,6 +1017,14 @@ using the state entropy as "intrinsic rewards". This exploration objective can be used with both model-free and model-based RL algorithms. RE3 uses a randomly initialized encoder to get the state’s latent representation, thus taking away the complexity of training the representation learning method. The encoder weights are fixed during the entire duration of the training process. +.. _fil: + +Fully Independent Learning +-------------------------- +`[instructions] `__ Fully independent learning involves a collection of agents learning independently of each other via single agent methods. This typically works, but can be less effective than dedicated multi-agent RL methods, since they do not account for the non-stationarity of the multi-agent environment. + +Tuned examples: `waterworld `__, `multiagent-cartpole `__ + .. |tensorflow| image:: images/tensorflow.png :class: inline-figure @@ -1038,6 +1032,4 @@ RE3 uses a randomly initialized encoder to get the state’s latent representati .. |pytorch| image:: images/pytorch.png :class: inline-figure - :width: 24 - -.. 
include:: /_includes/rllib/announcement_bottom.rst + :width: 24 \ No newline at end of file diff --git a/rllib/README.rst b/rllib/README.rst index a4603aacd..f557b5f48 100644 --- a/rllib/README.rst +++ b/rllib/README.rst @@ -66,7 +66,7 @@ Offline RL: - `Importance Sampling and Weighted Importance Sampling (OPE) `__ - `Monotonic Advantage Re-Weighted Imitation Learning (MARWIL) `__ -Model-free On-policy RL (for Games): +Model-free On-policy RL: - `Synchronous Proximal Policy Optimization (APPO) `__ - `Decentralized Distributed Proximal Policy Optimization (DD-PPO) `__ @@ -105,7 +105,6 @@ Bandits: Multi-agent: -- `Single-Player Alpha Zero (AlphaZero) `__ - `Parameter Sharing `__ - `QMIX Monotonic Value Factorisation (QMIX, VDN, IQN)) `__ - `Multi-Agent Deep Deterministic Policy Gradient (MADDPG) `__ @@ -113,6 +112,7 @@ Multi-agent: Others: +- `Single-Player Alpha Zero (AlphaZero) `__ - `Curiosity (ICM: Intrinsic Curiosity Module) `__ - `Random encoders (contrib/RE3) `__ - `Fully Independent Learning `__