diff --git a/doc/source/a2c-arch.svg b/doc/source/a2c-arch.svg
new file mode 100644
index 000000000..65c662d88
--- /dev/null
+++ b/doc/source/a2c-arch.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/doc/source/apex-arch.svg b/doc/source/apex-arch.svg
new file mode 100644
index 000000000..ca9650bd7
--- /dev/null
+++ b/doc/source/apex-arch.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/doc/source/dqn-arch.svg b/doc/source/dqn-arch.svg
new file mode 100644
index 000000000..fa1d2e5cd
--- /dev/null
+++ b/doc/source/dqn-arch.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/doc/source/impala-arch.svg b/doc/source/impala-arch.svg
new file mode 100644
index 000000000..c519670a5
--- /dev/null
+++ b/doc/source/impala-arch.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/doc/source/ppo-arch.svg b/doc/source/ppo-arch.svg
new file mode 100644
index 000000000..7e0191ad1
--- /dev/null
+++ b/doc/source/ppo-arch.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/doc/source/rllib-algorithms.rst b/doc/source/rllib-algorithms.rst
index 26e29bd92..d00e59b5b 100644
--- a/doc/source/rllib-algorithms.rst
+++ b/doc/source/rllib-algorithms.rst
@@ -10,6 +10,10 @@ Distributed Prioritized Experience Replay (Ape-X)
`[implementation] `__
Ape-X variations of DQN, DDPG, and QMIX (`APEX_DQN `__, `APEX_DDPG `__, `APEX_QMIX `__) use a single GPU learner and many CPU workers for experience collection. Experience collection can scale to hundreds of CPU workers due to the distributed prioritization of experience prior to storage in replay buffers.
+.. figure:: apex-arch.svg
+
+ Ape-X architecture
+
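As a rough sketch, an Ape-X DQN run can be launched through Tune as follows; the config keys shown are illustrative and may differ slightly between RLlib versions.

.. code-block:: python

    import ray
    from ray import tune

    ray.init()

    # Single GPU learner plus many CPU workers collecting prioritized experience.
    tune.run(
        "APEX",  # Ape-X variant of DQN
        config={
            "env": "PongNoFrameskip-v4",
            "num_workers": 32,  # CPU workers for experience collection
            "num_gpus": 1,      # the central GPU learner
        },
    )
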
Tuned examples: `PongNoFrameskip-v4 `__, `Pendulum-v0 `__, `MountainCarContinuous-v0 `__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 `__.
**Atari results @10M steps**: `more details `__
@@ -52,6 +56,10 @@ Importance Weighted Actor-Learner Architecture (IMPALA)
`[implementation] `__
In IMPALA, a central learner runs SGD in a tight loop while asynchronously pulling sample batches from many actor processes. RLlib's IMPALA implementation uses DeepMind's reference `V-trace code `__. Note that we do not provide a deep residual network out of the box, but one can be plugged in as a `custom model `__. Multiple learner GPUs and experience replay are also supported.
+.. figure:: impala-arch.svg
+
+ IMPALA architecture
+
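A multi-GPU IMPALA run might be sketched as follows; the ``replay_proportion`` key assumes the optional experience replay support mentioned above, and exact key names may vary by version.

.. code-block:: python

    from ray import tune

    # Central learner pulls sample batches from asynchronous actor processes.
    tune.run(
        "IMPALA",
        config={
            "env": "PongNoFrameskip-v4",
            "num_workers": 32,         # asynchronous actor (sampling) processes
            "num_gpus": 2,             # multiple learner GPUs, if available
            "replay_proportion": 0.5,  # illustrative: mix in replayed experience
        },
    )
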
Tuned examples: `PongNoFrameskip-v4 `__, `vectorized configuration `__, `multi-gpu configuration `__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 `__
**Atari results @10M steps**: `more details `__
@@ -97,6 +105,10 @@ We include an asynchronous variant of Proximal Policy Optimization (PPO) based o
APPO is not always more efficient; it is often better to simply use `PPO `__ or `IMPALA `__.
+.. figure:: impala-arch.svg
+
+ APPO architecture (same as IMPALA)
+
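As a sketch, APPO is launched the same way as IMPALA; the ``vtrace`` flag below assumes a config switch for the V-trace corrections and may differ by version.

.. code-block:: python

    from ray import tune

    # Asynchronous PPO: IMPALA-style sampling with a PPO surrogate objective.
    tune.run(
        "APPO",
        config={
            "env": "PongNoFrameskip-v4",
            "num_workers": 16,
            "vtrace": True,  # illustrative: enable V-trace off-policy corrections
        },
    )
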
Tuned examples: `PongNoFrameskip-v4 `__
**APPO-specific configs** (see also `common configs `__):
@@ -114,6 +126,10 @@ Advantage Actor-Critic (A2C, A3C)
`[paper] `__ `[implementation] `__
RLlib implements A2C and A3C using SyncSamplesOptimizer and AsyncGradientsOptimizer, respectively, for policy optimization. These algorithms scale up to 16-32 worker processes, depending on the environment. Both a TensorFlow (LSTM) and a PyTorch version are available.
+.. figure:: a2c-arch.svg
+
+ A2C architecture
+
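Either variant can be selected by its registered name, for example (illustrative sketch):

.. code-block:: python

    from ray import tune

    # "A2C" uses synchronous sample collection; "A3C" applies gradients asynchronously.
    tune.run(
        "A2C",
        config={
            "env": "PongDeterministic-v4",
            "num_workers": 16,  # scales to roughly 16-32 workers per the note above
        },
    )
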
Tuned examples: `PongDeterministic-v4 `__, `PyTorch version `__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 `__
.. tip::
@@ -140,7 +156,11 @@ SpaceInvaders 692 ~600
Deep Deterministic Policy Gradients (DDPG, TD3)
-----------------------------------------------
`[paper] `__ `[implementation] `__
-DDPG is implemented similarly to DQN (below). The algorithm can be scaled by increasing the number of workers, switching to AsyncGradientsOptimizer, or using Ape-X. The improvements from `TD3 `__ are available though not enabled by default.
+DDPG is implemented similarly to DQN (below). The algorithm can be scaled by increasing the number of workers, switching to AsyncGradientsOptimizer, or using Ape-X. The improvements from `TD3 `__ are available as the separate ``TD3`` trainer.
+
+.. figure:: dqn-arch.svg
+
+ DDPG architecture (same as DQN)
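
A minimal, illustrative way to run the TD3 variant:

.. code-block:: python

    from ray import tune

    # TD3 is registered separately from plain DDPG.
    tune.run(
        "TD3",
        config={
            "env": "Pendulum-v0",
            "num_workers": 0,  # scale by adding workers or switching to Ape-X
        },
    )
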
Tuned examples: `Pendulum-v0 `__, `MountainCarContinuous-v0 `__, `HalfCheetah-v2 `__, `TD3 Pendulum-v0 `__, `TD3 InvertedPendulum-v2 `__, `TD3 Mujoco suite (Ant-v2, HalfCheetah-v2, Hopper-v2, Walker2d-v2) `__.
@@ -156,6 +176,10 @@ Deep Q Networks (DQN, Rainbow, Parametric DQN)
`[paper] `__ `[implementation] `__
RLlib DQN is implemented using the SyncReplayOptimizer. The algorithm can be scaled by increasing the number of workers, using the AsyncGradientsOptimizer for async DQN, or using Ape-X. Memory usage is reduced by compressing samples in the replay buffer with LZ4. All of the DQN improvements evaluated in `Rainbow `__ are available, though not all are enabled by default. See also how to use `parametric-actions in DQN `__.
+.. figure:: dqn-arch.svg
+
+ DQN architecture
+
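The sketch below shows how a few of the Rainbow components might be switched on; the key names are illustrative and may vary between RLlib versions.

.. code-block:: python

    from ray import tune

    tune.run(
        "DQN",
        config={
            "env": "PongDeterministic-v4",
            "double_q": True,  # Double DQN
            "dueling": True,   # Dueling network heads
            "num_atoms": 51,   # Distributional DQN
            "n_step": 3,       # multi-step returns
        },
    )
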
Tuned examples: `PongDeterministic-v4 `__, `Rainbow configuration `__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 `__, `with Dueling and Double-Q `__, `with Distributional DQN `__.
.. tip::
@@ -183,6 +207,10 @@ Policy Gradients
----------------
`[paper] `__ `[implementation] `__ We include a vanilla policy gradients implementation as an example algorithm in both TensorFlow and PyTorch. This is usually outperformed by PPO.
+.. figure:: ppo-arch.svg
+
+ Policy gradients architecture (same as A2C)
+
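A minimal, illustrative run:

.. code-block:: python

    from ray import tune

    # Vanilla policy gradients on a simple environment.
    tune.run(
        "PG",
        config={
            "env": "CartPole-v0",
            "num_workers": 2,
        },
    )
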
Tuned examples: `CartPole-v0 `__
**PG-specific configs** (see also `common configs `__):
@@ -197,6 +225,10 @@ Proximal Policy Optimization (PPO)
`[paper] `__ `[implementation] `__
PPO's clipped objective supports multiple SGD passes over the same batch of experiences. RLlib's multi-GPU optimizer pins that data in GPU memory to avoid unnecessary transfers from host memory, substantially improving performance over a naive implementation. RLlib's PPO scales out using multiple workers for experience collection, and also with multiple GPUs for SGD.
+.. figure:: ppo-arch.svg
+
+ PPO architecture
+
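An illustrative run that scales out sample collection with workers and SGD with a GPU (any Gym environment id can be substituted):

.. code-block:: python

    from ray import tune

    tune.run(
        "PPO",
        config={
            "env": "CartPole-v0",
            "num_workers": 16,          # parallel experience collection
            "num_gpus": 1,              # pin the train batch in GPU memory
            "num_sgd_iter": 30,         # SGD passes over each train batch
            "sgd_minibatch_size": 128,  # minibatch size within each pass
        },
    )
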
Tuned examples: `Humanoid-v1 `__, `Hopper-v1 `__, `Pendulum-v0 `__, `PongDeterministic-v4 `__, `Walker2d-v1 `__, `HalfCheetah-v2 `__, `{BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4 `__
@@ -236,6 +268,10 @@ Soft Actor Critic (SAC)
------------------------
`[paper] `__ `[implementation] `__
+.. figure:: dqn-arch.svg
+
+ SAC architecture (same as DQN)
+
RLlib's soft actor-critic implementation is ported from the `official SAC repo `__ to better integrate with RLlib APIs. Note that SAC has two fields to configure for custom models: ``policy_model`` and ``Q_model``, and it does not currently support non-continuous action distributions. It is also currently *experimental*.
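
An illustrative configuration of the two model fields mentioned above (the ``fcnet_hiddens`` sub-key assumes RLlib's standard model options):

.. code-block:: python

    from ray import tune

    tune.run(
        "SAC",
        config={
            "env": "Pendulum-v0",  # continuous actions only, per the note above
            "Q_model": {"fcnet_hiddens": [256, 256]},
            "policy_model": {"fcnet_hiddens": [256, 256]},
        },
    )
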
Tuned examples: `Pendulum-v0 `__