Proximal Policy Optimization (PPO)

Overview

PPO is a model-free, on-policy RL algorithm that works well for both discrete and continuous action space environments. PPO utilizes an actor-critic framework consisting of two networks: an actor (the policy network) and a critic (the value-function network).

There are two formulations of PPO, both of which are implemented in RLlib. The first formulation follows its predecessor, TRPO, but without the complexity of second-order optimization: at every iteration, an old version of the actor network is saved, and the agent optimizes the RL objective while staying close to this old policy. This keeps training from destabilizing. The second formulation mitigates destructively large policy updates, an issue observed with vanilla policy gradient methods, by introducing a clipped surrogate objective that limits the action-probability ratio between the current and old policy. Clipping was shown in the paper to significantly improve training stability and speed.
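To make the clipping concrete, here is a minimal NumPy sketch of the per-timestep clipped surrogate loss. The argument names (logp_new, logp_old, advantages) are placeholders for illustration; RLlib's actual loss functions live in ppo_tf_policy.py and ppo_torch_policy.py and additionally include value-function and entropy terms.

```python
import numpy as np

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_param=0.3):
    """PPO clipped surrogate loss (to be minimized); inputs are 1-D arrays."""
    # Probability ratio between current and old policy: r_t = pi_new(a|s) / pi_old(a|s).
    ratio = np.exp(logp_new - logp_old)
    # Clip the ratio to [1 - eps, 1 + eps] and take the pessimistic (minimum) objective.
    clipped_ratio = np.clip(ratio, 1.0 - clip_param, 1.0 + clip_param)
    surrogate = np.minimum(ratio * advantages, clipped_ratio * advantages)
    # Negate, since maximizing the surrogate objective == minimizing this loss.
    return -np.mean(surrogate)
```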

Distributed PPO Algorithms

PPO is a core algorithm in RLlib due to its ability to scale well with the number of nodes. In RLlib, we provide various implementations of distributed PPO, with different underlying execution plans, as shown below.

Distributed baseline PPO is a synchronous distributed RL algorithm. Data collection nodes, which run the old policy, gather data synchronously to create a large pool of on-policy data on which the agent then performs minibatch gradient descent.
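As a rough sketch of how this looks in practice, the snippet below constructs a synchronous PPO trainer with several rollout workers. The environment name and config values are illustrative assumptions, not recommended settings; see the defaults in ppo.py.

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

config = {
    "num_workers": 4,          # parallel rollout workers collecting on-policy data
    "train_batch_size": 4000,  # samples gathered per training iteration
    "sgd_minibatch_size": 128, # minibatch size for SGD over the collected batch
    "num_sgd_iter": 30,        # SGD passes over each train batch
}
trainer = PPOTrainer(env="CartPole-v0", config=config)
for _ in range(3):
    print(trainer.train()["episode_reward_mean"])
```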

On the other hand, Asynchronous PPO (APPO) imitates IMPALA in its distributed execution plan. Data collection nodes gather data asynchronously, and the samples are stored in a circular replay buffer. A target network and a doubly importance-sampled surrogate objective are introduced to maintain training stability in this asynchronous data-collection setting.
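A hedged configuration sketch for APPO follows; the specific values are illustrative, and the available options and their defaults are defined in appo.py.

```python
from ray.rllib.agents.ppo.appo import APPOTrainer

config = {
    "num_workers": 8,      # asynchronous rollout workers feeding the learner
    "vtrace": True,        # importance-weighted off-policy correction, as in IMPALA
    "use_kl_loss": False,  # rely on clipping rather than a KL penalty
}
trainer = APPOTrainer(env="CartPole-v0", config=config)
```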

Lastly, Decentralized Distributed PPO (DDPPO) removes the assumption that gradient updates must be performed on a central node. Instead, gradients are computed remotely on each data collection node and all-reduced at each minibatch using torch.distributed. This allows each worker's GPU to be used both for sampling and for training.
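The sketch below shows one way a DD-PPO trainer might be configured, assuming one GPU per rollout worker; the values are illustrative only, and ddppo.py documents the actual constraints (e.g., DD-PPO is PyTorch-only and the driver does not need a GPU).

```python
from ray.rllib.agents.ppo.ddppo import DDPPOTrainer

config = {
    "framework": "torch",      # DD-PPO is implemented for PyTorch only
    "num_workers": 4,          # each worker both samples and trains
    "num_gpus_per_worker": 1,  # gradients are computed on the workers' GPUs
    "num_gpus": 0,             # the driver performs no learning and needs no GPU
}
trainer = DDPPOTrainer(env="CartPole-v0", config=config)
```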

Documentation & Implementation:

  1. Proximal Policy Optimization (PPO).

    Detailed Documentation

    Implementation

  2. Asynchronous Proximal Policy Optimization (APPO).

    Detailed Documentation

    Implementation

  3. Decentralized Distributed Proximal Policy Optimization (DDPPO).

    Detailed Documentation

    Implementation