
# Proximal Policy Optimization (PPO)

## Overview

PPO is a model-free, on-policy RL algorithm that works well for both discrete and continuous action-space environments. PPO uses an actor-critic framework with two networks: an actor (the policy network) and a critic (the value-function network).
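As a purely illustrative sketch (not RLlib's actual model code), the two networks could look like the following in PyTorch; the class name, layer sizes, and activations are assumptions:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Illustrative actor-critic pair; not RLlib's model class."""

    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        # Actor: maps observations to action logits (the policy network).
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, num_actions)
        )
        # Critic: maps observations to a scalar state-value estimate.
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, obs: torch.Tensor):
        return self.actor(obs), self.critic(obs).squeeze(-1)
```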

There are two formulations of PPO, both of which are implemented in RLlib. The first formulation follows the earlier TRPO paper without the complexity of second-order optimization: at every iteration, an old version of the actor network is saved, and the agent optimizes the RL objective while staying close to that old policy, which keeps training from destabilizing. The second formulation addresses destructively large policy updates, an issue observed with vanilla policy-gradient methods, by introducing a surrogate objective that clips large action-probability ratios between the current and old policy. Clipping was shown in the paper to significantly improve training stability and speed.
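A minimal sketch of the clipped surrogate objective in PyTorch, for illustration only (this is not RLlib's loss implementation; the function name and defaults are assumptions):

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective, returned as a loss to minimize.

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and old policies; advantages: advantage estimates (e.g. GAE).
    """
    # Probability ratio r_t = pi_new(a|s) / pi_old(a|s).
    ratio = torch.exp(logp_new - logp_old)
    # Clip the ratio to [1 - eps, 1 + eps] and take the pessimistic bound.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # Negate the mean so the objective can be minimized with SGD.
    return -surrogate.mean()
```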

## Distributed PPO Algorithms

PPO is a core algorithm in RLlib due to its ability to scale well with the number of nodes.

In RLlib, we provide various implementations of distributed PPO, with different underlying execution plans, as shown below:

### Distributed baseline PPO

This is the algorithm implemented in this directory: a synchronous distributed RL algorithm. Data-collection workers, each holding a copy of the old policy, gather samples synchronously to create a large pool of on-policy data, on which the agent then performs minibatch gradient descent.
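A minimal sketch of running this synchronous PPO through RLlib's `PPOConfig` builder; the exact config method and parameter names vary across RLlib versions, so treat the settings below as illustrative rather than canonical:

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Illustrative settings only; names such as num_rollout_workers,
# sgd_minibatch_size, and num_sgd_iter may differ in other RLlib versions.
config = (
    PPOConfig()
    .environment(env="CartPole-v1")
    .rollouts(num_rollout_workers=2)   # parallel, synchronous sampling
    .training(
        train_batch_size=4000,         # size of the on-policy data pool
        sgd_minibatch_size=128,        # minibatch size for gradient descent
        num_sgd_iter=30,               # SGD passes over each train batch
        clip_param=0.2,                # PPO clipping epsilon
    )
)

algo = config.build()
for _ in range(10):
    result = algo.train()
    print(result["episode_reward_mean"])
```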

### Asynchronous PPO (APPO)

See implementation here

### Decentralized Distributed PPO (DDPPO)

See implementation here

## Documentation & Implementation

Proximal Policy Optimization (PPO):

- Detailed Documentation
- Implementation