Proximal Policy Optimization (PPO)

Overview

PPO is a model-free, on-policy RL algorithm that works well for both discrete and continuous action space environments. PPO utilizes an actor-critic framework with two networks: an actor (the policy network) and a critic (the value function network).

There are two formulations of PPO, both of which are implemented in RLlib. The first formulation imitates the earlier TRPO paper without the complexity of second-order optimization: at every iteration, an old version of the actor network is saved, and the agent optimizes the RL objective while staying close to this old policy, which keeps training from destabilizing. The second formulation mitigates destructively large policy updates, an issue observed with vanilla policy gradient methods, by introducing a surrogate objective that clips large action-probability ratios between the current and old policy. Clipping has been shown in the paper to significantly improve training stability and speed.
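The clipped surrogate objective is easy to state in code. The following is a minimal sketch of the idea, not RLlib's actual loss implementation; `logp`, `logp_old`, `advantages`, and the `clip_param` default are assumed inputs for illustration:

```python
# Sketch of PPO's clipped surrogate objective (illustrative, not RLlib's loss code).
import torch


def clipped_surrogate_loss(logp, logp_old, advantages, clip_param=0.3):
    # Probability ratio between the current policy and the saved old policy.
    ratio = torch.exp(logp - logp_old)
    # Clip the ratio so a single update cannot move the policy too far from the old one.
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param)
    # PPO maximizes the pessimistic (minimum) of the two terms; negate to get a loss.
    return -torch.mean(torch.min(ratio * advantages, clipped_ratio * advantages))
```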

Distributed PPO Algorithms

PPO is a core algorithm in RLlib due to its ability to scale well with the number of nodes. In RLlib, we provide various implementations of distributed PPO, with different underlying execution plans, as shown below.

Distributed baseline PPO is a synchronous distributed RL algorithm. Data collection nodes, which represent the old policy, gather data synchronously to create a large pool of on-policy data, on which the agent then performs minibatch gradient descent.
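As an illustration, here is a minimal sketch of launching the synchronous variant against the 1.x `ray.rllib.agents.ppo` API; the environment name and config values are arbitrary assumptions, not recommended settings:

```python
# Minimal sketch: synchronous PPO with a handful of rollout workers.
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()
trainer = PPOTrainer(
    env="CartPole-v0",
    config={
        "num_workers": 4,           # parallel (synchronous) data-collection workers
        "train_batch_size": 4000,   # on-policy data pool gathered per iteration
        "sgd_minibatch_size": 128,  # minibatch size for SGD over that pool
        "num_sgd_iter": 30,         # SGD passes over the pool per iteration
    },
)
for _ in range(3):
    result = trainer.train()
    print(result["episode_reward_mean"])
```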

On the other hand, Asynchronous PPO (APPO) imitates IMPALA as its distributed execution plan. Data collection nodes gather data asynchronously, and the data is collected into a circular replay buffer. A target network and a doubly importance-sampled surrogate objective are introduced to enforce training stability in this asynchronous data-collection setting.
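A comparable sketch for APPO (again against the 1.x API; the `vtrace` and `clip_param` values below are illustrative assumptions):

```python
# Minimal sketch: asynchronous PPO (APPO) with IMPALA-style execution.
import ray
from ray.rllib.agents.ppo import APPOTrainer

ray.init()
trainer = APPOTrainer(
    env="CartPole-v0",
    config={
        "num_workers": 8,   # asynchronous data-collection workers
        "vtrace": True,     # V-trace corrected advantages (IMPALA-style)
        "clip_param": 0.4,  # surrogate-objective clipping
    },
)
print(trainer.train()["episode_reward_mean"])
```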

Lastly, Decentralized Distributed PPO (DDPPO) removes the assumption that gradient updates must be performed on a central node. Instead, gradients are computed remotely on each data-collection node and all-reduced at each minibatch using torch distributed. This allows each worker's GPU to be used both for sampling and for training.
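And a sketch for DDPPO. Because the all-reduce relies on torch distributed, DDPPO only runs with the torch framework; the worker and GPU counts below are assumptions for illustration:

```python
# Minimal sketch: decentralized distributed PPO (DDPPO).
# Each rollout worker computes gradients locally and all-reduces them.
import ray
from ray.rllib.agents.ppo import DDPPOTrainer

ray.init()
trainer = DDPPOTrainer(
    env="CartPole-v0",
    config={
        "framework": "torch",       # DDPPO requires PyTorch (torch distributed)
        "num_workers": 4,           # workers that both sample and train
        "num_gpus_per_worker": 1,   # each worker's GPU does sampling + SGD
        "sgd_minibatch_size": 128,  # per-worker minibatch size
    },
)
print(trainer.train()["episode_reward_mean"])
```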

Documentation & Implementation:

  1. Proximal Policy Optimization (PPO).

    Detailed Documentation

    Implementation

  2. Asynchronous Proximal Policy Optimization (APPO).

    Detailed Documentation

    Implementation

  3. Decentralized Distributed Proximal Policy Optimization (DDPPO).

    Detailed Documentation

    Implementation