
Decentralized Distributed Proximal Policy Optimization (DDPPO)

Overview

PPO is a model-free, on-policy RL algorithm that works well for both discrete and continuous action spaces. PPO uses an actor-critic framework with two networks: an actor (the policy network) and a critic (the value-function network).
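As a quick illustration, here is a minimal sketch of configuring and training PPO in RLlib, assuming the `PPOConfig` builder API from `ray.rllib.algorithms.ppo` (method and key names may vary slightly across RLlib versions):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(env="CartPole-v1")   # discrete actions; continuous envs work too
    .framework("torch")
    .rollouts(num_rollout_workers=2)  # parallel sample collection
    .training(train_batch_size=4000, lr=5e-5)
)
algo = config.build()
print(algo.train()["episode_reward_mean"])  # one training iteration
```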

Distributed PPO Algorithms

Distributed baseline PPO

See implementation here

Asynchronous PPO (APPO)

See implementation here

Decentralized Distributed PPO (DDPPO)

DDPPO removes the assumption that gradient updates must be done on a central node. Instead, gradients are computed remotely on each data-collection worker and all-reduced across workers at each mini-batch using torch.distributed. This allows each worker's GPU to be used both for sampling and for training.
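To make the all-reduce step concrete, here is a minimal, hypothetical sketch of averaging gradients across workers with `torch.distributed` (it assumes a process group has already been initialized, e.g. via `dist.init_process_group`; RLlib's actual implementation lives in `ddppo.py`):

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all workers after a local backward pass."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this parameter's gradient over all workers, then average.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# Per mini-batch, each worker would do roughly (names here are hypothetical):
#   loss = ppo_loss(local_batch)
#   loss.backward()
#   allreduce_gradients(model)
#   optimizer.step()
```

Because every worker applies the same averaged gradient to its local copy of the model, the policies stay in sync without a central parameter server.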

See implementation here
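For completeness, a hedged usage sketch with `DDPPOConfig`; the exact config keys (e.g. GPU settings) depend on your RLlib version:

```python
from ray.rllib.algorithms.ddppo import DDPPOConfig

config = (
    DDPPOConfig()                      # DDPPO is torch-only
    .environment(env="CartPole-v1")
    .rollouts(num_rollout_workers=2)
    # Training happens on the workers; the driver itself needs no GPU.
    .resources(num_gpus_per_worker=0)
)
algo = config.build()
print(algo.train()["episode_reward_mean"])
```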

Documentation & Implementation:

Decentralized Distributed Proximal Policy Optimization (DDPPO)

Detailed Documentation

Implementation