hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-05 18:11:42 -05:00

History

Jun Gong e6e10ce4cf [RLlib] Revert `41c9ef70`. (#27243 ) Why are these changes needed? Also: Add validation to make sure multi-gpu and micro-batch is not used together. Update A2C learning test to hit the microbatching branch. Minor comment updates.		2022-07-29 11:05:15 -07:00
..
tests	[RLlib] Trainer.training_iteration -> Trainer.training_step; Iterations vs reportings: Clarification of terms. (#25076 )	2022-06-10 17:09:18 +02:00
__init__.py	[RLlib] A2C + A3C move to `algorithms` folder and re-name into A2C/A3C (from ...Trainer). (#25314 )	2022-06-01 09:29:16 +02:00
a3c.py	[RLlib] Revert `41c9ef70`. (#27243 )	2022-07-29 11:05:15 -07:00
a3c_tf_policy.py	[RLlib] Fix a bunch of issues related to connectors. (#26510 )	2022-07-13 18:55:20 +02:00
a3c_torch_policy.py	[RLlib] A2C + A3C move to `algorithms` folder and re-name into A2C/A3C (from ...Trainer). (#25314 )	2022-06-01 09:29:16 +02:00
README.md	[RLlib] A2C + A3C move to `algorithms` folder and re-name into A2C/A3C (from ...Trainer). (#25314 )	2022-06-01 09:29:16 +02:00

README.md

Asynchronous Advantage Actor-Critic (A3C)

Overview

Advantage Actor-Critic proposes two distributed model-free on-policy RL algorithms, A3C and A2C. These algorithms are distributed versions of the vanilla Policy Gradient (PG) algorithm with different distributed execution patterns. The paper suggests accelerating training via scaling data collection, i.e. introducing worker nodes, which carry copies of the central node's policy network and collect data from the environment in parallel. This data is used on each worker to compute gradients. The central node applies each of these gradients and then sends updated weights back to the workers.

In A3C, the worker nodes generate data asynchronously and compute gradients from the data. These computed gradients are then sent to the central node. Note that the workers in A3C may be slightly out-of-sync with the central node due to asynchrony, which may induce biases in learning (on-policy loss function).

Documentation & Implementation of A3C:

Detailed Documentation

Implementation