Example of constructing and using a policy optimizer (see the `full example <https://github.com/ericl/baselines/blob/rllib-example/baselines/deepq/run_simple_loop.py>`__).
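The following is a minimal sketch of the pattern, assuming a hypothetical ``MyEvaluator`` class that implements the policy evaluator interface from step 1 below; the ``make()`` signature follows step 2, while the import path and environment name are assumptions:

.. code-block:: python

    import ray
    from ray.rllib.optimizers import LocalSyncOptimizer

    from my_agent import MyEvaluator  # hypothetical evaluator (see step 1)

    ray.init()

    # make() constructs one local evaluator plus num_workers remote
    # evaluator actors from the given class and constructor arguments.
    optimizer = LocalSyncOptimizer.make(
        evaluator_cls=MyEvaluator,
        evaluator_args=["CartPole-v0"],  # forwarded to MyEvaluator.__init__
        num_workers=2,
        optimizer_config={})

    for _ in range(100):
        optimizer.step()  # one round of sampling and gradient updates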
Read more about policy optimizers in this post: `Distributed Policy Optimizers for Scalable and Reproducible Deep RL <https://rise.cs.berkeley.edu/blog/distributed-policy-optimizers-for-scalable-and-reproducible-deep-rl/>`__.
Here are the steps for using an RLlib policy optimizer with an existing algorithm.
1. Implement the `Policy evaluator interface <rllib-dev.html#policy-evaluators-and-optimizers>`__ (a minimal sketch of such an evaluator appears after this list).
- Here is an example of porting a `PyTorch Rainbow implementation <https://github.com/ericl/Rainbow/blob/rllib-example/rainbow_evaluator.py>`__.
- Another example ports a `TensorFlow DQN implementation <https://github.com/ericl/baselines/blob/rllib-example/baselines/deepq/dqn_evaluator.py>`__.
2. Pick a `Policy optimizer class <https://github.com/ray-project/ray/tree/master/python/ray/rllib/optimizers>`__. The `LocalSyncOptimizer <https://github.com/ray-project/ray/blob/master/python/ray/rllib/optimizers/local_sync.py>`__ is a reasonable choice for local testing, and you can also implement your own. Policy optimizers can be constructed using their ``make`` method (e.g., ``LocalSyncOptimizer.make(evaluator_cls, evaluator_args, num_workers, optimizer_config)``), or directly by passing in a list of evaluators instantiated as Ray actors (both construction paths are sketched after this list).
- Here is code showing the `simple Policy Gradient agent <https://github.com/ray-project/ray/blob/master/python/ray/rllib/pg/pg.py>`__ using ``make()``.
- A different example shows an `A3C agent <https://github.com/ray-project/ray/blob/master/python/ray/rllib/a3c/a3c.py>`__ passing in Ray actors directly.
3. Decide how you want to drive the training loop.
- Option 1: call ``optimizer.step()`` from some existing training code (see the driving-loop sketch after this list). Training statistics can be retrieved by querying the ``optimizer.local_evaluator`` evaluator instance, or by mapping over the remote evaluators (e.g., ``ray.get([ev.some_fn.remote() for ev in optimizer.remote_evaluators])``) if you are running with multiple workers.
- Option 2: define a full RLlib `Agent class <https://github.com/ray-project/ray/blob/master/python/ray/rllib/agent.py>`__. This might be preferable if you don't have an existing training harness or want to use features provided by `Ray Tune <tune.html>`__.
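As a concrete reference for step 1, here is a rough sketch of the shape a policy evaluator takes. The method names follow the evaluator interface linked above, but treat the exact signatures as assumptions and consult the interface definition; ``MyEvaluator`` and the environment are placeholders:

.. code-block:: python

    import gym

    class MyEvaluator(object):
        """Sketch of the policy evaluator interface (assumed signatures)."""

        def __init__(self, env_name):
            self.env = gym.make(env_name)
            # ... build the policy / model for this environment ...

        def sample(self):
            """Return a batch of experience from rollouts in self.env."""

        def compute_gradients(self, samples):
            """Return gradients computed over the given batch."""

        def apply_gradients(self, grads):
            """Apply the given gradients to the local policy."""

        def get_weights(self):
            """Return the current policy weights."""

        def set_weights(self, weights):
            """Set the policy weights (used to sync remote evaluators)."""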
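And here is a sketch of the alternative construction path from step 2 together with the option-1 training loop from step 3: evaluators instantiated as Ray actors are passed to the optimizer directly. The constructor argument order shown here and the ``some_fn`` statistics method are assumptions for illustration only:

.. code-block:: python

    import ray
    from ray.rllib.optimizers import LocalSyncOptimizer

    from my_agent import MyEvaluator  # hypothetical evaluator (see step 1)

    ray.init()

    # Wrap the evaluator class as a Ray actor and create the workers.
    RemoteEvaluator = ray.remote(MyEvaluator)
    local_evaluator = MyEvaluator("CartPole-v0")
    remote_evaluators = [
        RemoteEvaluator.remote("CartPole-v0") for _ in range(2)]

    # Assumed argument order: (optimizer config, local, remote evaluators).
    optimizer = LocalSyncOptimizer({}, local_evaluator, remote_evaluators)

    for _ in range(100):
        optimizer.step()
        # Gather statistics by mapping over the remote evaluator actors;
        # some_fn is a placeholder for a method your evaluator defines.
        stats = ray.get(
            [ev.some_fn.remote() for ev in optimizer.remote_evaluators])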