
Learning to Play Pong
=====================
In this example, we'll be training a neural network to play Pong using the
OpenAI Gym. This application is adapted, with minimal modifications, from
Andrej Karpathy's
`code <https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5>`_
(see the accompanying `blog post <http://karpathy.github.io/2016/05/31/rl/>`_).
To run the application, first install some dependencies.

.. code-block:: bash

  pip install gym[atari]

Then you can run the example as follows.

.. code-block:: bash

  python ray/examples/rl_pong/driver.py

The distributed version
-----------------------

At the core of `Andrej's code
<https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5>`_, a
neural network is used to define a "policy" for playing Pong (that is, a
function that chooses an action given a state). In the main loop, the network
repeatedly plays games of Pong and records a gradient from each game. Every
ten games, the gradients are combined together and used to update the network.

This example is easy to parallelize because the network can play ten games in
parallel, and no information needs to be shared between the games.

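For reference, the structure of the serial loop can be sketched roughly as
follows. Here ``play_one_game`` and ``update_model`` are hypothetical
placeholder names standing in for the corresponding pieces of Andrej's code;
they are not functions defined in this example.

.. code-block:: python

  # Rough pseudocode for the serial training loop (placeholder names).
  gradients = []
  for game_index in range(num_games):
      # Play one full game of Pong and record the gradient from that game.
      gradient, reward_sum = play_one_game(model)
      gradients.append(gradient)
      # Every ten games, combine the gradients and update the network.
      if (game_index + 1) % 10 == 0:
          model = update_model(model, gradients)
          gradients = []
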
We define an **actor** for the Pong environment, which includes a method for
performing a rollout and computing a gradient update. Below is pseudocode for
the actor.

.. code-block:: python

  @ray.actor
  class PongEnv(object):
      def __init__(self):
          self.env = gym.make("Pong-v0")

      def compute_gradient(self, model):
          # Reset the game.
          observation = self.env.reset()
          while not done:
              # Choose an action using policy_forward.
              # Take the action and observe the new state of the world.
          # Compute a gradient using policy_backward. Return the gradient and
          # the total reward.
          return [gradient, reward_sum]

We then create a number of actors, so that we can perform rollouts in
parallel.

.. code-block:: python

  actors = [PongEnv() for _ in range(batch_size)]

By calling the ``compute_gradient`` method on each actor inside of a for loop,
we launch multiple tasks that perform rollouts and compute gradients in
parallel.

.. code-block:: python

  model_id = ray.put(model)
  actions = []
  # Launch tasks to compute gradients from multiple rollouts in parallel.
  for i in range(batch_size):
      action_id = actors[i].compute_gradient(model_id)
      actions.append(action_id)

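These calls return object IDs immediately rather than the gradients
themselves. Below is a minimal sketch of how the driver might then block on
the results and apply them; ``apply_gradients`` is a hypothetical helper, not
a function defined in this example.

.. code-block:: python

  # Block until all of the rollouts have finished and retrieve the results.
  results = ray.get(actions)
  # Each result is a [gradient, reward_sum] pair returned by compute_gradient.
  gradients = [gradient for gradient, reward_sum in results]
  # ``apply_gradients`` is a placeholder for the model update step.
  model = apply_gradients(model, gradients)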