# Learning to Play Pong

In this example, we'll be training a neural network to play Pong using the
OpenAI Gym. This application is adapted, with minimal modifications, from Andrej
Karpathy's
[code](https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5) (see
the accompanying [blog post](http://karpathy.github.io/2016/05/31/rl/)). To run
the application, first install this dependency.

- [Gym](https://gym.openai.com/)

Then from the directory `ray/examples/rl_pong/` run the following.

```
source ../../setup-env.sh
python driver.py
```

## The distributed version

At the core of [Andrej's
code](https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5), a
neural network is used to define a "policy" for playing Pong (that is, a
function that chooses an action given a state). In the loop, the network
repeatedly plays games of Pong and records a gradient from each game. Every ten
games, the gradients are combined and used to update the network.

This example is easy to parallelize because the network can play ten games in
parallel and no information needs to be shared between the games. We define a
remote function `compute_gradient`, which plays a game of Pong and returns an
estimate of the gradient. Below is a simplified pseudocode version of this
function.

```python
@ray.remote(num_return_vals=2)
def compute_gradient(model):
  # Retrieve the game environment.
  env = ray.reusables.env
  # Reset the game.
  observation = env.reset()
  while not done:
    # Choose an action using policy_forward.
    # Take the action and observe the new state of the world.
  # Compute a gradient using policy_backward. Return the gradient and reward.
  return gradient, reward_sum
```

Calling this remote function inside of a for loop, we launch multiple tasks to
perform rollouts and compute gradients. If we have at least ten worker
processes, then these tasks will all be executed in parallel.

```python
model_id = ray.put(model)
grads, reward_sums = [], []
# Launch tasks to compute gradients from multiple rollouts in parallel.
for i in range(10):
  grad_id, reward_sum_id = compute_gradient.remote(model_id)
  grads.append(grad_id)
  reward_sums.append(reward_sum_id)
```
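
To complete a training iteration, the gradients from the ten rollouts still
need to be fetched and applied to the model. Below is a minimal sketch of that
step. It assumes, as in Karpathy's code, that `model` and each gradient are
dictionaries of NumPy arrays; the averaged gradient-ascent step and the
`step_size` value are illustrative stand-ins for the RMSProp-style update used
in the original code.

```python
import numpy as np

# Fetch the results of the ten rollouts launched above. ray.get blocks until
# the corresponding remote task has finished.
gradients = [ray.get(grad_id) for grad_id in grads]
rewards = [ray.get(reward_sum_id) for reward_sum_id in reward_sums]
print("Mean reward for this batch of rollouts:", np.mean(rewards))

# Combine the gradients and take a gradient-ascent step on the policy. A plain
# average with a fixed step size is used here purely for illustration;
# Karpathy's code uses an RMSProp-style update instead.
step_size = 1e-3  # illustrative value, not taken from the example
for name in model:
  averaged_gradient = sum(g[name] for g in gradients) / len(gradients)
  model[name] += step_size * averaged_gradient

# Put the updated model back in the object store for the next batch of rollouts.
model_id = ray.put(model)
```

Repeating the launch-fetch-update cycle in a loop gives the distributed
counterpart of the serial training loop described above.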

### Reusing the Gym environment

Workers are long-running Python processes, and though we'd like to think of
workers as being stateless, sometimes it's important to have a variable that
gets shared between different tasks on the same worker (perhaps because it is
expensive to initialize the variable).

In this example, we'd like each worker to have access to a Pong environment. The
Pong environment has state that gets mutated by each task, and this state is
shared between tasks that run on the same worker, so there is some danger that
the output of the overall program will depend on which tasks are scheduled on
which workers. This can be avoided if the state of the Pong environment is reset
between tasks.

To accomplish this, the user must mark the Pong environment as a reusable
variable. This is done by providing a function for initializing the Gym
environment and storing it in `ray.reusables`.

```python
import gym

# Function for initializing the gym environment.
def env_initializer():
  return gym.make("Pong-v0")

# Create a reusable variable for the gym environment.
ray.reusables.env = ray.Reusable(env_initializer)
```

A remote task can then access `ray.reusables.env` to retrieve the variable.

By default, whenever a task uses the `ray.reusables.env` variable, the worker
that the task was scheduled on will rerun the initialization code
`env_initializer` after the task has finished so that state will not leak
between tasks.

However, sometimes the initialization code is expensive, and there may be a
faster way to reinitialize the variable (or maybe no reinitialization is needed
at all). In these cases, the user can provide a custom **reinitializer**, which
gets run after any task that uses the variable.

```python
# Function for initializing the gym environment.
def env_initializer():
  return gym.make("Pong-v0")

# Function for reinitializing the gym environment in order to guarantee that
# the state of the game is reset after each remote task.
def env_reinitializer(env):
  env.reset()
  return env

# Create a reusable variable for the gym environment.
ray.reusables.env = ray.Reusable(env_initializer, env_reinitializer)
```
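
For the case mentioned above where no reinitialization is needed at all, the
reinitializer can simply hand the same object back unchanged. The hypothetical
variant below illustrates this cheapest possible reinitializer; it is not what
the Pong example wants, since game state would then leak between tasks.

```python
# Hypothetical no-op reinitializer: return the environment exactly as the last
# task left it. This avoids any per-task work, but it is NOT suitable for Pong,
# where the game state must be reset between tasks.
def env_noop_reinitializer(env):
  return env

ray.reusables.env = ray.Reusable(env_initializer, env_noop_reinitializer)
```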