Evolution Strategies
====================

This document provides a walkthrough of the evolution strategies example.

To run the application, first install some dependencies.

.. code-block:: bash

  pip install tensorflow
  pip install gym

You can view the `code for this example`_.

.. _`code for this example`: https://github.com/ray-project/ray/tree/master/python/ray/rllib/es

The script can be run as follows. Note that the configuration is tuned to work
on the ``Humanoid-v1`` gym environment.

.. code-block:: bash

  python ray/python/ray/rllib/train.py --env=Humanoid-v1 --run=ES

To train a policy on a cluster (e.g., using 900 workers), run the following.

.. code-block:: bash

  python ray/python/ray/rllib/train.py \
      --env=Humanoid-v1 \
      --run=ES \
      --redis-address=<redis-address> \
      --config='{"num_workers": 900, "episodes_per_batch": 10000, "timesteps_per_batch": 100000}'

At the heart of this example, we define a ``Worker`` class. These workers have
a method ``do_rollouts``, which will be used to perform rollouts with randomly
perturbed policies in a given environment.

.. code-block:: python

  @ray.remote
  class Worker(object):
      def __init__(self, config, policy_params, env_name, noise):
          self.env = ...     # Initialize environment.
          self.policy = ...  # Construct policy.
          # Details omitted.

      def do_rollouts(self, params):
          perturbation = ...  # Generate a random perturbation to the policy.

          self.policy.set_weights(params + perturbation)
          # Do rollout with the perturbed policy.

          self.policy.set_weights(params - perturbation)
          # Do rollout with the perturbed policy.

          # Return the rewards.

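The rollouts themselves are elided above. For illustration, a single rollout
under the classic ``gym`` API might look like the sketch below; the
``rollout`` helper and the ``policy.act`` method are assumptions made for this
sketch, not part of the example's code.

.. code-block:: python

  def rollout(policy, env, max_steps=1000):
      # Run one episode with the (perturbed) policy and accumulate the
      # total reward. Assumes the classic gym reset/step API.
      observation = env.reset()
      total_reward = 0.0
      for _ in range(max_steps):
          action = policy.act(observation)
          observation, reward, done, _ = env.step(action)
          total_reward += reward
          if done:
              break
      return total_reward
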
In the main loop, we create a number of actors with this class.

.. code-block:: python

  workers = [Worker.remote(config, policy_params, env_name, noise_id)
             for _ in range(num_workers)]

We then enter an infinite loop in which we use the actors to perform rollouts
and use the rewards from the rollouts to update the policy.

.. code-block:: python

  while True:
      # Get the current policy weights.
      theta = policy.get_weights()
      # Put the current policy weights in the object store.
      theta_id = ray.put(theta)
      # Use the actors to do rollouts. Note that we pass in the ID of the
      # policy weights.
      rollout_ids = [worker.do_rollouts.remote(theta_id) for worker in workers]
      # Get the results of the rollouts.
      results = ray.get(rollout_ids)
      # Update the policy.
      optimizer.update(...)

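The call to ``optimizer.update(...)`` is left abstract. Conceptually,
evolution strategies turns the rollout rewards into a gradient estimate: each
perturbation direction is weighted by how much it improved the reward. A
minimal sketch of that estimate, using the mirrored rollouts from
``do_rollouts``, is shown below; the function name, ``sigma``, and
``step_size`` are illustrative assumptions rather than the example's exact
implementation.

.. code-block:: python

  import numpy as np

  def es_update(theta, perturbations, pos_rewards, neg_rewards,
                sigma=0.02, step_size=0.01):
      # Each perturbation epsilon was evaluated twice: once as
      # theta + epsilon (reward r+) and once as theta - epsilon (reward r-).
      # Weighting each direction by (r+ - r-) estimates the reward gradient.
      diffs = np.asarray(pos_rewards) - np.asarray(neg_rewards)
      grad = np.dot(diffs, perturbations) / (2 * len(diffs) * sigma)
      # Take a gradient ascent step on the policy weights.
      return theta + step_size * grad
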
In addition, note that we create a large object representing a shared block of
random noise. We then put the block in the object store so that each ``Worker``
actor can use it without creating its own copy.

.. code-block:: python

  import numpy as np

  @ray.remote
  def create_shared_noise():
      noise = np.random.randn(250000000)
      return noise

  noise_id = create_shared_noise.remote()

Recall that the ``noise_id`` argument is passed into the actor constructor.

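With the shared block in place, ``do_rollouts`` can draw its perturbations
from it rather than generating fresh noise each time, and a perturbation can
then be identified by its offset into the block alone. A minimal sketch of one
way this could work, assuming a hypothetical ``sample_perturbation`` helper
and a ``sigma`` scale (both illustrative, not the example's exact
implementation):

.. code-block:: python

  import numpy as np

  def sample_perturbation(noise, dim, sigma=0.02):
      # Pick a random offset into the shared noise block and read a
      # contiguous slice with one entry per policy parameter. Any worker
      # holding the same block can reconstruct the perturbation from the
      # offset alone.
      index = np.random.randint(0, len(noise) - dim + 1)
      return sigma * noise[index:index + dim]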