Evolution Strategies
====================

This document provides a walkthrough of the evolution strategies example.

To run the application, first install some dependencies.

.. code-block:: bash

  pip install tensorflow
  pip install gym

You can view the `code for this example`_.

.. _`code for this example`: https://github.com/ray-project/ray/tree/master/python/ray/rllib/es

The script can be run as follows. Note that the configuration is tuned to work
on the ``Humanoid-v1`` gym environment.

.. code-block:: bash

  python ray/python/ray/rllib/train.py --env=Humanoid-v1 --run=ES

To train a policy on a cluster (e.g., using 900 workers), run the following.

.. code-block:: bash

  python ray/python/ray/rllib/train.py \
      --env=Humanoid-v1 \
      --run=ES \
      --redis-address=<redis-address> \
      --config='{"num_workers": 900, "episodes_per_batch": 10000, "timesteps_per_batch": 100000}'

At the heart of this example, we define a ``Worker`` class. These workers have
a method ``do_rollouts``, which will be used to perform rollouts with randomly
perturbed policies in a given environment.

.. code-block:: python

  @ray.remote
  class Worker(object):
      def __init__(self, config, policy_params, env_name, noise):
          self.env = ...     # Initialize environment.
          self.policy = ...  # Construct policy.
          # Details omitted.

      def do_rollouts(self, params):
          perturbation = ...  # Generate a random perturbation to the policy.

          self.policy.set_weights(params + perturbation)
          # Do rollout with the perturbed policy.

          self.policy.set_weights(params - perturbation)
          # Do rollout with the perturbed policy.

          # Return the rewards.

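The rollouts themselves are elided above. For illustration, a single rollout
under the classic ``gym`` API might look like the sketch below; the
``rollout`` helper and the ``policy.act`` method are assumptions made for this
sketch, not part of the example's code.

.. code-block:: python

  def rollout(policy, env, max_steps=1000):
      # Run one episode with the (perturbed) policy and accumulate the
      # total reward. Assumes the classic gym reset/step API.
      observation = env.reset()
      total_reward = 0.0
      for _ in range(max_steps):
          action = policy.act(observation)
          observation, reward, done, _ = env.step(action)
          total_reward += reward
          if done:
              break
      return total_reward
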
In the main loop, we create a number of actors with this class.

.. code-block:: python

  workers = [Worker.remote(config, policy_params, env_name, noise_id)
             for _ in range(num_workers)]

We then enter an infinite loop in which we use the actors to perform rollouts
and use the rewards from the rollouts to update the policy.

.. code-block:: python

  while True:
      # Get the current policy weights.
      theta = policy.get_weights()
      # Put the current policy weights in the object store.
      theta_id = ray.put(theta)
      # Use the actors to do rollouts. Note that we pass in the ID of the
      # policy weights.
      rollout_ids = [worker.do_rollouts.remote(theta_id) for worker in workers]
      # Get the results of the rollouts.
      results = ray.get(rollout_ids)
      # Update the policy.
      optimizer.update(...)

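The call to ``optimizer.update(...)`` is left abstract. Conceptually,
evolution strategies turns the rollout rewards into a gradient estimate: each
perturbation direction is weighted by how much it improved the reward. A
minimal sketch of that estimate, using the mirrored rollouts from
``do_rollouts``, is shown below; the function name, ``sigma``, and
``step_size`` are illustrative assumptions rather than the example's exact
implementation.

.. code-block:: python

  import numpy as np

  def es_update(theta, perturbations, pos_rewards, neg_rewards,
                sigma=0.02, step_size=0.01):
      # Each perturbation epsilon was evaluated twice: once as
      # theta + epsilon (reward r+) and once as theta - epsilon (reward r-).
      # Weighting each direction by (r+ - r-) estimates the reward gradient.
      diffs = np.asarray(pos_rewards) - np.asarray(neg_rewards)
      grad = np.dot(diffs, perturbations) / (2 * len(diffs) * sigma)
      # Take a gradient ascent step on the policy weights.
      return theta + step_size * grad
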
In addition, note that we create a large object representing a shared block of
random noise. We then put the block in the object store so that each ``Worker``
actor can use it without creating its own copy.

.. code-block:: python

  import numpy as np

  @ray.remote
  def create_shared_noise():
      noise = np.random.randn(250000000)
      return noise

  noise_id = create_shared_noise.remote()

Recall that the ``noise_id`` argument is passed into the actor constructor.

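With the shared block in place, ``do_rollouts`` can draw its perturbations
from it rather than generating fresh noise each time, and a perturbation can
then be identified by its offset into the block alone. A minimal sketch of one
way this could work, assuming a hypothetical ``sample_perturbation`` helper
and a ``sigma`` scale (both illustrative, not the example's exact
implementation):

.. code-block:: python

  import numpy as np

  def sample_perturbation(noise, dim, sigma=0.02):
      # Pick a random offset into the shared noise block and read a
      # contiguous slice with one entry per policy parameter. Any worker
      # holding the same block can reconstruct the perturbation from the
      # offset alone.
      index = np.random.randint(0, len(noise) - dim + 1)
      return sigma * noise[index:index + dim]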