# ray/python/ray/rllib/agents/es/policies.py

# Code in this file is copied and adapted from
# https://github.com/openai/evolution-strategies-starter.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
import numpy as np
import tensorflow as tf

import ray
from ray.rllib.models import ModelCatalog
from ray.rllib.utils.filter import get_filter

def rollout(policy, env, timestep_limit=None, add_noise=False):
    """Do a rollout.

    If add_noise is True, the policy will take noisy actions, with
    Gaussian noise added by its compute method. Otherwise, no action
    noise will be added.
    """
    # Never run longer than the environment's own episode limit.
    env_timestep_limit = env.spec.max_episode_steps
    timestep_limit = (env_timestep_limit if timestep_limit is None
                      else min(timestep_limit, env_timestep_limit))
    rews = []
    t = 0
    observation = env.reset()
    for _ in range(timestep_limit or 999999):
        ac = policy.compute(observation, add_noise=add_noise)[0]
        observation, rew, done, _ = env.step(ac)
        rews.append(rew)
        t += 1
        if done:
            break
    rews = np.array(rews, dtype=np.float32)
    return rews, t

class GenericPolicy(object):
    def __init__(self, sess, action_space, preprocessor,
                 observation_filter, action_noise_std):
        self.sess = sess
        self.action_space = action_space
        self.action_noise_std = action_noise_std
        self.preprocessor = preprocessor
        self.observation_filter = get_filter(
            observation_filter, self.preprocessor.shape)
        self.inputs = tf.placeholder(
            tf.float32, [None] + list(self.preprocessor.shape))

        # Policy network: a deterministic action distribution over the
        # model's outputs.
        dist_class, dist_dim = ModelCatalog.get_action_dist(
            self.action_space, dist_type="deterministic")
        model = ModelCatalog.get_model(self.inputs, dist_dim)
        dist = dist_class(model.outputs)
        self.sampler = dist.sample()

        # Expose the network weights as a single flat vector, which is
        # how ES perturbs and updates the policy.
        self.variables = ray.experimental.TensorFlowVariables(
            model.outputs, self.sess)
        self.num_params = sum(
            np.prod(variable.shape.as_list())
            for _, variable in self.variables.variables.items())
        self.sess.run(tf.global_variables_initializer())

    def compute(self, observation, add_noise=False, update=True):
        # Add a batch dimension and (optionally) update the running
        # observation filter statistics.
        observation = self.preprocessor.transform(observation)
        observation = self.observation_filter(observation[None], update=update)
        action = self.sess.run(self.sampler,
                               feed_dict={self.inputs: observation})
        if add_noise and isinstance(self.action_space, gym.spaces.Box):
            # Gaussian exploration noise, only for continuous action spaces.
            action += np.random.randn(*action.shape) * self.action_noise_std
        return action

    def set_weights(self, x):
        self.variables.set_flat(x)

    def get_weights(self):
        return self.variables.get_flat()
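

# A minimal usage sketch (not part of the original file): wire a
# GenericPolicy to a Gym environment and evaluate one perturbed parameter
# vector, the basic step of an ES iteration. ModelCatalog.get_preprocessor
# taking a gym env, the "MeanStdFilter" filter name, and the "Pendulum-v0"
# env id are assumptions about the surrounding Ray/Gym versions.
if __name__ == "__main__":
    env = gym.make("Pendulum-v0")  # assumed id; any Box-action env works
    sess = tf.Session()
    preprocessor = ModelCatalog.get_preprocessor(env)  # assumed signature
    policy = GenericPolicy(sess, env.action_space, preprocessor,
                           "MeanStdFilter", action_noise_std=0.01)

    # Perturb the flat weight vector, as an ES worker would, then score it.
    theta = policy.get_weights()
    policy.set_weights(theta + 0.02 * np.random.randn(policy.num_params))
    rews, length = rollout(policy, env, timestep_limit=200, add_noise=True)
    print("episode return:", rews.sum(), "length:", length)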