ray/python/ray/rllib/pg/pg_policy_graph.py

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf

import ray
from ray.rllib.models.catalog import ModelCatalog
from ray.rllib.utils.process_rollout import compute_advantages
from ray.rllib.utils.tf_policy_graph import TFPolicyGraph


class PGPolicyGraph(TFPolicyGraph):
    def __init__(self, obs_space, action_space, config):
        config = dict(ray.rllib.pg.pg.DEFAULT_CONFIG, **config)
        self.config = config

        # Setup policy: observation placeholder, action distribution, model.
        self.x = tf.placeholder(
            tf.float32, shape=[None] + list(obs_space.shape))
        dist_class, self.logit_dim = ModelCatalog.get_action_dist(
            action_space, self.config["model"])
        self.model = ModelCatalog.get_model(
            self.x, self.logit_dim, options=self.config["model"])
        self.dist = dist_class(self.model.outputs)  # logit for each action

        # Setup policy loss: the REINFORCE objective. Minimizing
        # -E[log pi(a|s) * advantage] maximizes the advantage-weighted
        # log-probability of the sampled actions.
        self.ac = ModelCatalog.get_action_placeholder(action_space)
        self.adv = tf.placeholder(tf.float32, [None], name="adv")
        self.loss = -tf.reduce_mean(self.dist.logp(self.ac) * self.adv)

        # Initialize TFPolicyGraph.
        self.sess = tf.get_default_session()
        self.loss_in = [
            ("obs", self.x),
            ("actions", self.ac),
            ("advantages", self.adv),
        ]
        self.is_training = tf.placeholder_with_default(True, ())
        TFPolicyGraph.__init__(
            self, obs_space, action_space, self.sess, obs_input=self.x,
            action_sampler=self.dist.sample(), loss=self.loss,
            loss_inputs=self.loss_in, is_training=self.is_training)
        self.sess.run(tf.global_variables_initializer())

    def postprocess_trajectory(self, sample_batch, other_agent_batches=None):
        # With use_gae=False, compute_advantages fills in plain discounted
        # returns as the advantages; 0.0 is the bootstrap value for the
        # final observation.
        return compute_advantages(
            sample_batch, 0.0, self.config["gamma"], use_gae=False)
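

# ---------------------------------------------------------------------------
# Illustrative usage sketch (not part of the original module): a minimal
# construction of PGPolicyGraph for CartPole-v0, assuming `gym` is installed
# and this file's imports resolve inside a matching Ray checkout. In normal
# use the graph is built by the PG agent/trainer; this only shows what the
# constructor expects (spaces, a config dict, and an active default TF
# session).
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    import gym
    import ray.rllib.pg.pg  # ensure DEFAULT_CONFIG's module is loaded

    env = gym.make("CartPole-v0")
    with tf.Session():
        # An empty override dict falls back to DEFAULT_CONFIG in __init__.
        policy = PGPolicyGraph(env.observation_space, env.action_space, {})
        print("Built PG policy graph with loss tensor:", policy.loss)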