ray/rllib/agents/pg/torch_pg_policy.py

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import ray
from ray.rllib.evaluation.postprocessing import compute_advantages, \
    Postprocessing
from ray.rllib.policy.sample_batch import SampleBatch
from ray.rllib.policy.torch_policy_template import build_torch_policy


def pg_torch_loss(policy, model, dist_class, train_batch):
    logits, _ = model.from_batch(train_batch)
    action_dist = dist_class(logits, model)
    log_probs = action_dist.logp(train_batch[SampleBatch.ACTIONS])
    # save the error in the policy object
    policy.pi_err = -train_batch[Postprocessing.ADVANTAGES].dot(
        log_probs.reshape(-1))
    return policy.pi_err


def postprocess_advantages(policy,
                           sample_batch,
                           other_agent_batches=None,
                           episode=None):
    return compute_advantages(
        sample_batch, 0.0, policy.config["gamma"], use_gae=False)


def pg_loss_stats(policy, train_batch):
    # the error is recorded when computing the loss
    return {"policy_loss": policy.pi_err.item()}


PGTorchPolicy = build_torch_policy(
    name="PGTorchPolicy",
    get_default_config=lambda: ray.rllib.agents.a3c.a3c.DEFAULT_CONFIG,
    loss_fn=pg_torch_loss,
    stats_fn=pg_loss_stats,
    postprocess_fn=postprocess_advantages)
[rllib] add torch pg (#3857) * add torch pg * add torch imports * added torch pg * working torch pg implementation * add pg pytorch * Update a3c.py * Update a3c.py * Update torch_policy_graph.py * Update torch_policy_graph.py 2019-02-16 19:54:14 -08:00			`from __future__ import absolute_import`
			`from __future__ import division`
			`from __future__ import print_function`

			`import ray`
[rllib] Minor cleanups to TFPolicyGraph: add init args, constants for loss inputs (#4478) 2019-03-29 12:44:23 -07:00			`from ray.rllib.evaluation.postprocessing import compute_advantages, \`
			`Postprocessing`
[rllib] Rename PolicyGraph => Policy, move from evaluation/ to policy/ (#4819) This implements some of the renames proposed in #4813 We leave behind backwards-compatibility aliases for *PolicyGraph and SampleBatch. 2019-05-20 16:46:05 -07:00			`from ray.rllib.policy.sample_batch import SampleBatch`
			`from ray.rllib.policy.torch_policy_template import build_torch_policy`
[rllib] Minor cleanups to TFPolicyGraph: add init args, constants for loss inputs (#4478) 2019-03-29 12:44:23 -07:00

[rllib] Adds eager support with a generic `TFEagerPolicy` class (#5436) 2019-08-23 02:21:11 -04:00			`def pg_torch_loss(policy, model, dist_class, train_batch):`
			`logits, _ = model.from_batch(train_batch)`
			`action_dist = dist_class(logits, model)`
			`log_probs = action_dist.logp(train_batch[SampleBatch.ACTIONS])`
[rllib] [RFC] Dynamic definition of loss functions and modularization support (#4795) * dynamic graph * wip * clean up * fix * document trainer * wip * initialize the graph using a fake batch * clean up dynamic init * wip * spelling * use builder for ppo pol graph * add ppo graph * fix naming * order * docs * set class name correctly * add torch builder * add custom model support in builder * cleanup * remove underscores * fix py2 compat * Update dynamic_tf_policy_graph.py * Update tracking_dict.py * wip * rename * debug level * rename policy_graph -> policy in new classes * fix test * rename ppo tf policy * port appo too * forgot grads * default policy optimizer * make default config optional * add config to optimizer * use lr by default in optimizer * update * comments * remove optimizer * fix tuple actions support in dynamic tf graph 2019-05-18 00:23:11 -07:00			`# save the error in the policy object`
[rllib] Adds eager support with a generic `TFEagerPolicy` class (#5436) 2019-08-23 02:21:11 -04:00			`policy.pi_err = -train_batch[Postprocessing.ADVANTAGES].dot(`
[rllib] [RFC] Dynamic definition of loss functions and modularization support (#4795) * dynamic graph * wip * clean up * fix * document trainer * wip * initialize the graph using a fake batch * clean up dynamic init * wip * spelling * use builder for ppo pol graph * add ppo graph * fix naming * order * docs * set class name correctly * add torch builder * add custom model support in builder * cleanup * remove underscores * fix py2 compat * Update dynamic_tf_policy_graph.py * Update tracking_dict.py * wip * rename * debug level * rename policy_graph -> policy in new classes * fix test * rename ppo tf policy * port appo too * forgot grads * default policy optimizer * make default config optional * add config to optimizer * use lr by default in optimizer * update * comments * remove optimizer * fix tuple actions support in dynamic tf graph 2019-05-18 00:23:11 -07:00			`log_probs.reshape(-1))`
			`return policy.pi_err`
[rllib] Minor cleanups to TFPolicyGraph: add init args, constants for loss inputs (#4478) 2019-03-29 12:44:23 -07:00

[rllib] [RFC] Dynamic definition of loss functions and modularization support (#4795) * dynamic graph * wip * clean up * fix * document trainer * wip * initialize the graph using a fake batch * clean up dynamic init * wip * spelling * use builder for ppo pol graph * add ppo graph * fix naming * order * docs * set class name correctly * add torch builder * add custom model support in builder * cleanup * remove underscores * fix py2 compat * Update dynamic_tf_policy_graph.py * Update tracking_dict.py * wip * rename * debug level * rename policy_graph -> policy in new classes * fix test * rename ppo tf policy * port appo too * forgot grads * default policy optimizer * make default config optional * add config to optimizer * use lr by default in optimizer * update * comments * remove optimizer * fix tuple actions support in dynamic tf graph 2019-05-18 00:23:11 -07:00			`def postprocess_advantages(policy,`
			`sample_batch,`
			`other_agent_batches=None,`
			`episode=None):`
			`return compute_advantages(`
			`sample_batch, 0.0, policy.config["gamma"], use_gae=False)`
[rllib] add torch pg (#3857) * add torch pg * add torch imports * added torch pg * working torch pg implementation * add pg pytorch * Update a3c.py * Update a3c.py * Update torch_policy_graph.py * Update torch_policy_graph.py 2019-02-16 19:54:14 -08:00

[rllib] Adds eager support with a generic `TFEagerPolicy` class (#5436) 2019-08-23 02:21:11 -04:00			`def pg_loss_stats(policy, train_batch):`
[rllib] [RFC] Dynamic definition of loss functions and modularization support (#4795) * dynamic graph * wip * clean up * fix * document trainer * wip * initialize the graph using a fake batch * clean up dynamic init * wip * spelling * use builder for ppo pol graph * add ppo graph * fix naming * order * docs * set class name correctly * add torch builder * add custom model support in builder * cleanup * remove underscores * fix py2 compat * Update dynamic_tf_policy_graph.py * Update tracking_dict.py * wip * rename * debug level * rename policy_graph -> policy in new classes * fix test * rename ppo tf policy * port appo too * forgot grads * default policy optimizer * make default config optional * add config to optimizer * use lr by default in optimizer * update * comments * remove optimizer * fix tuple actions support in dynamic tf graph 2019-05-18 00:23:11 -07:00			`# the error is recorded when computing the loss`
			`return {"policy_loss": policy.pi_err.item()}`
[rllib] Support torch device and distributions. (#4553) 2019-04-12 11:39:14 -07:00
[rllib] add torch pg (#3857) * add torch pg * add torch imports * added torch pg * working torch pg implementation * add pg pytorch * Update a3c.py * Update a3c.py * Update torch_policy_graph.py * Update torch_policy_graph.py 2019-02-16 19:54:14 -08:00
[rllib] [RFC] Dynamic definition of loss functions and modularization support (#4795) * dynamic graph * wip * clean up * fix * document trainer * wip * initialize the graph using a fake batch * clean up dynamic init * wip * spelling * use builder for ppo pol graph * add ppo graph * fix naming * order * docs * set class name correctly * add torch builder * add custom model support in builder * cleanup * remove underscores * fix py2 compat * Update dynamic_tf_policy_graph.py * Update tracking_dict.py * wip * rename * debug level * rename policy_graph -> policy in new classes * fix test * rename ppo tf policy * port appo too * forgot grads * default policy optimizer * make default config optional * add config to optimizer * use lr by default in optimizer * update * comments * remove optimizer * fix tuple actions support in dynamic tf graph 2019-05-18 00:23:11 -07:00			`PGTorchPolicy = build_torch_policy(`
			`name="PGTorchPolicy",`
			`get_default_config=lambda: ray.rllib.agents.a3c.a3c.DEFAULT_CONFIG,`
			`loss_fn=pg_torch_loss,`
			`stats_fn=pg_loss_stats,`
			`postprocess_fn=postprocess_advantages)`