ray/rllib/algorithms/pg/pg_torch_policy.py

"""
PyTorch policy class used for PG.
"""
from typing import Dict, List, Type, Union

import ray
from ray.rllib.algorithms.pg.utils import post_process_advantages
from ray.rllib.evaluation.postprocessing import Postprocessing
from ray.rllib.models.torch.torch_action_dist import TorchDistributionWrapper
from ray.rllib.models.modelv2 import ModelV2
from ray.rllib.policy import Policy
from ray.rllib.policy.policy_template import build_policy_class
from ray.rllib.policy.sample_batch import SampleBatch
from ray.rllib.utils.framework import try_import_torch
from ray.rllib.utils.typing import TensorType

# try_import_torch() returns (torch, nn); nn is unused here.
torch, _ = try_import_torch()


def pg_torch_loss(
    policy: Policy,
    model: ModelV2,
    dist_class: Type[TorchDistributionWrapper],
    train_batch: SampleBatch,
) -> Union[TensorType, List[TensorType]]:
"""The basic policy gradients loss function.
Args:
policy (Policy): The Policy to calculate the loss for.
model (ModelV2): The Model to calculate the loss for.
dist_class (Type[ActionDistribution]: The action distr. class.
train_batch (SampleBatch): The training data.
Returns:
Union[TensorType, List[TensorType]]: A single loss tensor or a list
of loss tensors.
"""
    # Pass the training data through our model to get distribution parameters.
    dist_inputs, _ = model(train_batch)

    # Create an action distribution object.
    action_dist = dist_class(dist_inputs, model)

    # Calculate the vanilla PG loss based on:
    # L = -E[ log(pi(a|s)) * A ]
    log_probs = action_dist.logp(train_batch[SampleBatch.ACTIONS])

    # Final policy loss.
    policy_loss = -torch.mean(log_probs * train_batch[Postprocessing.ADVANTAGES])

    # Store values for stats function in model (tower), such that for
    # multi-GPU, we do not override them during the parallel loss phase.
    model.tower_stats["policy_loss"] = policy_loss

    return policy_loss
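
# The loss above is the classic REINFORCE estimator: scale each action's
# log-likelihood by its (post-processed) advantage and average. As a minimal,
# self-contained sketch (illustrative only, not RLlib API), the same
# computation on toy tensors for a 2-action discrete space:
#
#     toy_logits = torch.tensor([[2.0, 0.0], [0.0, 1.0]])
#     toy_actions = torch.tensor([0, 1])
#     toy_advantages = torch.tensor([1.5, -0.5])
#     toy_logp = torch.distributions.Categorical(
#         logits=toy_logits
#     ).log_prob(toy_actions)
#     toy_loss = -torch.mean(toy_logp * toy_advantages)
#
# Minimizing `toy_loss` raises the log-likelihood of actions with positive
# advantages and lowers it for those with negative advantages.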


def pg_loss_stats(policy: Policy, train_batch: SampleBatch) -> Dict[str, TensorType]:
    """Returns the calculated loss in a stats dict.

    Args:
        policy (Policy): The Policy object.
        train_batch (SampleBatch): The data used for training.

    Returns:
        Dict[str, TensorType]: The stats dict.
    """
    return {
        "policy_loss": torch.mean(torch.stack(policy.get_tower_stats("policy_loss"))),
    }
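
# With multi-GPU training, each model "tower" computes the loss on its own
# data shard and stores it in `tower_stats`; `get_tower_stats("policy_loss")`
# gathers one tensor per tower, and the stack+mean above reduces them to a
# single scalar for reporting.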


# Build a child class of `TorchPolicy`, given the extra options:
# - trajectory post-processing function (to calculate advantages)
# - PG loss function
PGTorchPolicy = build_policy_class(
    name="PGTorchPolicy",
    framework="torch",
    get_default_config=lambda: ray.rllib.algorithms.pg.DEFAULT_CONFIG,
    loss_fn=pg_torch_loss,
    stats_fn=pg_loss_stats,
    postprocess_fn=post_process_advantages,
)
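

# A hedged usage sketch (not part of the original module): exercising this
# policy through the high-level PG algorithm. `PGConfig` and the
# "CartPole-v1" environment id are assumptions here; PG shipped with RLlib
# in Ray 2.x but was moved out of the core library in later releases.
if __name__ == "__main__":
    from ray.rllib.algorithms.pg import PGConfig

    pg_config = PGConfig().environment("CartPole-v1").framework("torch")
    algo = pg_config.build()
    # Run a single training iteration and print the collected metrics.
    print(algo.train())
    algo.stop()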