ray/rllib/models/action_dist.py

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from ray.rllib.utils.annotations import DeveloperAPI


@DeveloperAPI
class ActionDistribution(object):
    """The policy action distribution of an agent.

    Attributes:
        inputs (Tensors): input vector to compute samples from.
        model (ModelV2): reference to model producing the inputs.
    """

    @DeveloperAPI
    def __init__(self, inputs, model):
        """Initialize the action dist.

        Arguments:
            inputs (Tensors): input vector to compute samples from.
            model (ModelV2): reference to model producing the inputs. This
                is mainly useful if you want to use model variables to compute
                action outputs (e.g., for auto-regressive action distributions,
                see examples/autoregressive_action_dist.py).
        """
        self.inputs = inputs
        self.model = model

    @DeveloperAPI
    def sample(self):
        """Draw a sample from the action distribution."""
        raise NotImplementedError

    @DeveloperAPI
    def sampled_action_logp(self):
        """Returns the log probability of the last sampled action."""
        raise NotImplementedError

    @DeveloperAPI
    def logp(self, x):
        """The log-likelihood of the action distribution."""
        raise NotImplementedError

    @DeveloperAPI
    def kl(self, other):
        """The KL-divergence between two action distributions."""
        raise NotImplementedError

    @DeveloperAPI
    def entropy(self):
        """The entropy of the action distribution."""
        raise NotImplementedError

    def multi_kl(self, other):
        """The KL-divergence between two action distributions.

        This differs from kl() in that it can return an array for
        MultiDiscrete. TODO(ekl) consider removing this.
        """
        return self.kl(other)

    def multi_entropy(self):
        """The entropy of the action distribution.

        This differs from entropy() in that it can return an array for
        MultiDiscrete. TODO(ekl) consider removing this.
        """
        return self.entropy()

    @DeveloperAPI
    @staticmethod
    def required_model_output_shape(action_space, model_config):
        """Returns the required shape of an input parameter tensor for a
        particular action space and an optional dict of distribution-specific
        options.

        Args:
            action_space (gym.Space): The action space this distribution will
                be used for, whose shape attributes will be used to determine
                the required shape of the input parameter tensor.
            model_config (dict): Model's config dict (as defined in catalog.py)

        Returns:
            model_output_shape (int or np.ndarray of ints): size of the
                required input vector (minus leading batch dimension).
        """
        raise NotImplementedError
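
The abstract interface above can be illustrated with a small concrete distribution. The following is only a sketch under stated assumptions, not RLlib's own Categorical implementation: `SimpleCategorical`, `FakeDiscrete`, and the `seed` argument are hypothetical names introduced for illustration, the inputs are a single unbatched list of logits rather than a batched tensor, and the class deliberately does not import Ray so that it runs with the standard library alone. In real use one would subclass `ActionDistribution` and operate on framework tensors.

```python
import math
import random


class FakeDiscrete(object):
    """Hypothetical stand-in for gym.spaces.Discrete (only exposes `n`)."""

    def __init__(self, n):
        self.n = n


class SimpleCategorical(object):
    """Illustrative categorical distribution mirroring the interface above.

    Assumption: `inputs` is one unbatched list of logits. A real RLlib
    distribution would subclass ActionDistribution and handle batches.
    """

    def __init__(self, inputs, model=None, seed=None):
        self.inputs = list(inputs)
        self.model = model
        self._rng = random.Random(seed)  # hypothetical: seeded for repeatability
        self._last_sample = None

    def _log_probs(self):
        # Numerically stable log-softmax over the logits.
        m = max(self.inputs)
        log_z = m + math.log(sum(math.exp(v - m) for v in self.inputs))
        return [v - log_z for v in self.inputs]

    def sample(self):
        probs = [math.exp(lp) for lp in self._log_probs()]
        self._last_sample = self._rng.choices(
            range(len(probs)), weights=probs)[0]
        return self._last_sample

    def sampled_action_logp(self):
        if self._last_sample is None:
            raise ValueError("sample() must be called first")
        return self.logp(self._last_sample)

    def logp(self, x):
        return self._log_probs()[x]

    def kl(self, other):
        # KL(self || other) = sum_i p_i * (log p_i - log q_i)
        p, q = self._log_probs(), other._log_probs()
        return sum(math.exp(a) * (a - b) for a, b in zip(p, q))

    def entropy(self):
        return -sum(math.exp(lp) * lp for lp in self._log_probs())

    @staticmethod
    def required_model_output_shape(action_space, model_config):
        # One logit per discrete action.
        return action_space.n


# Usage: a uniform distribution over four actions.
dist = SimpleCategorical([0.0, 0.0, 0.0, 0.0], seed=0)
action = dist.sample()
num_outputs = SimpleCategorical.required_model_output_shape(
    FakeDiscrete(4), {})
```

With uniform logits, the entropy equals `log(4)` and the KL-divergence of the distribution against itself is zero, which makes the sketch easy to sanity-check.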