from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from ray.rllib.agents.ppo.appo_policy_graph import AsyncPPOTFPolicy
from ray.rllib.agents.trainer import with_base_config
from ray.rllib.agents import impala
from ray.rllib.utils.annotations import override

# yapf: disable
# __sphinx_doc_begin__
DEFAULT_CONFIG = with_base_config(impala.DEFAULT_CONFIG, {
    # Whether to use V-trace weighted advantages. If false, PPO GAE advantages
    # will be used instead.
    "vtrace": False,

    # == These two options only apply if vtrace: False ==
    # If true, use the Generalized Advantage Estimator (GAE)
    # with a value function, see https://arxiv.org/pdf/1506.02438.pdf.
    "use_gae": True,
    # GAE(lambda) parameter
    "lambda": 1.0,

    # == PPO surrogate loss options ==
    "clip_param": 0.4,

    # == IMPALA optimizer params (see documentation in impala.py) ==
    "sample_batch_size": 50,
    "train_batch_size": 500,
    "min_iter_time_s": 10,
    "num_workers": 2,
    "num_gpus": 1,
    "num_data_loader_buffers": 1,
    "minibatch_buffer_size": 1,
    "num_sgd_iter": 1,
    "replay_proportion": 0.0,
    "replay_buffer_num_slots": 100,
    "learner_queue_size": 16,
    "max_sample_requests_in_flight_per_worker": 2,
    "broadcast_interval": 1,
    "grad_clip": 40.0,
    "opt_type": "adam",
    "lr": 0.0005,
    "lr_schedule": None,
    "decay": 0.99,
    "momentum": 0.0,
    "epsilon": 0.1,
    "vf_loss_coeff": 0.5,
    "entropy_coeff": 0.01,
})
# __sphinx_doc_end__
# yapf: enable
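
# Illustrative sketch only (not part of the original file): the defaults above
# are typically overridden per experiment by merging a partial dict on top of
# DEFAULT_CONFIG, using the same `with_base_config` helper imported above.
# `EXAMPLE_TUNED_CONFIG` and the chosen values are hypothetical, shown purely
# to demonstrate the merge pattern.
EXAMPLE_TUNED_CONFIG = with_base_config(DEFAULT_CONFIG, {
    "vtrace": True,  # use V-trace weighted advantages instead of PPO GAE
    "num_workers": 8,  # scale sampling out across more rollout workers
    "train_batch_size": 2000,
})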


class APPOTrainer(impala.ImpalaTrainer):
    """PPO surrogate loss with IMPALA-architecture."""

    _name = "APPO"
    _default_config = DEFAULT_CONFIG
    _policy_graph = AsyncPPOTFPolicy

    @override(impala.ImpalaTrainer)
    def _get_policy_graph(self):
        return AsyncPPOTFPolicy
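

# A minimal usage sketch, not part of the original module. It assumes a local
# Ray installation of the same vintage as this file and the Gym environment
# name "CartPole-v0"; both choices are illustrative, not prescribed by RLlib.
if __name__ == "__main__":
    import ray

    ray.init()
    # Override a couple of defaults so the example can run on a CPU-only box.
    trainer = APPOTrainer(
        env="CartPole-v0",
        config={
            "num_workers": 2,
            "num_gpus": 0,
        })
    # Each train() call runs one iteration of asynchronous sampling followed
    # by SGD passes over the collected batch, and returns a result dict.
    result = trainer.train()
    print("episode_reward_mean:", result["episode_reward_mean"])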