ray/rllib/agents/ppo/appo.py

from ray.rllib.agents.impala.impala import validate_config
from ray.rllib.agents.ppo.appo_tf_policy import AsyncPPOTFPolicy
from ray.rllib.agents.ppo.ppo import update_kl
from ray.rllib.agents.trainer import with_base_config
from ray.rllib.agents import impala
# yapf: disable
# __sphinx_doc_begin__
DEFAULT_CONFIG = with_base_config(impala.DEFAULT_CONFIG, {
    # Whether to use V-trace weighted advantages. If false, PPO GAE advantages
    # will be used instead.
    "vtrace": False,

    # == These two options only apply if vtrace: False ==
    # Should use a critic as a baseline (otherwise don't use value baseline;
    # required for using GAE).
    "use_critic": True,
    # If true, use the Generalized Advantage Estimator (GAE)
    # with a value function, see https://arxiv.org/pdf/1506.02438.pdf.
    "use_gae": True,
    # GAE(lambda) parameter
    "lambda": 1.0,

    # == PPO surrogate loss options ==
    "clip_param": 0.4,

    # == PPO KL Loss options ==
    "use_kl_loss": False,
    "kl_coeff": 1.0,
    "kl_target": 0.01,

    # == IMPALA optimizer params (see documentation in impala.py) ==
    "rollout_fragment_length": 50,
    "train_batch_size": 500,
    "min_iter_time_s": 10,
    "num_workers": 2,
    "num_gpus": 0,
    "num_data_loader_buffers": 1,
    "minibatch_buffer_size": 1,
    "num_sgd_iter": 1,
    "replay_proportion": 0.0,
    "replay_buffer_num_slots": 100,
    "learner_queue_size": 16,
    "learner_queue_timeout": 300,
    "max_sample_requests_in_flight_per_worker": 2,
    "broadcast_interval": 1,
    "grad_clip": 40.0,
    "opt_type": "adam",
    "lr": 0.0005,
    "lr_schedule": None,
    "decay": 0.99,
    "momentum": 0.0,
    "epsilon": 0.1,
    "vf_loss_coeff": 0.5,
    "entropy_coeff": 0.01,
    "entropy_coeff_schedule": None,

    # TODO: impl update target.
    "use_exec_api": False,
})
# __sphinx_doc_end__
# yapf: enable
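
# The defaults above are merged on top of impala.DEFAULT_CONFIG and can be
# overridden per experiment. A minimal sketch, assuming Ray Tune and gym's
# "CartPole-v0" are available (the environment name and override values are
# illustrative placeholders, not defaults of this module):
#
#     from ray import tune
#     tune.run(
#         "APPO",
#         stop={"training_iteration": 10},
#         config={
#             "env": "CartPole-v0",
#             "vtrace": True,       # V-trace advantages instead of PPO GAE
#             "num_workers": 4,
#         })
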
def update_target_and_kl(trainer, fetches):
    # Update the target network and the KL coefficient once the LearnerThread
    # has stepped through enough minibatches.
    learner_steps = trainer.optimizer.learner.num_steps
    if learner_steps >= trainer.target_update_frequency:
        # Update target network.
        trainer.optimizer.learner.num_steps = 0
        trainer.workers.local_worker().foreach_trainable_policy(
            lambda p, _: p.update_target())
        # Also update the KL coefficient (only used if the KL loss is enabled).
        if trainer.config["use_kl_loss"]:
            update_kl(trainer, trainer.optimizer.learner.stats)


def initialize_target(trainer):
    # Sync the target network once at startup and derive how many learner
    # steps to wait between target network updates.
    trainer.workers.local_worker().foreach_trainable_policy(
        lambda p, _: p.update_target())
    trainer.target_update_frequency = trainer.config["num_sgd_iter"] \
        * trainer.config["minibatch_buffer_size"]


def get_policy_class(config):
    # Use the torch policy only when explicitly requested; default to TF.
    if config.get("use_pytorch") is True:
        from ray.rllib.agents.ppo.appo_torch_policy import AsyncPPOTorchPolicy
        return AsyncPPOTorchPolicy
    else:
        return AsyncPPOTFPolicy


APPOTrainer = impala.ImpalaTrainer.with_updates(
    name="APPO",
    default_config=DEFAULT_CONFIG,
    validate_config=validate_config,
    default_policy=AsyncPPOTFPolicy,
    get_policy_class=get_policy_class,
    after_init=initialize_target,
    after_optimizer_step=update_target_and_kl)
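

# A quick way to try the trainer defined above. This is a minimal sketch,
# assuming Ray and gym's "CartPole-v0" are available; the environment name
# and config overrides are illustrative, not part of the original module.
if __name__ == "__main__":
    import ray

    ray.init()
    trainer = APPOTrainer(env="CartPole-v0", config={"num_workers": 2})
    for _ in range(3):
        # Each call runs one training iteration and returns a results dict.
        print(trainer.train())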