ray/rllib/agents/marwil/marwil.py

from ray.rllib.agents.trainer import with_common_config
from ray.rllib.agents.trainer_template import build_trainer
from ray.rllib.agents.marwil.marwil_tf_policy import MARWILTFPolicy
from ray.rllib.optimizers import SyncBatchReplayOptimizer

# yapf: disable
# __sphinx_doc_begin__
DEFAULT_CONFIG = with_common_config({
    # You should override this to point to an offline dataset (see agent.py).
    "input": "sampler",
    # Use importance sampling ("is") and weighted importance sampling ("wis")
    # estimators to evaluate reward from the offline data.
    "input_evaluation": ["is", "wis"],
    # Scaling of advantages in the exponential weighting term.
    # When beta is 0.0, MARWIL reduces to plain imitation learning.
    "beta": 1.0,
    # Balances the value estimation loss against the policy optimization loss.
    "vf_coeff": 1.0,
    # Whether to calculate cumulative rewards.
    "postprocess_inputs": True,
    # Whether to roll out "complete_episodes" or "truncate_episodes".
    "batch_mode": "complete_episodes",
    # Learning rate for the Adam optimizer.
    "lr": 1e-4,
    # Number of timesteps collected for each SGD round.
    "train_batch_size": 2000,
    # Maximum number of timesteps to keep in the batch replay buffer.
    "replay_buffer_size": 100000,
    # Number of timesteps to read before learning starts.
    "learning_starts": 0,
    # === Parallelism ===
    "num_workers": 0,
    # Whether to use PyTorch (instead of TensorFlow) as the framework.
    "use_pytorch": False,
})
# __sphinx_doc_end__
# yapf: enable
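
# For intuition, the policy term controlled by "beta" looks roughly like the
# sketch below (illustrative only; marwil_policy_loss_sketch is a made-up
# helper, not RLlib API, and the real loss also normalizes the advantages).
# Advantages enter the imitation loss through an exponential weight, so
# beta=0.0 weights all logged actions equally, recovering plain imitation
# learning:
#
#     import numpy as np
#
#     def marwil_policy_loss_sketch(logp, advantages, beta):
#         # e^(beta * A(s, a)) re-weights the log-likelihood of each action.
#         weights = np.exp(beta * advantages)
#         return -np.mean(weights * logp)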


def make_optimizer(workers, config):
    """Create the batch replay optimizer used for MARWIL training."""
    return SyncBatchReplayOptimizer(
        workers,
        learning_starts=config["learning_starts"],
        buffer_size=config["replay_buffer_size"],
        train_batch_size=config["train_batch_size"],
    )
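
# Note on the choice above: SyncBatchReplayOptimizer keeps sampled (complete)
# episodes in a replay buffer capped at "replay_buffer_size" timesteps and,
# once "learning_starts" timesteps have been collected, runs SGD on
# "train_batch_size"-sized batches drawn from that buffer.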


def get_policy_class(config):
    """Return the policy class matching the configured framework."""
    if config["use_pytorch"]:
        from ray.rllib.agents.marwil.marwil_torch_policy import \
            MARWILTorchPolicy
        return MARWILTorchPolicy
    else:
        return MARWILTFPolicy


MARWILTrainer = build_trainer(
    name="MARWIL",
    default_config=DEFAULT_CONFIG,
    default_policy=MARWILTFPolicy,
    get_policy_class=get_policy_class,
    make_policy_optimizer=make_optimizer)
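
# Example usage (a minimal sketch; the dataset path and environment name are
# illustrative assumptions, not part of this module). "input" points at a
# directory of offline experiences previously written by RLlib's JSON output
# writer:
#
#     import ray
#     from ray.rllib.agents.marwil import MARWILTrainer
#
#     ray.init()
#     trainer = MARWILTrainer(
#         env="CartPole-v0",
#         config={"input": "/tmp/cartpole-out", "beta": 1.0})
#     for _ in range(10):
#         print(trainer.train())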