ray/rllib/examples/twostep_game.py

"""The two-step game from QMIX: https://arxiv.org/pdf/1803.11485.pdf

Configurations you can try:
    - normal policy gradients (PG)
    - contrib/MADDPG
    - QMIX

See also: centralized_critic.py for centralized critic PPO on this game.
"""

import argparse
from gym.spaces import Tuple, MultiDiscrete, Dict, Discrete

import ray
from ray import tune
from ray.tune import register_env, grid_search
from ray.rllib.env.multi_agent_env import ENV_STATE
from ray.rllib.examples.env.two_step_game import TwoStepGame
from ray.rllib.utils.test_utils import check_learning_achieved

parser = argparse.ArgumentParser()
parser.add_argument("--run", type=str, default="PG")
parser.add_argument("--num-cpus", type=int, default=0)
parser.add_argument("--as-test", action="store_true")
parser.add_argument("--torch", action="store_true")
parser.add_argument("--stop-reward", type=float, default=7.0)
parser.add_argument("--stop-timesteps", type=int, default=50000)

if __name__ == "__main__":
    args = parser.parse_args()

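    # Agents 0 and 1 are merged into one logical agent, "group_1", whose
    # observations and actions are tuples of the per-agent values.
    grouping = {
        "group_1": [0, 1],
    }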
    obs_space = Tuple([
        Dict({
            "obs": MultiDiscrete([2, 2, 2, 3]),
            ENV_STATE: MultiDiscrete([2, 2, 2])
        }),
        Dict({
            "obs": MultiDiscrete([2, 2, 2, 3]),
            ENV_STATE: MultiDiscrete([2, 2, 2])
        }),
    ])
    act_space = Tuple([
        TwoStepGame.action_space,
        TwoStepGame.action_space,
    ])
    register_env(
        "grouped_twostep",
        lambda config: TwoStepGame(config).with_agent_groups(
            grouping, obs_space=obs_space, act_space=act_space))

    if args.run == "contrib/MADDPG":
        # MADDPG trains one policy per agent on the ungrouped env.
        config = {
            "learning_starts": 100,
            "env_config": {
                "actions_are_logits": True,
            },
            "multiagent": {
                "policies": {
                    "pol1": (None, Discrete(6), TwoStepGame.action_space, {
                        "agent_id": 0,
                    }),
                    "pol2": (None, Discrete(6), TwoStepGame.action_space, {
                        "agent_id": 1,
                    }),
                },
                "policy_mapping_fn": lambda x: "pol1" if x == 0 else "pol2",
            },
            "framework": "torch" if args.torch else "tf",
        }
        group = False
    elif args.run == "QMIX":
        config = {
            "rollout_fragment_length": 4,
            "train_batch_size": 32,
            "exploration_fraction": 0.4,
            "exploration_final_eps": 0.0,
            "num_workers": 0,
            "mixer": grid_search([None, "qmix", "vdn"]),
            "env_config": {
                "separate_state_space": True,
                "one_hot_state_encoding": True
            },
            "framework": "torch" if args.torch else "tf",
        }
        group = True
    else:
        config = {"framework": "torch" if args.torch else "tf"}
        group = False

    # With the default --num-cpus=0, None is passed and Ray auto-detects CPUs.
    ray.init(num_cpus=args.num_cpus or None)

    stop = {
        "episode_reward_mean": args.stop_reward,
        "timesteps_total": args.stop_timesteps,
    }

    # QMIX runs on the grouped env; the other algorithms use the raw env class.
    config["env"] = "grouped_twostep" if group else TwoStepGame

    results = tune.run(args.run, stop=stop, config=config)

    if args.as_test:
        check_learning_achieved(results, args.stop_reward)

    ray.shutdown()