ray/rllib/examples/two_step_game.py

"""The two-step game from QMIX: https://arxiv.org/pdf/1803.11485.pdf

Configurations you can try:
    - normal policy gradients (PG)
    - contrib/MADDPG
    - QMIX

See also: centralized_critic.py for centralized critic PPO on this game.
"""

import argparse
from gym.spaces import Tuple, MultiDiscrete, Dict, Discrete
import os

import ray
from ray import tune
from ray.tune import register_env, grid_search
from ray.rllib.env.multi_agent_env import ENV_STATE
from ray.rllib.examples.env.two_step_game import TwoStepGame
from ray.rllib.utils.test_utils import check_learning_achieved

parser = argparse.ArgumentParser()
parser.add_argument(
    "--run",
    type=str,
    default="PG",
    help="The RLlib-registered algorithm to use.")
parser.add_argument(
    "--framework",
    choices=["tf", "tf2", "tfe", "torch"],
    default="tf",
    help="The DL framework specifier.")
parser.add_argument("--num-cpus", type=int, default=0)
parser.add_argument(
    "--as-test",
    action="store_true",
    help="Whether this script should be run as a test: --stop-reward must "
    "be achieved within --stop-timesteps AND --stop-iters.")
parser.add_argument(
    "--stop-iters",
    type=int,
    default=200,
    help="Number of iterations to train.")
parser.add_argument(
    "--stop-timesteps",
    type=int,
    default=50000,
    help="Number of timesteps to train.")
parser.add_argument(
    "--stop-reward",
    type=float,
    default=7.0,
    help="Reward at which we stop training.")

if __name__ == "__main__":
    args = parser.parse_args()

    grouping = {
        "group_1": [0, 1],
    }
    obs_space = Tuple([
        Dict({
            "obs": MultiDiscrete([2, 2, 2, 3]),
            ENV_STATE: MultiDiscrete([2, 2, 2])
        }),
        Dict({
            "obs": MultiDiscrete([2, 2, 2, 3]),
            ENV_STATE: MultiDiscrete([2, 2, 2])
        }),
    ])
    act_space = Tuple([
        TwoStepGame.action_space,
        TwoStepGame.action_space,
    ])
    register_env(
        "grouped_twostep",
        lambda config: TwoStepGame(config).with_agent_groups(
            grouping, obs_space=obs_space, act_space=act_space))

    if args.run == "contrib/MADDPG":
        obs_space_dict = {
            "agent_1": Discrete(6),
            "agent_2": Discrete(6),
        }
        act_space_dict = {
            "agent_1": TwoStepGame.action_space,
            "agent_2": TwoStepGame.action_space,
        }
        config = {
            "learning_starts": 100,
            "env_config": {
                "actions_are_logits": True,
            },
            "multiagent": {
                "policies": {
                    "pol1": (None, Discrete(6), TwoStepGame.action_space, {
                        "agent_id": 0,
                    }),
                    "pol2": (None, Discrete(6), TwoStepGame.action_space, {
                        "agent_id": 1,
                    }),
                },
                "policy_mapping_fn": lambda x: "pol1" if x == 0 else "pol2",
            },
            "framework": args.framework,
            # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
            "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
        }
        group = False
    elif args.run == "QMIX":
        config = {
            "rollout_fragment_length": 4,
            "train_batch_size": 32,
            "exploration_config": {
                "epsilon_timesteps": 5000,
                "final_epsilon": 0.05,
            },
            "num_workers": 0,
            "mixer": grid_search([None, "qmix", "vdn"]),
            "env_config": {
                "separate_state_space": True,
                "one_hot_state_encoding": True
            },
            # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
            "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
            "framework": args.framework,
        }
        group = True
    else:
        config = {
            # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
            "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
            "framework": args.framework,
        }
        group = False

    ray.init(num_cpus=args.num_cpus or None)

    stop = {
        "episode_reward_mean": args.stop_reward,
        "timesteps_total": args.stop_timesteps,
        "training_iteration": args.stop_iters,
    }

    config = dict(config, **{
        "env": "grouped_twostep" if group else TwoStepGame,
    })

    results = tune.run(args.run, stop=stop, config=config, verbose=1)

    if args.as_test:
        check_learning_achieved(results, args.stop_reward)

    ray.shutdown()
[rllib] Centralized critic / PPO example on TwoStepGame (#5392) 2019-08-08 14:03:28 -07:00			`"""The two-step game from QMIX: https://arxiv.org/pdf/1803.11485.pdf`

			`Configurations you can try:`
			`- normal policy gradients (PG)`
			`- contrib/MADDPG`
			`- QMIX`

			`See also: centralized_critic.py for centralized critic PPO on this game.`
			`"""`
[rllib] Q-Mix implementation (Q-Mix, VDN, IQN, and Ape-X variants) (#3548) 2018-12-18 10:40:01 -08:00
			`import argparse`
Qmix on gpu and with non-stacked-obs environment state support (#5751) 2019-10-08 13:18:07 -07:00			`from gym.spaces import Tuple, MultiDiscrete, Dict, Discrete`
[RLlib] Fix all example scripts to run on GPUs. (#11105) 2020-10-02 23:07:44 +02:00			`import os`
[rllib] Q-Mix implementation (Q-Mix, VDN, IQN, and Ape-X variants) (#3548) 2018-12-18 10:40:01 -08:00
			`import ray`
[rllib] Switch to tune.run() instead of run_experiments() (#4515) 2019-03-30 14:07:50 -07:00			`from ray import tune`
			`from ray.tune import register_env, grid_search`
[RLlib] rllib/examples folder restructuring (#8250) Cleans up of the rllib/examples folder by moving all example Envs into rllibexamples/env (so they can be used by other scripts and tests as well). 2020-05-01 22:59:34 +02:00			`from ray.rllib.env.multi_agent_env import ENV_STATE`
			`from ray.rllib.examples.env.two_step_game import TwoStepGame`
[RLlib] Examples folder restructuring (Model examples; final part). (#8278) - This PR completes any previously missing PyTorch Model counterparts to TFModels in examples/models. - It also makes sure, all example scripts in the rllib/examples folder are tested for both frameworks and learn the given task (this is often currently not checked) using a --as-test flag in connection with a --stop-reward. 2020-05-12 08:23:10 +02:00			`from ray.rllib.utils.test_utils import check_learning_achieved`
[rllib] Q-Mix implementation (Q-Mix, VDN, IQN, and Ape-X variants) (#3548) 2018-12-18 10:40:01 -08:00
			`parser = argparse.ArgumentParser()`
[RLlib] Examples scripts add argparse help and replace `--torch` with `--framework`. (#15832) 2021-05-18 13:18:12 +02:00			`parser.add_argument(`
			`"--run",`
			`type=str,`
			`default="PG",`
			`help="The RLlib-registered algorithm to use.")`
			`parser.add_argument(`
			`"--framework",`
			`choices=["tf", "tf2", "tfe", "torch"],`
			`default="tf",`
			`help="The DL framework specifier.")`
[RLlib] Move all jenkins RLlib-tests into bazel (rllib/BUILD). (#7178) * commit * comment 2020-02-15 23:50:44 +01:00			`parser.add_argument("--num-cpus", type=int, default=0)`
[RLlib] Examples scripts add argparse help and replace `--torch` with `--framework`. (#15832) 2021-05-18 13:18:12 +02:00			`parser.add_argument(`
			`"--as-test",`
			`action="store_true",`
			`help="Whether this script should be run as a test: --stop-reward must "`
			`"be achieved within --stop-timesteps AND --stop-iters.")`
			`parser.add_argument(`
			`"--stop-iters",`
			`type=int,`
			`default=200,`
			`help="Number of iterations to train.")`
			`parser.add_argument(`
			`"--stop-timesteps",`
			`type=int,`
			`default=50000,`
			`help="Number of timesteps to train.")`
			`parser.add_argument(`
			`"--stop-reward",`
			`type=float,`
			`default=7.0,`
			`help="Reward at which we stop training.")`
[rllib] Q-Mix implementation (Q-Mix, VDN, IQN, and Ape-X variants) (#3548) 2018-12-18 10:40:01 -08:00
			`if __name__ == "__main__":`
			`args = parser.parse_args()`

			`grouping = {`
MADDPG implementation in RLlib (#5348) 2019-08-06 19:22:06 -04:00			`"group_1": [0, 1],`
[rllib] Q-Mix implementation (Q-Mix, VDN, IQN, and Ape-X variants) (#3548) 2018-12-18 10:40:01 -08:00			`}`
			`obs_space = Tuple([`
Qmix on gpu and with non-stacked-obs environment state support (#5751) 2019-10-08 13:18:07 -07:00			`Dict({`
			`"obs": MultiDiscrete([2, 2, 2, 3]),`
			`ENV_STATE: MultiDiscrete([2, 2, 2])`
			`}),`
			`Dict({`
			`"obs": MultiDiscrete([2, 2, 2, 3]),`
			`ENV_STATE: MultiDiscrete([2, 2, 2])`
			`}),`
[rllib] Q-Mix implementation (Q-Mix, VDN, IQN, and Ape-X variants) (#3548) 2018-12-18 10:40:01 -08:00			`])`
			`act_space = Tuple([`
			`TwoStepGame.action_space,`
			`TwoStepGame.action_space,`
			`])`
			`register_env(`
			`"grouped_twostep",`
			`lambda config: TwoStepGame(config).with_agent_groups(`
			`grouping, obs_space=obs_space, act_space=act_space))`

MADDPG implementation in RLlib (#5348) 2019-08-06 19:22:06 -04:00			`if args.run == "contrib/MADDPG":`
			`obs_space_dict = {`
Qmix on gpu and with non-stacked-obs environment state support (#5751) 2019-10-08 13:18:07 -07:00			`"agent_1": Discrete(6),`
			`"agent_2": Discrete(6),`
MADDPG implementation in RLlib (#5348) 2019-08-06 19:22:06 -04:00			`}`
			`act_space_dict = {`
			`"agent_1": TwoStepGame.action_space,`
			`"agent_2": TwoStepGame.action_space,`
			`}`
			`config = {`
			`"learning_starts": 100,`
			`"env_config": {`
			`"actions_are_logits": True,`
			`},`
			`"multiagent": {`
			`"policies": {`
Qmix on gpu and with non-stacked-obs environment state support (#5751) 2019-10-08 13:18:07 -07:00			`"pol1": (None, Discrete(6), TwoStepGame.action_space, {`
			`"agent_id": 0,`
			`}),`
			`"pol2": (None, Discrete(6), TwoStepGame.action_space, {`
			`"agent_id": 1,`
			`}),`
MADDPG implementation in RLlib (#5348) 2019-08-06 19:22:06 -04:00			`},`
Revert "[RLlib] Allow policies to be added/deleted on the fly. (#16359)" (#16543) This reverts commit e78ec370a979a14e295344eff1a5ac6e22a07335. 2021-06-18 12:21:49 -07:00			`"policy_mapping_fn": lambda x: "pol1" if x == 0 else "pol2",`
MADDPG implementation in RLlib (#5348) 2019-08-06 19:22:06 -04:00			`},`
[RLlib] Examples scripts add argparse help and replace `--torch` with `--framework`. (#15832) 2021-05-18 13:18:12 +02:00			`"framework": args.framework,`
[RLlib] Fix all example scripts to run on GPUs. (#11105) 2020-10-02 23:07:44 +02:00			# Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
			`"num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),`
MADDPG implementation in RLlib (#5348) 2019-08-06 19:22:06 -04:00			`}`
			`group = False`
			`elif args.run == "QMIX":`
[rllib] Q-Mix implementation (Q-Mix, VDN, IQN, and Ape-X variants) (#3548) 2018-12-18 10:40:01 -08:00			`config = {`
[rllib] Rename sample_batch_size => rollout_fragment_length (#7503) * bulk rename * deprecation warn * update doc * update fig * line length * rename * make pytest comptaible * fix test * fi sys * rename * wip * fix more * lint * update svg * comments * lint * fix use of batch steps 2020-03-14 12:05:04 -07:00			`"rollout_fragment_length": 4,`
[rllib] Q-Mix implementation (Q-Mix, VDN, IQN, and Ape-X variants) (#3548) 2018-12-18 10:40:01 -08:00			`"train_batch_size": 32,`
[RLlib] Issue 8384: QMIX doesn't learn anything. (#9527) 2020-07-17 12:14:34 +02:00			`"exploration_config": {`
			`"epsilon_timesteps": 5000,`
			`"final_epsilon": 0.05,`
			`},`
[rllib] Q-Mix implementation (Q-Mix, VDN, IQN, and Ape-X variants) (#3548) 2018-12-18 10:40:01 -08:00			`"num_workers": 0,`
			`"mixer": grid_search([None, "qmix", "vdn"]),`
Qmix on gpu and with non-stacked-obs environment state support (#5751) 2019-10-08 13:18:07 -07:00			`"env_config": {`
			`"separate_state_space": True,`
			`"one_hot_state_encoding": True`
			`},`
[RLlib] Fix all example scripts to run on GPUs. (#11105) 2020-10-02 23:07:44 +02:00			# Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
			`"num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),`
[RLlib] Examples scripts add argparse help and replace `--torch` with `--framework`. (#15832) 2021-05-18 13:18:12 +02:00			`"framework": args.framework,`
[rllib] Q-Mix implementation (Q-Mix, VDN, IQN, and Ape-X variants) (#3548) 2018-12-18 10:40:01 -08:00			`}`
[rllib] Raise an error if multi-agent envs terminate without a last observation for agents (#4139) * fix it * lint * Update rllib-training.rst 2019-02-23 21:23:40 -08:00			`group = True`
[rllib] Q-Mix implementation (Q-Mix, VDN, IQN, and Ape-X variants) (#3548) 2018-12-18 10:40:01 -08:00			`else:`
[RLlib] Fix all example scripts to run on GPUs. (#11105) 2020-10-02 23:07:44 +02:00			`config = {`
			# Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
			`"num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),`
[RLlib] Examples scripts add argparse help and replace `--torch` with `--framework`. (#15832) 2021-05-18 13:18:12 +02:00			`"framework": args.framework,`
[RLlib] Fix all example scripts to run on GPUs. (#11105) 2020-10-02 23:07:44 +02:00			`}`
[rllib] Raise an error if multi-agent envs terminate without a last observation for agents (#4139) * fix it * lint * Update rllib-training.rst 2019-02-23 21:23:40 -08:00			`group = False`
[rllib] Q-Mix implementation (Q-Mix, VDN, IQN, and Ape-X variants) (#3548) 2018-12-18 10:40:01 -08:00
[RLlib] Move all jenkins RLlib-tests into bazel (rllib/BUILD). (#7178) * commit * comment 2020-02-15 23:50:44 +01:00			`ray.init(num_cpus=args.num_cpus or None)`
[RLlib] Examples folder restructuring (Model examples; final part). (#8278) - This PR completes any previously missing PyTorch Model counterparts to TFModels in examples/models. - It also makes sure, all example scripts in the rllib/examples folder are tested for both frameworks and learn the given task (this is often currently not checked) using a --as-test flag in connection with a --stop-reward. 2020-05-12 08:23:10 +02:00
			`stop = {`
			`"episode_reward_mean": args.stop_reward,`
			`"timesteps_total": args.stop_timesteps,`
[RLlib] Examples scripts add argparse help and replace `--torch` with `--framework`. (#15832) 2021-05-18 13:18:12 +02:00			`"training_iteration": args.stop_iters,`
[RLlib] Examples folder restructuring (Model examples; final part). (#8278) - This PR completes any previously missing PyTorch Model counterparts to TFModels in examples/models. - It also makes sure, all example scripts in the rllib/examples folder are tested for both frameworks and learn the given task (this is often currently not checked) using a --as-test flag in connection with a --stop-reward. 2020-05-12 08:23:10 +02:00			`}`

			`config = dict(config, **{`
			`"env": "grouped_twostep" if group else TwoStepGame,`
			`})`

[RLlib] Issue 8384: QMIX doesn't learn anything. (#9527) 2020-07-17 12:14:34 +02:00			`results = tune.run(args.run, stop=stop, config=config, verbose=1)`
[RLlib] Examples folder restructuring (Model examples; final part). (#8278) - This PR completes any previously missing PyTorch Model counterparts to TFModels in examples/models. - It also makes sure, all example scripts in the rllib/examples folder are tested for both frameworks and learn the given task (this is often currently not checked) using a --as-test flag in connection with a --stop-reward. 2020-05-12 08:23:10 +02:00
			`if args.as_test:`
			`check_learning_achieved(results, args.stop_reward)`

			`ray.shutdown()`