"""Example of using two different training methods at once in multi-agent.

Here we create a number of CartPole agents, some of which are trained with
DQN, and some of which are trained with PPO. We periodically sync weights
between the two trainers (note that no such syncing is needed when using just
a single training method).

For a simpler example, see also: multiagent_cartpole.py
"""

import argparse
import gym
import os

import ray
from ray.rllib.agents.dqn import DQNTrainer, DQNTFPolicy, DQNTorchPolicy
from ray.rllib.agents.ppo import PPOTrainer, PPOTFPolicy, PPOTorchPolicy
from ray.rllib.examples.env.multi_agent import MultiAgentCartPole
from ray.tune.logger import pretty_print
from ray.tune.registry import register_env

parser = argparse.ArgumentParser()
# The chosen framework (tf or torch) is used for both policies.
parser.add_argument(
    "--framework",
    choices=["tf", "tf2", "tfe", "torch"],
    default="tf",
    help="The DL framework specifier.")
parser.add_argument(
    "--as-test",
    action="store_true",
    help="Whether this script should be run as a test: --stop-reward must "
    "be achieved within --stop-timesteps AND --stop-iters.")
parser.add_argument(
    "--stop-iters",
    type=int,
    default=20,
    help="Number of iterations to train.")
parser.add_argument(
    "--stop-timesteps",
    type=int,
    default=100000,
    help="Number of timesteps to train.")
parser.add_argument(
    "--stop-reward",
    type=float,
    default=50.0,
    help="Reward at which we stop training.")

if __name__ == "__main__":
    args = parser.parse_args()

    ray.init()

    # Simple environment with 4 independent cartpole entities
    register_env("multi_agent_cartpole",
                 lambda _: MultiAgentCartPole({"num_agents": 4}))
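    # All agents in MultiAgentCartPole are independent CartPole copies, so we
    # can read the (shared) observation- and action spaces off a single dummy
    # CartPole env.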
    single_dummy_env = gym.make("CartPole-v0")
    obs_space = single_dummy_env.observation_space
    act_space = single_dummy_env.action_space

    # You can also have multiple policies per trainer, but here we just
    # show one each for PPO and DQN.
    policies = {
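        # Each policy is defined by a (policy_class, obs_space, action_space,
        # extra_config) tuple; the class is picked to match --framework.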
        "ppo_policy": (PPOTorchPolicy if args.framework == "torch" else
                       PPOTFPolicy, obs_space, act_space, {}),
        "dqn_policy": (DQNTorchPolicy if args.framework == "torch" else
                       DQNTFPolicy, obs_space, act_space, {}),
    }

    def policy_mapping_fn(agent_id, episode, **kwargs):
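        # MultiAgentCartPole uses integer agent ids (0 .. num_agents - 1):
        # even-numbered agents act with the PPO policy, odd-numbered agents
        # with the DQN policy.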
        if agent_id % 2 == 0:
            return "ppo_policy"
        else:
            return "dqn_policy"

    ppo_trainer = PPOTrainer(
        env="multi_agent_cartpole",
        config={
            "multiagent": {
                "policies": policies,
                "policy_mapping_fn": policy_mapping_fn,
                "policies_to_train": ["ppo_policy"],
            },
            "model": {
                "vf_share_layers": True,
            },
            "num_sgd_iter": 6,
            "vf_loss_coeff": 0.01,
            # Note: Unlike the policy weights, the MeanStdFilter's running
            # statistics are not synchronized to the DQN trainer by the weight
            # swapping below, so the two trainers will see differently
            # filtered observations.
            "observation_filter": "MeanStdFilter",
            # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
            "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
            "framework": args.framework,
        })

    dqn_trainer = DQNTrainer(
        env="multi_agent_cartpole",
        config={
            "multiagent": {
                "policies": policies,
                "policy_mapping_fn": policy_mapping_fn,
                "policies_to_train": ["dqn_policy"],
            },
            "model": {
                "vf_share_layers": True,
            },
            "gamma": 0.95,
            "n_step": 3,
            # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
            "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
            "framework": args.framework,
        })

    # You should see both the printed X and Y approach 200 as this trains:
    # info:
    #   policy_reward_mean:
    #     dqn_policy: X
    #     ppo_policy: Y
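    # Note: The two trainers do not share rollout workers; each collects its
    # own samples and they only exchange information through the explicit
    # weight syncing at the end of each iteration.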
    for i in range(args.stop_iters):
        print("== Iteration", i, "==")

        # improve the DQN policy
        print("-- DQN --")
        result_dqn = dqn_trainer.train()
        print(pretty_print(result_dqn))

        # improve the PPO policy
        print("-- PPO --")
        result_ppo = ppo_trainer.train()
        print(pretty_print(result_ppo))

        # Test passed gracefully.
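        # In this multi-agent env, `episode_reward_mean` is the per-episode
        # reward summed over all four agents, measured in each trainer's own
        # rollouts.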
        if args.as_test and \
                result_dqn["episode_reward_mean"] > args.stop_reward and \
                result_ppo["episode_reward_mean"] > args.stop_reward:
            print("test passed (both agents above requested reward)")
            quit(0)

        # swap weights to synchronize
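        # `get_weights(["ppo_policy"])` returns a dict mapping that policy ID
        # to its weights, and `set_weights()` only updates the policies that
        # appear in the given dict. Since both trainers hold copies of both
        # policies, this keeps each trainer's copy of the policy it does NOT
        # train in sync with the trainer that does train it.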
        dqn_trainer.set_weights(ppo_trainer.get_weights(["ppo_policy"]))
        ppo_trainer.set_weights(dqn_trainer.get_weights(["dqn_policy"]))

    # Desired reward not reached.
    if args.as_test:
        raise ValueError("Desired reward ({}) not reached!".format(
            args.stop_reward))