"""Example of using two different training methods at once in multi-agent.
|
|
|
|
|
|
|
|
Here we create a number of CartPole agents, some of which are trained with
|
|
|
|
DQN, and some of which are trained with PPO. We periodically sync weights
|
|
|
|
between the two trainers (note that no such syncing is needed when using just
|
|
|
|
a single training method).
|
|
|
|
|
|
|
|
For a simpler example, see also: multiagent_cartpole.py
|
|
|
|
"""

import argparse
import gym
import os

import ray
from ray.rllib.agents.dqn import DQNTrainer, DQNTFPolicy, DQNTorchPolicy
from ray.rllib.agents.ppo import PPOTrainer, PPOTFPolicy, PPOTorchPolicy
from ray.rllib.examples.env.multi_agent import MultiAgentCartPole
from ray.tune.logger import pretty_print
from ray.tune.registry import register_env

parser = argparse.ArgumentParser()
# Use torch for both policies.
parser.add_argument("--torch", action="store_true")
parser.add_argument("--as-test", action="store_true")
parser.add_argument("--stop-iters", type=int, default=20)
parser.add_argument("--stop-reward", type=float, default=50)
parser.add_argument("--stop-timesteps", type=int, default=100000)

if __name__ == "__main__":
    args = parser.parse_args()

    ray.init()

    # Simple environment with 4 independent CartPole entities.
    register_env("multi_agent_cartpole",
                 lambda _: MultiAgentCartPole({"num_agents": 4}))
    single_dummy_env = gym.make("CartPole-v0")
    obs_space = single_dummy_env.observation_space
    act_space = single_dummy_env.action_space

    # You can also have multiple policies per trainer, but here we just
    # show one each for PPO and DQN.
    policies = {
        "ppo_policy": (PPOTorchPolicy if args.torch else PPOTFPolicy,
                       obs_space, act_space, {}),
        "dqn_policy": (DQNTorchPolicy if args.torch else DQNTFPolicy,
                       obs_space, act_space, {}),
    }
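    # Each policy spec above is a tuple of
    # (policy_class, observation_space, action_space, per-policy config
    # overrides); the per-policy overrides are left empty here.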

    def policy_mapping_fn(agent_id):
        if agent_id % 2 == 0:
            return "ppo_policy"
        else:
            return "dqn_policy"

    ppo_trainer = PPOTrainer(
        env="multi_agent_cartpole",
        config={
            "multiagent": {
                "policies": policies,
                "policy_mapping_fn": policy_mapping_fn,
                "policies_to_train": ["ppo_policy"],
            },
            "explore": False,
            # Disable filters; otherwise we would need to synchronize those
            # to the DQN trainer as well.
            "observation_filter": "NoFilter",
            # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
            "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
            "framework": "torch" if args.torch else "tf",
        })

    dqn_trainer = DQNTrainer(
        env="multi_agent_cartpole",
        config={
            "multiagent": {
                "policies": policies,
                "policy_mapping_fn": policy_mapping_fn,
                "policies_to_train": ["dqn_policy"],
            },
            "gamma": 0.95,
            "n_step": 3,
            # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
            "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
            "framework": "torch" if args.torch else "tf",
        })

    # You should see both the printed X and Y approach 200 as this trains:
    # info:
    #   policy_reward_mean:
    #     dqn_policy: X
    #     ppo_policy: Y
    for i in range(args.stop_iters):
|
2018-07-22 05:09:25 -07:00
|
|
|
print("== Iteration", i, "==")
|
|
|
|
|
|
|
|
# improve the DQN policy
|
|
|
|
print("-- DQN --")
|
2020-05-12 08:23:10 +02:00
|
|
|
result_dqn = dqn_trainer.train()
|
|
|
|
print(pretty_print(result_dqn))

        # Improve the PPO policy.
        print("-- PPO --")
        result_ppo = ppo_trainer.train()
        print(pretty_print(result_ppo))

        # Test passed gracefully.
        if args.as_test and \
                result_dqn["episode_reward_mean"] > args.stop_reward and \
                result_ppo["episode_reward_mean"] > args.stop_reward:
            print("test passed (both agents above requested reward)")
            quit(0)

        # Swap weights to synchronize.
        dqn_trainer.set_weights(ppo_trainer.get_weights(["ppo_policy"]))
        ppo_trainer.set_weights(dqn_trainer.get_weights(["dqn_policy"]))
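        # `get_weights(["ppo_policy"])` returns a dict holding only that
        # policy's weights, so `set_weights()` on the other trainer updates
        # just that (non-trained) policy and leaves its own trained policy
        # untouched.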

    # Desired reward was not reached within --stop-iters iterations.
    if args.as_test:
        raise ValueError("Desired reward ({}) not reached!".format(
            args.stop_reward))