[rllib] Add examples page, add hierarchical training example, delete SC2 examples (#3815)

* wip

* lint

* wip

* up

* wip

* update examples

* wip

* remove carla

* update

* improve envspec

* link to custom

* Update rllib-env.rst

* update

* fix

* fn

* lint

* ds

* ssd games

* desc

* fix up docs

* fix
Eric Liang 2019-01-29 21:06:09 -08:00 committed by GitHub
parent c9819a721d
commit fb73cedf70
13 changed files with 396 additions and 210 deletions

View file

@@ -99,6 +99,7 @@ Ray comes with libraries that accelerate deep learning and reinforcement learnin
rllib-dev.rst
rllib-concepts.rst
rllib-package-ref.rst
rllib-examples.rst
.. toctree::
:maxdepth: 1

View file

@@ -62,6 +62,8 @@ You can also register a custom env creator function with a string name. This fun
register_env("my_env", env_creator)
trainer = ppo.PPOAgent(env="my_env")
For a full runnable code example using the custom environment API, see `custom_env.py <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/custom_env.py>`__.
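For illustration, here is a minimal sketch of the creator-function pattern (``MyEnv`` is a placeholder for your own ``gym.Env`` subclass, not part of RLlib):
.. code-block:: python

    from ray.tune.registry import register_env

    def env_creator(env_config):
        # env_config is the "env_config" dict from the agent config
        return MyEnv(env_config)  # MyEnv: your own gym.Env subclass

    register_env("my_env", env_creator)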
Configuring Environments
------------------------
@@ -103,6 +105,10 @@ There are two ways to scale experience collection with Gym environments:
You can also combine vectorization and distributed execution, as shown in the above figure. Here we plot just the throughput of RLlib policy evaluation from 1 to 128 CPUs. PongNoFrameskip-v4 on GPU scales from 2.4k to 200k actions/s, and Pendulum-v0 on CPU from 15k to 1.5M actions/s. One machine was used for 1-16 workers, and a Ray cluster of four machines for 32-128 workers. Each worker was configured with ``num_envs_per_worker=64``.
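A rough sketch of how the two scaling knobs combine in an agent config (the values below are illustrative, not the benchmark settings above):
.. code-block:: python

    import ray
    from ray.rllib.agents.ppo import PPOAgent

    ray.init()
    trainer = PPOAgent(
        env="PongNoFrameskip-v4",
        config={
            "num_workers": 16,          # parallel rollout workers (Ray actors)
            "num_envs_per_worker": 64,  # vectorized env copies inside each worker
        })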
Expensive Environments
~~~~~~~~~~~~~~~~~~~~~~
Some environments may be very resource-intensive to create. RLlib creates ``num_workers + 1`` copies of the environment, since one copy is needed for the driver process. Because the driver copy is only needed to access the env's action and observation spaces, you can avoid most of its overhead by deferring any expensive initialization until ``reset()`` is called.
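A minimal sketch of this pattern, assuming a hypothetical ``launch_expensive_simulator()`` helper for the costly part of setup:
.. code-block:: python

    import gym
    from gym.spaces import Box, Discrete

    class ExpensiveEnv(gym.Env):
        def __init__(self, env_config):
            # Keep __init__ cheap: only the spaces are needed by the driver copy.
            self.action_space = Discrete(2)
            self.observation_space = Box(-1.0, 1.0, shape=(4, ))
            self.sim = None

        def reset(self):
            if self.sim is None:
                # Deferred, expensive initialization; the driver copy never
                # calls reset(), so it never pays this cost.
                self.sim = launch_expensive_simulator()  # hypothetical helper
            return self.sim.reset()

        def step(self, action):
            return self.sim.step(action)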
Vectorized
----------
@@ -234,6 +240,8 @@ This can be implemented as a multi-agent environment with three types of agents.
In this setup, the appropriate rewards for training lower-level agents must be provided by the multi-agent env implementation. The environment class is also responsible for routing between the agents, e.g., conveying `goals <https://arxiv.org/pdf/1703.01161.pdf>`__ from higher-level agents to lower-level agents as part of the lower-level agent observation.
See this file for a runnable example: `hierarchical_training.py <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/hierarchical_training.py>`__.
Grouping Agents
~~~~~~~~~~~~~~~

View file

@@ -0,0 +1,67 @@
RLlib Examples
==============
This page is an index of examples for the various use cases and features of RLlib.
If any example is broken, or if you'd like to add an example to this page, feel free to raise an issue on our GitHub repository.
Tuned Examples
--------------
- `Tuned examples <https://github.com/ray-project/ray/blob/master/python/ray/rllib/tuned_examples>`__:
Collection of tuned algorithm hyperparameters.
- `Atari benchmarks <https://github.com/ray-project/rl-experiments>`__:
Collection of reasonably optimized Atari results.
Training Workflows
------------------
- `Custom training workflows <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/custom_train_fn.py>`__:
Example of how to use Tune's support for custom training functions to implement custom training workflows.
- `Curriculum learning <rllib-training.html#example-curriculum-learning>`__:
Example of how to adjust the configuration of an environment over time.
- `Custom metrics <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/custom_metrics_and_callbacks.py>`__:
Example of how to output custom training metrics to TensorBoard.
Custom Envs and Models
----------------------
- `Registering a custom env <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/custom_env.py>`__:
Example of defining and registering a gym env for use with RLlib.
- `Subprocess environment <https://github.com/ray-project/ray/blob/master/python/ray/rllib/test/test_env_with_subprocess.py>`__:
Example of how to ensure subprocesses spawned by envs are killed when RLlib exits.
- `Batch normalization <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/batch_norm_model.py>`__:
Example of adding batch norm layers to a custom model.
- `Parametric actions <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/parametric_action_cartpole.py>`__:
Example of how to handle variable-length or parametric action spaces.
Serving and Offline
-------------------
- `CartPole server <https://github.com/ray-project/ray/tree/master/python/ray/rllib/examples/serving>`__:
Example of online serving of predictions for a simple CartPole policy.
- `Saving experiences <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/saving_experiences.py>`__:
Example of how to externally generate experience batches in RLlib-compatible format.
Multi-Agent and Hierarchical
----------------------------
- `Two-step game <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/twostep_game.py>`__:
Example of the two-step game from the `QMIX paper <https://arxiv.org/pdf/1803.11485.pdf>`__.
- `Weight sharing between policies <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/multiagent_cartpole.py>`__:
Example of how to define weight-sharing layers between two different policies.
- `Multiple trainers <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/multiagent_two_trainers.py>`__:
Example of alternating training between a DQN trainer and a PPO trainer.
- `Hierarchical training <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/hierarchical_training.py>`__:
Example of hierarchical training using the multi-agent API.
Community Examples
------------------
- `Traffic Flow <https://berkeleyflow.readthedocs.io/en/latest/flow_setup.html>`__:
Example of optimizing mixed-autonomy traffic simulations with RLlib / multi-agent.
- `Roboschool / SageMaker <https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning/rl_roboschool_ray>`__:
Example of training robotic control policies in SageMaker with RLlib.
- `StarCraft2 <https://github.com/oxwhirl/smac>`__:
Example of training in StarCraft2 maps with RLlib / multi-agent.
- `Sequential Social Dilemma Games <https://github.com/eugenevinitsky/sequential_social_dilemma_games>`__:
Example of using the multi-agent API to model several `social dilemma games <https://arxiv.org/abs/1702.03037>`__.

View file

@@ -4,12 +4,17 @@ RLlib Offline Datasets
Working with Offline Datasets
-----------------------------
RLlib's I/O APIs enable you to work with datasets of experiences read from offline storage (e.g., disk, cloud storage, streaming systems, HDFS). For example, you might want to read experiences saved from previous training runs, or gathered from policies deployed in `web applications <https://arxiv.org/abs/1811.00260>`__. You can also log new agent experiences produced during online training for future use.
RLlib's offline dataset APIs enable working with experiences read from offline storage (e.g., disk, cloud storage, streaming systems, HDFS). For example, you might want to read experiences saved from previous training runs, or gathered from policies deployed in `web applications <https://arxiv.org/abs/1811.00260>`__. You can also log new agent experiences produced during online training for future use.
RLlib represents trajectory sequences (i.e., ``(s, a, r, s', ...)`` tuples) with `SampleBatch <https://github.com/ray-project/ray/blob/master/python/ray/rllib/evaluation/sample_batch.py>`__ objects. Using a batch format enables efficient encoding and compression of experiences. During online training, RLlib uses `policy evaluation <rllib-concepts.html#policy-evaluation>`__ actors to generate batches of experiences in parallel using the current policy. RLlib also uses this same batch format for reading and writing experiences to offline storage.
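As a hedged sketch of the same workflow through the Python API (agent classes as used elsewhere in these docs; the output path is illustrative):
.. code-block:: python

    import ray
    from ray.rllib.agents.pg import PGAgent
    from ray.rllib.agents.dqn import DQNAgent

    ray.init()

    # Phase 1: run online simulation and write experience batches to disk.
    pg = PGAgent(env="CartPole-v0", config={"output": "/tmp/cartpole-out"})
    for _ in range(10):
        pg.train()

    # Phase 2: train DQN purely from the saved experiences.
    dqn = DQNAgent(env="CartPole-v0", config={"input": "/tmp/cartpole-out"})
    dqn.train()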
Example: Training on previously saved experiences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. note::
For custom models and environments, you'll need to use the `Python API <rllib-training.html#python-api>`__.
In this example, we will save batches of experiences generated during online training to disk, and then leverage this saved data to train a policy offline using DQN. First, we run a simple policy gradient algorithm for 100k steps with ``"output": "/tmp/cartpole-out"`` to tell RLlib to write simulation outputs to the ``/tmp/cartpole-out`` directory.
.. code-block:: bash

View file

@@ -110,7 +110,7 @@ Python API
The Python API provides the needed flexibility for applying RLlib to new problems. You will need to use this API if you wish to use `custom environments, preprocessors, or models <rllib-models.html>`__ with RLlib.
Here is an example of the basic usage:
Here is an example of the basic usage (for a more complete example, see `custom_env.py <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/custom_env.py>`__):
.. code-block:: python
@@ -175,6 +175,11 @@ Tune will schedule the trials to run in parallel on your Ray cluster:
- PPO_CartPole-v0_0_lr=0.01: RUNNING [pid=21940], 16 s, 4013 ts, 22 rew
- PPO_CartPole-v0_1_lr=0.001: RUNNING [pid=21942], 27 s, 8111 ts, 54.7 rew
Custom Training Workflows
~~~~~~~~~~~~~~~~~~~~~~~~~
In the `basic training example <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/custom_env.py>`__, Tune will call ``train()`` on your agent once per iteration and report the new training results. Sometimes, it is desirable to have full control over training, but still run inside Tune. Tune supports `custom trainable functions <tune-usage.html#training-api>`__ that can be used to implement `custom training workflows (example) <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/custom_train_fn.py>`__.
Accessing Policy State
~~~~~~~~~~~~~~~~~~~~~~
It is common to need to access an agent's internal state, e.g., to set or get internal weights. In RLlib an agent's state is replicated across multiple *policy evaluators* (Ray actors) in the cluster. However, you can easily get and update this state between calls to ``train()`` via ``agent.optimizer.foreach_evaluator()`` or ``agent.optimizer.foreach_evaluator_with_index()``. Both functions take a lambda that is called with the evaluator (and, for the indexed variant, its index) as arguments. Any values returned by the lambda are collected and returned as a list.
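For illustration, a minimal sketch of reading and broadcasting evaluator weights (this assumes the policy evaluators expose ``get_weights()`` / ``set_weights()``):
.. code-block:: python

    # Collect the current weights of every evaluator as a list.
    all_weights = agent.optimizer.foreach_evaluator(lambda ev: ev.get_weights())

    # Push the first evaluator's weights to all the others, using the index arg.
    agent.optimizer.foreach_evaluator_with_index(
        lambda ev, i: ev.set_weights(all_weights[0]) if i > 0 else None)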

View file

@@ -6,6 +6,7 @@ RLlib is an open-source library for reinforcement learning that offers both a co
.. image:: rllib-stack.svg
Learn more about RLlib's design by reading the `ICML paper <https://arxiv.org/abs/1712.09381>`__.
To get started, take a look over the `custom env example <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/custom_env.py>`__ and the `API documentation <rllib-training.html>`__.
Installation
------------
@@ -118,6 +119,11 @@ Package Reference
* `ray.rllib.optimizers <rllib-package-ref.html#module-ray.rllib.optimizers>`__
* `ray.rllib.utils <rllib-package-ref.html#module-ray.rllib.utils>`__
Examples
--------
You can find an index of RLlib code examples on `this page <rllib-examples.html>`__. This includes tuned hyperparameters, demo scripts showing how to use specific features of RLlib, and several community examples of applications built on RLlib.
Troubleshooting
---------------

View file

@@ -130,7 +130,7 @@ COMMON_CONFIG = {
# Drop metric batches from unresponsive workers after this many seconds
"collect_metrics_timeout": 180,
# === Offline Data Input / Output ===
# === Offline Datasets ===
# __sphinx_doc_input_begin__
# Specify how to generate experiences:
# - "sampler": generate experiences via online simulation (default)

View file

@@ -1,4 +1,11 @@
"""Example of a custom gym environment. Run this for a demo."""
"""Example of a custom gym environment. Run this for a demo.
This example shows:
- using a custom environment
- using Tune for grid search
You can visualize experiment results in ~/ray_results using TensorBoard.
"""
from __future__ import absolute_import
from __future__ import division
@@ -7,10 +14,9 @@ from __future__ import print_function
import numpy as np
import gym
from gym.spaces import Discrete, Box
from gym.envs.registration import EnvSpec
import ray
from ray.tune import run_experiments
from ray.tune import run_experiments, grid_search
class SimpleCorridor(gym.Env):
@@ -24,7 +30,6 @@ class SimpleCorridor(gym.Env):
self.action_space = Discrete(2)
self.observation_space = Box(
0.0, self.end_pos, shape=(1, ), dtype=np.float32)
self._spec = EnvSpec("SimpleCorridor-{}-v0".format(self.end_pos))
def reset(self):
self.cur_pos = 0
@@ -48,7 +53,12 @@ if __name__ == "__main__":
"demo": {
"run": "PPO",
"env": SimpleCorridor, # or "corridor" if registered above
"stop": {
"timesteps_total": 10000,
},
"config": {
"lr": grid_search([1e-2, 1e-4, 1e-6]), # try different lrs
"num_workers": 1, # parallelism
"env_config": {
"corridor_length": 5,
},

View file

@@ -0,0 +1,54 @@
"""Example of a custom training workflow. Run this for a demo.
This example shows:
- using Tune trainable functions to implement custom training workflows
You can visualize experiment results in ~/ray_results using TensorBoard.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import ray
from ray.rllib.agents.ppo import PPOAgent
from ray.tune import run_experiments
def my_train_fn(config, reporter):
# Train for 10 iterations with high LR
agent1 = PPOAgent(env="CartPole-v0", config=config)
for _ in range(10):
result = agent1.train()
result["phase"] = 1
reporter(**result)
phase1_time = result["timesteps_total"]
state = agent1.save()
agent1.stop()
# Train for 10 iterations with low LR
config["lr"] = 0.0001
agent2 = PPOAgent(env="CartPole-v0", config=config)
agent2.restore(state)
for _ in range(10):
result = agent2.train()
result["phase"] = 2
result["timesteps_total"] += phase1_time # keep time moving forward
reporter(**result)
agent2.stop()
if __name__ == "__main__":
ray.init()
run_experiments({
"demo": {
"run": my_train_fn,
"resources_per_trial": {
"cpu": 1,
},
"config": {
"lr": 0.01,
"num_workers": 0,
},
},
})

View file

@@ -0,0 +1,233 @@
"""Example of hierarchical training using the multi-agent API.
The example env is that of a "windy maze". The agent observes the current wind
direction and can either choose to stand still, or move in that direction.
You can try out the env directly with:
$ python hierarchical_training.py --flat
A simple hierarchical formulation involves a high-level agent that issues goals
(i.e., go north / south / east / west), and a low-level agent that executes
these goals over a number of time-steps. This can be implemented as a
multi-agent environment with a top-level agent and low-level agents spawned
for each higher-level action. The lower level agent is rewarded for moving
in the right direction.
You can try this formulation with:
$ python hierarchical_training.py # gets ~100 rew after ~100k timesteps
Note that the hierarchical formulation actually converges slightly slower than
using --flat in this example.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import random
import gym
from gym.spaces import Box, Discrete, Tuple
import logging
import ray
from ray.tune import run_experiments, function
from ray.rllib.env import MultiAgentEnv
from ray.rllib.agents.ppo import PPOAgent
parser = argparse.ArgumentParser()
parser.add_argument("--flat", action="store_true")
# Agent has to traverse the maze from the starting position S -> F
# Observation space [x_pos, y_pos, wind_direction]
# Action space: stay still OR move in current wind direction
MAP_DATA = """
#########
#S #
####### #
# #
# #
####### #
#F #
#########"""
logger = logging.getLogger(__name__)
class WindyMazeEnv(gym.Env):
def __init__(self, env_config):
self.map = [m for m in MAP_DATA.split("\n") if m]
self.x_dim = len(self.map)
self.y_dim = len(self.map[0])
logger.info("Loaded map {} {}".format(self.x_dim, self.y_dim))
for x in range(self.x_dim):
for y in range(self.y_dim):
if self.map[x][y] == "S":
self.start_pos = (x, y)
elif self.map[x][y] == "F":
self.end_pos = (x, y)
logger.info("Start pos {} end pos {}".format(self.start_pos,
self.end_pos))
self.observation_space = Tuple([
Box(0, 100, shape=(2, )), # (x, y)
Discrete(4), # wind direction (N, E, S, W)
])
self.action_space = Discrete(2) # whether to move or not
def reset(self):
self.wind_direction = random.choice([0, 1, 2, 3])
self.pos = self.start_pos
self.num_steps = 0
return [[self.pos[0], self.pos[1]], self.wind_direction]
def step(self, action):
if action == 1:
self.pos = self._get_new_pos(self.pos, self.wind_direction)
self.num_steps += 1
self.wind_direction = random.choice([0, 1, 2, 3])
at_goal = self.pos == self.end_pos
done = at_goal or self.num_steps >= 200
return ([[self.pos[0], self.pos[1]], self.wind_direction],
100 * int(at_goal), done, {})
def _get_new_pos(self, pos, direction):
if direction == 0:
new_pos = (pos[0] - 1, pos[1])
elif direction == 1:
new_pos = (pos[0], pos[1] + 1)
elif direction == 2:
new_pos = (pos[0] + 1, pos[1])
elif direction == 3:
new_pos = (pos[0], pos[1] - 1)
if (new_pos[0] >= 0 and new_pos[0] < self.x_dim and new_pos[1] >= 0
and new_pos[1] < self.y_dim
and self.map[new_pos[0]][new_pos[1]] != "#"):
return new_pos
else:
return pos # did not move
class HierarchicalWindyMazeEnv(MultiAgentEnv):
def __init__(self, env_config):
self.flat_env = WindyMazeEnv(env_config)
def reset(self):
self.cur_obs = self.flat_env.reset()
self.current_goal = None
self.steps_remaining_at_level = None
self.num_high_level_steps = 0
# current low level agent id. This must be unique for each high level
# step since agent ids cannot be reused.
self.low_level_agent_id = "low_level_{}".format(
self.num_high_level_steps)
return {
"high_level_agent": self.cur_obs,
}
def step(self, action_dict):
assert len(action_dict) == 1, action_dict
if "high_level_agent" in action_dict:
return self._high_level_step(action_dict["high_level_agent"])
else:
return self._low_level_step(list(action_dict.values())[0])
def _high_level_step(self, action):
logger.debug("High level agent sets goal".format(action))
self.current_goal = action
self.steps_remaining_at_level = 25
self.num_high_level_steps += 1
self.low_level_agent_id = "low_level_{}".format(
self.num_high_level_steps)
obs = {self.low_level_agent_id: [self.cur_obs, self.current_goal]}
rew = {self.low_level_agent_id: 0}
done = {"__all__": False}
return obs, rew, done, {}
def _low_level_step(self, action):
logger.debug("Low level agent step {}".format(action))
self.steps_remaining_at_level -= 1
cur_pos = tuple(self.cur_obs[0])
goal_pos = self.flat_env._get_new_pos(cur_pos, self.current_goal)
# Step in the actual env
f_obs, f_rew, f_done, _ = self.flat_env.step(action)
new_pos = tuple(f_obs[0])
self.cur_obs = f_obs
# Calculate low-level agent observation and reward
obs = {self.low_level_agent_id: [f_obs, self.current_goal]}
if new_pos != cur_pos:
if new_pos == goal_pos:
rew = {self.low_level_agent_id: 1}
else:
rew = {self.low_level_agent_id: -1}
else:
rew = {self.low_level_agent_id: 0}
# Handle env termination & transitions back to higher level
done = {"__all__": False}
if f_done:
done["__all__"] = True
logger.debug("high level final reward {}".format(f_rew))
rew["high_level_agent"] = f_rew
obs["high_level_agent"] = f_obs
elif self.steps_remaining_at_level == 0:
done[self.low_level_agent_id] = True
rew["high_level_agent"] = 0
obs["high_level_agent"] = f_obs
return obs, rew, done, {}
if __name__ == "__main__":
args = parser.parse_args()
ray.init()
if args.flat:
run_experiments({
"maze_single": {
"run": "PPO",
"env": WindyMazeEnv,
"config": {
"num_workers": 0,
},
},
})
else:
maze = WindyMazeEnv(None)
def policy_mapping_fn(agent_id):
if agent_id.startswith("low_level_"):
return "low_level_policy"
else:
return "high_level_policy"
run_experiments({
"maze_hier": {
"run": "PPO",
"env": HierarchicalWindyMazeEnv,
"config": {
"num_workers": 0,
"log_level": "INFO",
"entropy_coeff": 0.01,
"multiagent": {
"policy_graphs": {
"high_level_policy": (PPOAgent._policy_graph,
maze.observation_space,
Discrete(4), {
"gamma": 0.9
}),
"low_level_policy": (PPOAgent._policy_graph,
Tuple([
maze.observation_space,
Discrete(4)
]), maze.action_space, {
"gamma": 0.0
}),
},
"policy_mapping_fn": function(policy_mapping_fn),
},
},
},
})

View file

@@ -1,18 +0,0 @@
StarCraft on RLlib
==================
This builds off the StarCraft env in https://github.com/oxwhirl/pymarl_alpha.
Temporary instructions
----------------------
To install, run
```
git clone https://github.com/oxwhirl/pymarl_alpha
mv pymarl_alpha ~/pymarl
cd ~/pymarl
install_sc1.sh
install_sc2.sh
export PYMARL_PATH="~/pymarl"
```

View file

@@ -1,32 +0,0 @@
## Adapted from `https://github.com/oxwhirl/pymarl_alpha`.
env: sc2
env_args:
map_name: "3m_3m" # SC2 map name
difficulty: "7" # Very hard
move_amount: 2 # How far units are ordered to move per step
step_mul: 8 # How many frames are skipped per step
reward_sparse: False # Only +1/-1 reward for win/defeat (the rest of reward configs are ignored if True)
reward_only_positive: True # Reward is always positive
reward_negative_scale: 0.5 # How much to scale negative rewards, ignored if reward_only_positive=True
reward_death_value: 10 # Reward for killing an enemy unit and penalty for having an allied unit killed (if reward_only_positive=False)
reward_scale: True # Whether or not to scale rewards before returning to agents
reward_scale_rate: 20 # If reward_scale=True, the agents receive the reward of (max_reward / reward_scale_rate), where max_reward is the maximum possible reward per episode
reward_win: 200 # Reward for win
reward_defeat: 0 # Reward for defeat (should be nonpositive)
state_last_action: True # Whether the last actions of units are a part of the state
obs_instead_of_state: False # Use combination of all agents' observations as state
obs_own_health: True # Whether agents receive their own health as a part of observation
obs_all_health: True # Whether agents receive the health of all units (in the sight range) as a part of observation
continuing_episode: False # Stop/continue episode after its termination
game_version: "4.1.2" # Ignored for Mac/Windows
save_replay_prefix: "" # Prefix of the replay to be saved
heuristic: False # Whether or not to use a simple non-learning heuristic as a controller
test_nepisode: 32
test_interval: 10000
log_interval: 2000
runner_log_interval: 2000
learner_log_interval: 2000
t_max: 2000000

View file

@@ -1,153 +0,0 @@
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
from gym.spaces import Discrete, Box, Dict, Tuple
import os
import sys
import tensorflow as tf
import tensorflow.contrib.slim as slim
import yaml
import ray
from ray.rllib.env.multi_agent_env import MultiAgentEnv
from ray.tune.registry import register_env
from ray.rllib.models import Model, ModelCatalog
from ray.rllib.models.misc import normc_initializer
from ray.rllib.agents.qmix import QMixAgent
from ray.rllib.agents.pg import PGAgent
from ray.rllib.agents.ppo import PPOAgent
from ray.tune.logger import pretty_print
class MaskedActionsModel(Model):
def _build_layers_v2(self, input_dict, num_outputs, options):
action_mask = input_dict["obs"]["action_mask"]
if num_outputs != action_mask.shape[1].value:
raise ValueError(
"This model assumes num outputs is equal to max avail actions",
num_outputs, action_mask)
# Standard FC net component.
last_layer = input_dict["obs"]["obs"]
hiddens = [256, 256]
for i, size in enumerate(hiddens):
label = "fc{}".format(i)
last_layer = slim.fully_connected(
last_layer,
size,
weights_initializer=normc_initializer(1.0),
activation_fn=tf.nn.tanh,
scope=label)
action_logits = slim.fully_connected(
last_layer,
num_outputs,
weights_initializer=normc_initializer(0.01),
activation_fn=None,
scope="fc_out")
# Mask out invalid actions (use tf.float32.min for stability)
inf_mask = tf.maximum(tf.log(action_mask), tf.float32.min)
masked_logits = inf_mask + action_logits
return masked_logits, last_layer
class SC2MultiAgentEnv(MultiAgentEnv):
"""RLlib Wrapper around StarCraft2."""
def __init__(self, override_cfg):
PYMARL_PATH = override_cfg.pop("pymarl_path")
os.environ["SC2PATH"] = os.path.join(PYMARL_PATH,
"3rdparty/StarCraftII")
sys.path.append(os.path.join(PYMARL_PATH, "src"))
from envs.starcraft2 import StarCraft2Env
curpath = os.path.dirname(os.path.abspath(__file__))
with open(os.path.join(curpath, "sc2.yaml")) as f:
pymarl_args = yaml.load(f)
pymarl_args.update(override_cfg)
pymarl_args["env_args"].setdefault("seed", 0)
self._starcraft_env = StarCraft2Env(**pymarl_args)
obs_size = self._starcraft_env.get_obs_size()
num_actions = self._starcraft_env.get_total_actions()
self.observation_space = Dict({
"action_mask": Box(0, 1, shape=(num_actions, )),
"obs": Box(-1, 1, shape=(obs_size, ))
})
self.action_space = Discrete(self._starcraft_env.get_total_actions())
def reset(self):
obs_list, state_list = self._starcraft_env.reset()
return_obs = {}
for i, obs in enumerate(obs_list):
return_obs[i] = {
"action_mask": self._starcraft_env.get_avail_agent_actions(i),
"obs": obs
}
return return_obs
def step(self, action_dict):
# TODO(rliaw): Check to handle missing agents, if any
actions = [action_dict[k] for k in sorted(action_dict)]
rew, done, info = self._starcraft_env.step(actions)
obs_list = self._starcraft_env.get_obs()
return_obs = {}
for i, obs in enumerate(obs_list):
return_obs[i] = {
"action_mask": self._starcraft_env.get_avail_agent_actions(i),
"obs": obs
}
rews = {i: rew / len(obs_list) for i in range(len(obs_list))}
dones = {i: done for i in range(len(obs_list))}
dones["__all__"] = done
infos = {i: info for i in range(len(obs_list))}
return return_obs, rews, dones, infos
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--num-iters", type=int, default=100)
parser.add_argument("--run", type=str, default="qmix")
args = parser.parse_args()
path_to_pymarl = os.environ.get("PYMARL_PATH",
os.path.expanduser("~/pymarl/"))
ray.init()
ModelCatalog.register_custom_model("mask_model", MaskedActionsModel)
register_env("starcraft", lambda cfg: SC2MultiAgentEnv(cfg))
agent_cfg = {
"observation_filter": "NoFilter",
"num_workers": 4,
"model": {
"custom_model": "mask_model",
},
"env_config": {
"pymarl_path": path_to_pymarl
}
}
if args.run.lower() == "qmix":
def grouped_sc2(cfg):
env = SC2MultiAgentEnv(cfg)
agent_list = list(range(env._starcraft_env.n_agents))
grouping = {
"group_1": agent_list,
}
obs_space = Tuple([env.observation_space for i in agent_list])
act_space = Tuple([env.action_space for i in agent_list])
return env.with_agent_groups(
grouping, obs_space=obs_space, act_space=act_space)
register_env("grouped_starcraft", grouped_sc2)
agent = QMixAgent(env="grouped_starcraft", config=agent_cfg)
elif args.run.lower() == "pg":
agent = PGAgent(env="starcraft", config=agent_cfg)
elif args.run.lower() == "ppo":
agent_cfg.update({"vf_share_layers": True})
agent = PPOAgent(env="starcraft", config=agent_cfg)
for i in range(args.num_iters):
print(pretty_print(agent.train()))