RLlib Environments
==================

RLlib works with several different types of environments, including `OpenAI Gym `__, user-defined, multi-agent, and also batched environments.

.. image:: rllib-envs.svg

In the high-level agent APIs, environments are identified with string names. By default, the string will be interpreted as a gym `environment name `__, however you can also register custom environments by name:

.. code-block:: python

    import ray
    from ray.tune.registry import register_env
    from ray.rllib.agents import ppo

    def env_creator(env_config):
        import gym
        return gym.make("CartPole-v0")  # or return your own custom env

    register_env("my_env", env_creator)
    ray.init()
    trainer = ppo.PPOAgent(env="my_env", config={
        "env_config": {},  # config to pass to env creator
    })

    while True:
        print(trainer.train())

Configuring Environments
------------------------

In the above example, note that the ``env_creator`` function takes in an ``env_config`` object. This is a dict containing options passed in through your agent. You can also access ``env_config.worker_index`` and ``env_config.vector_index`` to get the worker id and env id within the worker (if ``num_envs_per_worker > 0``). This can be useful if you want to train over an ensemble of different environments, for example:

.. code-block:: python

    class MultiEnv(gym.Env):
        def __init__(self, env_config):
            # pick actual env based on worker and env indexes
            self.env = gym.make(
                choose_env_for(env_config.worker_index, env_config.vector_index))
            self.action_space = self.env.action_space
            self.observation_space = self.env.observation_space

        def reset(self):
            return self.env.reset()

        def step(self, action):
            return self.env.step(action)

    register_env("multienv", lambda config: MultiEnv(config))

OpenAI Gym
----------

RLlib uses Gym as its environment interface for single-agent training. For more information on how to implement a custom Gym environment, see the `gym.Env class definition `__. You may also find the `SimpleCorridor `__ and `Carla simulator `__ example env implementations useful as a reference.

Performance
~~~~~~~~~~~

There are two ways to scale experience collection with Gym environments:

1. **Vectorization within a single process:** Though many envs can achieve very high frame rates per core, their throughput is limited in practice by policy evaluation between steps. For example, even small TensorFlow models incur a couple of milliseconds of latency to evaluate. This can be worked around by creating multiple envs per process and batching policy evaluations across these envs.

   You can configure ``{"num_envs_per_worker": M}`` to have RLlib create ``M`` concurrent environments per worker. RLlib auto-vectorizes Gym environments via `VectorEnv.wrap() `__.

2. **Distribute across multiple processes:** You can also have RLlib create multiple processes (Ray actors) for experience collection. In most algorithms this can be controlled by setting the ``{"num_workers": N}`` config.

.. image:: throughput.png

You can also combine vectorization and distributed execution, as shown in the above figure. Here we plot just the throughput of RLlib policy evaluation from 1 to 128 CPUs. PongNoFrameskip-v4 on GPU scales from 2.4k to ∼200k actions/s, and Pendulum-v0 on CPU from 15k to 1.5M actions/s. One machine was used for 1-16 workers, and a Ray cluster of four machines for 32-128 workers. Each worker was configured with ``num_envs_per_worker=64``.
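Both scaling knobs can be combined in a single agent config. The sketch below reuses the ``"my_env"`` registration from above; the worker and env counts are illustrative placeholders, not tuned values:

.. code-block:: python

    from ray.rllib.agents import ppo

    # Illustrative values only: 4 rollout workers (Ray actors), each stepping
    # 8 vectorized copies of the registered "my_env" environment.
    trainer = ppo.PPOAgent(env="my_env", config={
        "num_workers": 4,          # distribute experience collection across processes
        "num_envs_per_worker": 8,  # batch policy evaluation over envs within each process
        "env_config": {},          # config passed to the env creator
    })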
Vectorized
----------

RLlib will auto-vectorize Gym envs for batch evaluation if the ``num_envs_per_worker`` config is set, or you can define a custom environment class that subclasses `VectorEnv `__ to implement ``vector_step()`` and ``vector_reset()``.

Multi-Agent
-----------

A multi-agent environment is one which has multiple acting entities per step, e.g., in a traffic simulation, there may be multiple "car" and "traffic light" agents in the environment. The model for multi-agent in RLlib is as follows: (1) as a user, you define the number of policies available up front, and (2) you define a function that maps agent ids to policy ids. This is summarized by the figure below:

.. image:: multi-agent.svg

The environment itself must subclass the `MultiAgentEnv `__ interface, which can return observations and rewards from multiple ready agents per step:

.. code-block:: python

    # Example: using a multi-agent env
    > env = MultiAgentTrafficEnv(num_cars=20, num_traffic_lights=5)

    # Observations are a dict mapping agent names to their obs. Not all agents
    # may be present in the dict in each time step.
    > print(env.reset())
    {
        "car_1": [[...]],
        "car_2": [[...]],
        "traffic_light_1": [[...]],
    }

    # Actions should be provided for each agent that returned an observation.
    > new_obs, rewards, dones, infos = env.step(actions={"car_1": ..., "car_2": ...})

    # Similarly, new_obs, rewards, dones, etc. also become dicts
    > print(rewards)
    {"car_1": 3, "car_2": -1, "traffic_light_1": 0}

    # Individual agents can early exit; the env is done when "__all__" = True
    > print(dones)
    {"car_2": True, "__all__": False}

If all the agents will be using the same algorithm class to train, then you can set up multi-agent training as follows:

.. code-block:: python

    trainer = pg.PGAgent(env="my_multiagent_env", config={
        "multiagent": {
            "policy_graphs": {
                "car1": (PGPolicyGraph, car_obs_space, car_act_space, {"gamma": 0.85}),
                "car2": (PGPolicyGraph, car_obs_space, car_act_space, {"gamma": 0.99}),
                "traffic_light": (PGPolicyGraph, tl_obs_space, tl_act_space, {}),
            },
            "policy_mapping_fn":
                lambda agent_id:
                    "traffic_light"  # Traffic lights are always controlled by this policy
                    if agent_id.startswith("traffic_light_")
                    else random.choice(["car1", "car2"])  # Randomly choose from car policies
        },
    })

    while True:
        print(trainer.train())

RLlib will create three distinct policies and route each agent's decisions to its bound policy. When an agent first appears in the env, ``policy_mapping_fn`` will be called to determine which policy it is bound to. RLlib reports separate training statistics for each policy in the return from ``train()``, along with the combined reward.

Here is a simple `example training script `__ in which you can vary the number of agents and policies in the environment. For how to use multiple training methods at once (here DQN and PPO), see the `two-trainer example `__.

To scale to hundreds of agents, MultiAgentEnv batches policy evaluations across multiple agents internally. It can also be auto-vectorized by setting ``num_envs_per_worker > 1``.

Agent-Driven
------------

In many situations, it does not make sense for an environment to be "stepped" by RLlib. For example, if a policy is to be used in a web serving system, then it is more natural for an agent to query a service that serves policy decisions, and for that service to learn from experience over time. RLlib provides the `ServingEnv `__ class for this purpose. Unlike other envs, ServingEnv has its own thread of control. At any point, agents on that thread can query the current policy for decisions via ``self.get_action()`` and report rewards via ``self.log_returns()``. This can be done for multiple concurrent episodes as well.
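The typical pattern is to subclass ServingEnv and implement a ``run()`` method that drives episodes from the external system. The sketch below is only illustrative: ``wait_for_request()`` and ``apply_action()`` are hypothetical hooks into your serving system, and the exact constructor and episode-management signatures may differ between RLlib versions.

.. code-block:: python

    import gym
    from ray.tune.registry import register_env
    from ray.rllib.env.serving_env import ServingEnv  # module path may vary by version

    class MyServingEnv(ServingEnv):
        def __init__(self, env_config):
            # Declare the spaces the policy will be trained against.
            ServingEnv.__init__(
                self,
                action_space=gym.spaces.Discrete(2),
                observation_space=gym.spaces.Box(low=-1.0, high=1.0, shape=(4,)))

        def run(self):
            # Runs in its own thread; RLlib never calls step() on this env.
            while True:
                episode_id = self.start_episode()
                obs = wait_for_request()  # hypothetical: block until a client request arrives
                done = False
                while not done:
                    action = self.get_action(episode_id, obs)
                    obs, reward, done = apply_action(action)  # hypothetical: act in the external system
                    self.log_returns(episode_id, reward)
                self.end_episode(episode_id, obs)

    register_env("my_serving_env", lambda config: MyServingEnv(config))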
For example, ServingEnv can be used to implement a simple REST policy `server `__ that learns over time using RLlib. In this example RLlib runs with ``num_workers=0`` to avoid port allocation issues, but in principle this could be scaled by increasing ``num_workers``.

Offline Data
~~~~~~~~~~~~

ServingEnv also provides a ``self.log_action()`` call to support off-policy actions. This allows the client to make independent decisions, e.g., to compare two different policies, and for RLlib to still learn from those off-policy actions. Note that this requires the algorithm used to support learning from off-policy decisions (e.g., DQN).

The ``log_action`` API of ServingEnv can be used to ingest data from offline logs. The pattern would be as follows: First, some policy is followed to produce experience data, which is stored in some offline storage system. Then, RLlib creates a number of workers that use a ServingEnv to read the logs in parallel and ingest the experiences. After a round of training completes, the new policy can be deployed to collect more experiences.

Note that envs can read from different partitions of the logs based on the ``worker_index`` attribute of the `env context `__ passed into the environment constructor.

Batch Asynchronous
------------------

The lowest-level "catch-all" environment supported by RLlib is `AsyncVectorEnv `__. AsyncVectorEnv models multiple agents executing asynchronously in multiple environments. A call to ``poll()`` returns observations from ready agents keyed by their environment and agent ids, and actions for those agents can be sent back via ``send_actions()``. This interface can be subclassed directly to support batched simulators such as `ELF `__.

Under the hood, all other envs are converted to AsyncVectorEnv by RLlib so that there is a common internal path for policy evaluation.
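To illustrate the shape of this interface, here is a rough sketch of a serving loop over an ``AsyncVectorEnv`` instance ``async_env``. The five-element return of ``poll()`` and the nested ``{env_id: {agent_id: ...}}`` dict layout are assumptions based on the description above, and ``compute_action_for()`` is a hypothetical stand-in for RLlib's policy evaluation:

.. code-block:: python

    # Rough sketch of an evaluation loop against an AsyncVectorEnv.
    # Assumed: poll() returns per-env, per-agent dicts, and send_actions()
    # accepts the same nesting. compute_action_for() is a hypothetical placeholder.
    while True:
        obs, rewards, dones, infos, off_policy_actions = async_env.poll()
        actions = {}
        for env_id, agent_obs in obs.items():
            actions[env_id] = {
                agent_id: compute_action_for(agent_id, ob)
                for agent_id, ob in agent_obs.items()
            }
        async_env.send_actions(actions)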