RLlib Environments
==================
RLlib works with several different types of environments, including `OpenAI Gym `__, user-defined, multi-agent, and batched environments.
.. image:: rllib-envs.svg
In the high-level agent APIs, environments are identified with string names. By default, the string will be interpreted as a gym `environment name `__; however, you can also register custom environments by name:
.. code-block:: python

    import ray
    from ray.tune.registry import register_env
    from ray.rllib import ppo

    def env_creator(env_config):
        import gym
        return gym.make("CartPole-v0")  # or return your own custom env

    register_env("my_env", env_creator)
    ray.init()
    trainer = ppo.PPOAgent(env="my_env", config={
        "env_config": {},  # config to pass to env creator
    })

    while True:
        print(trainer.train())
OpenAI Gym
----------
RLlib uses Gym as its environment interface for single-agent training. For more information on how to implement a custom Gym environment, see the `gym.Env class definition `__. You may also find the `SimpleCorridor `__ and `Carla simulator `__ example env implementations useful as a reference.
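For reference, here is a minimal sketch of a custom Gym env, loosely modeled on the SimpleCorridor example (a toy corridor task); the class and the ``corridor_length`` config key are illustrative, not part of RLlib:

.. code-block:: python

    import gym
    from gym.spaces import Box, Discrete

    class SimpleCorridor(gym.Env):
        """Toy env: start at position 0 and walk right to reach the goal."""

        def __init__(self, config):
            self.end_pos = config.get("corridor_length", 10)
            self.cur_pos = 0
            self.action_space = Discrete(2)  # 0: move left, 1: move right
            self.observation_space = Box(0.0, self.end_pos, shape=(1,))

        def reset(self):
            self.cur_pos = 0
            return [self.cur_pos]

        def step(self, action):
            if action == 0 and self.cur_pos > 0:
                self.cur_pos -= 1
            elif action == 1:
                self.cur_pos += 1
            done = self.cur_pos >= self.end_pos
            # Reward of 1 for reaching the end of the corridor, else 0.
            return [self.cur_pos], 1.0 if done else 0.0, done, {}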
Performance
~~~~~~~~~~~
There are two ways to scale experience collection with Gym environments:
1. **Vectorization within a single process:** Though many envs can achieve very high frame rates per core, their throughput is limited in practice by policy evaluation between steps. For example, even small TensorFlow models incur a couple milliseconds of latency to evaluate. This can be worked around by creating multiple envs per process and batching policy evaluations across these envs.
You can configure ``{"num_envs_per_worker": M}`` to have RLlib create ``M`` concurrent environments per worker. RLlib auto-vectorizes Gym environments via `VectorEnv.wrap() `__.
2. **Distribute across multiple processes:** You can also have RLlib create multiple processes (Ray actors) for experience collection. In most algorithms this can be controlled by setting the ``{"num_workers": N}`` config. A sketch combining both settings follows this list.
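As a minimal sketch (the agent class and the particular values are illustrative only), both settings are ordinary config keys:

.. code-block:: python

    import gym
    import ray
    from ray.rllib import ppo
    from ray.tune.registry import register_env

    register_env("my_env", lambda env_config: gym.make("CartPole-v0"))
    ray.init()

    # Combine both scaling knobs: 4 rollout worker processes (Ray actors),
    # each stepping a batch of 8 environment copies per policy evaluation.
    trainer = ppo.PPOAgent(env="my_env", config={
        "num_workers": 4,          # distribute collection across processes
        "num_envs_per_worker": 8,  # vectorize envs within each process
    })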
.. image:: throughput.png
You can also combine vectorization and distributed execution, as shown in the above figure. Here we plot just the throughput of RLlib policy evaluation from 1 to 128 CPUs. PongNoFrameskip-v4 on GPU scales from 2.4k to ∼200k actions/s, and Pendulum-v0 on CPU from 15k to 1.5M actions/s. One machine was used for 1-16 workers, and a Ray cluster of four machines for 32-128 workers. Each worker was configured with ``num_envs_per_worker=64``.
Vectorized
----------
RLlib will auto-vectorize Gym envs for batch evaluation if the ``num_envs_per_worker`` config is set, or you can define a custom environment class that subclasses `VectorEnv `__ to implement ``vector_step()`` and ``vector_reset()``.
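As a rough sketch of the subclassing route (the import path, the constructor conventions, and the space/``num_envs`` attributes shown are assumptions, and the real interface may require additional methods beyond ``vector_reset()`` and ``vector_step()``):

.. code-block:: python

    import gym
    from ray.rllib.env.vector_env import VectorEnv  # assumed import path

    class CartPoleVectorEnv(VectorEnv):
        """Illustrative VectorEnv stepping N CartPole copies as a batch."""

        def __init__(self, num_envs=8):
            self.envs = [gym.make("CartPole-v0") for _ in range(num_envs)]
            # Assumed attributes that RLlib reads off the env object.
            self.num_envs = num_envs
            self.observation_space = self.envs[0].observation_space
            self.action_space = self.envs[0].action_space

        def vector_reset(self):
            # Reset all sub-envs and return the batch of initial observations.
            return [env.reset() for env in self.envs]

        def vector_step(self, actions):
            # Step each sub-env with its action and return batched results.
            obs_batch, rew_batch, done_batch, info_batch = [], [], [], []
            for env, action in zip(self.envs, actions):
                obs, rew, done, info = env.step(action)
                obs_batch.append(obs)
                rew_batch.append(rew)
                done_batch.append(done)
                info_batch.append(info)
            return obs_batch, rew_batch, done_batch, info_batch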
Multi-Agent
-----------
A multi-agent environment is one which has multiple acting entities per step, e.g., in a traffic simulation, there may be multiple "car" and "traffic light" agents in the environment. The model for multi-agent in RLlib is as follows: (1) as a user, you define the number of policies available up front, and (2) you provide a function that maps agent ids to policy ids. This is summarized by the figure below:
.. image:: multi-agent.svg
The environment itself must subclass the `MultiAgentEnv `__ interface, which can return observations and rewards from multiple ready agents per step:
.. code-block:: python

    # Example: using a multi-agent env
    > env = MultiAgentTrafficEnv(num_cars=20, num_traffic_lights=5)

    # Observations are a dict mapping agent names to their obs. Not all agents
    # may be present in the dict in each time step.
    > print(env.reset())
    {
        "car_1": [[...]],
        "car_2": [[...]],
        "traffic_light_1": [[...]],
    }

    # Actions should be provided for each agent that returned an observation.
    > new_obs, rewards, dones, infos = env.step(actions={"car_1": ..., "car_2": ...})

    # Similarly, new_obs, rewards, dones, etc. also become dicts
    > print(rewards)
    {"car_1": 3, "car_2": -1, "traffic_light_1": 0}

    # Individual agents can early exit; env is done when "__all__" = True
    > print(dones)
    {"car_2": True, "__all__": False}
If all the agents will be using the same algorithm class to train, then you can set up multi-agent training as follows:
.. code-block:: python

    trainer = pg.PGAgent(env="my_multiagent_env", config={
        "multiagent": {
            "policy_graphs": {
                "car1": (PGPolicyGraph, car_obs_space, car_act_space, {"gamma": 0.85}),
                "car2": (PGPolicyGraph, car_obs_space, car_act_space, {"gamma": 0.99}),
                "traffic_light": (PGPolicyGraph, tl_obs_space, tl_act_space, {}),
            },
            "policy_mapping_fn":
                lambda agent_id:
                    "traffic_light"  # Traffic lights are always controlled by this policy
                    if agent_id.startswith("traffic_light_")
                    else random.choice(["car1", "car2"]),  # Randomly choose from car policies
        },
    })

    while True:
        print(trainer.train())
RLlib will create three distinct policies and route each agent's decisions to its bound policy. When an agent first appears in the env, ``policy_mapping_fn`` will be called to determine which policy it is bound to. RLlib reports separate training statistics for each policy in the return from ``train()``, along with the combined reward.
Here is a simple `example training script `__ in which you can vary the number of agents and policies in the environment. For how to use multiple training methods at once (here DQN and PPO), see the `two-trainer example `__.
To scale to hundreds of agents, MultiAgentEnv batches policy evaluations across multiple agents internally. It can also be auto-vectorized by setting ``num_envs_per_worker > 1``.
Serving
-------
In many situations, it does not make sense for an environment to be "stepped" by RLlib. For example, if a policy is to be used in a web serving system, then it is more natural to instead *query* a service that serves policy decisions, and for that service to learn from experience over time.
RLlib provides the `ServingEnv `__ class for this purpose. Unlike other envs, ServingEnv runs as its own thread of control. At any point, that thread can query the current policy for decisions via ``self.get_action()`` and report rewards via ``self.log_returns()``. This can be done for multiple concurrent episodes as well.
For example, ServingEnv can be used to implement a simple REST policy `server `__ that learns over time using RLlib. In this example RLlib runs with ``num_workers=0`` to avoid port allocation issues, but in principle this could be scaled by increasing ``num_workers``.
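The sketch below illustrates the pattern by driving a local simulator from the env's own thread; the import path, the constructor arguments, the episode-bookkeeping helpers (``start_episode``/``end_episode``), and the exact method signatures are assumptions rather than the definitive API:

.. code-block:: python

    import gym
    from ray.rllib.env.serving_env import ServingEnv  # assumed import path

    class CartPoleServing(ServingEnv):
        """Illustrative ServingEnv that runs episodes from its own thread."""

        def __init__(self):
            self.env = gym.make("CartPole-v0")
            # Assumed constructor signature: (action_space, observation_space).
            ServingEnv.__init__(
                self, self.env.action_space, self.env.observation_space)

        def run(self):
            # This loop runs in the env's own thread of control. Here it just
            # steps a local simulator; a real serving env would instead answer
            # external requests (e.g., from a web frontend).
            while True:
                eid = self.start_episode()  # assumed episode-bookkeeping helper
                obs = self.env.reset()
                done = False
                while not done:
                    action = self.get_action(eid, obs)  # query current policy
                    obs, reward, done, _ = self.env.step(action)
                    self.log_returns(eid, reward)       # report reward to RLlib
                self.end_episode(eid, obs)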
Offline Data
~~~~~~~~~~~~
ServingEnv also provides a ``self.log_action()`` call to support off-policy actions. This allows the client to make independent decisions, e.g., to compare two different policies, and for RLlib to still learn from those off-policy actions. Note that this requires an algorithm that supports learning from off-policy decisions (e.g., DQN).
The ``log_action`` API of ServingEnv can be used to ingest data from offline logs. The pattern would be as follows: First, some policy is followed to produce experience data which is stored in some offline storage system. Then, RLlib creates a number of workers that use a ServingEnv to read the logs in parallel and ingest the experiences. After a round of training completes, the new policy can be deployed to collect more experiences.
Note that envs can read from different partitions of the logs based on the ``worker_index`` attribute of the `env context `__ passed into the environment constructor.
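A rough sketch of that pattern, reusing the assumed ServingEnv signatures from the previous sketch; the ``load_episodes`` reader is a stub, and the log layout and placeholder spaces are hypothetical:

.. code-block:: python

    import gym
    from ray.rllib.env.serving_env import ServingEnv  # assumed import path

    def load_episodes(path):
        # Stub log reader: yield lists of (obs, action, reward, next_obs)
        # tuples from offline storage. Replace with your own client.
        return []

    class LogReplayEnv(ServingEnv):
        """Hypothetical env that replays one log partition as off-policy data."""

        def __init__(self, env_config):
            # Each rollout worker reads a different partition, selected via
            # the worker_index attribute of the env context.
            self.log_file = "logs/partition-{}.jsonl".format(env_config.worker_index)
            # Placeholder spaces for illustration only.
            ServingEnv.__init__(self, gym.spaces.Discrete(2),
                                gym.spaces.Box(-1.0, 1.0, shape=(4,)))

        def run(self):
            for episode in load_episodes(self.log_file):
                eid = self.start_episode()
                for obs, action, reward, next_obs in episode:
                    self.log_action(eid, obs, action)  # off-policy action taken
                    self.log_returns(eid, reward)
                self.end_episode(eid, next_obs)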
Batch Asynchronous
------------------
The lowest-level "catch-all" environment supported by RLlib is `AsyncVectorEnv `__. AsyncVectorEnv models multiple agents executing asynchronously in multiple environments. A call to ``poll()`` returns observations from ready agents keyed by their environment and agent ids, and actions for those agents can be sent back via ``send_actions()``. This interface can be subclassed directly to support batched simulators such as `ELF `__.
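A sketch of direct subclassing is shown below; the simulator handle and its methods are hypothetical, the import path is assumed, and the five-value return of ``poll()`` (beyond the per-env, per-agent keying described above) is an assumption:

.. code-block:: python

    from ray.rllib.env.async_vector_env import AsyncVectorEnv  # assumed import path

    class BatchedSimEnv(AsyncVectorEnv):
        """Illustrative wrapper around a hypothetical batched simulator."""

        def __init__(self, sim):
            self.sim = sim  # hypothetical handle to a batched simulator

        def poll(self):
            # Return dicts keyed by env id, then agent id, covering whichever
            # sub-envs have results ready; off_policy_actions is empty here.
            obs, rewards, dones, infos = self.sim.collect_ready()  # hypothetical call
            off_policy_actions = {}
            return obs, rewards, dones, infos, off_policy_actions

        def send_actions(self, action_dict):
            # action_dict mirrors poll()'s keying: {env_id: {agent_id: action}}.
            self.sim.submit(action_dict)  # hypothetical call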
Under the hood, all other envs are converted to AsyncVectorEnv by RLlib so that there is a common internal path for policy evaluation.