Each rollout contains only **full** episodes (from reset to terminal state), never any episode fragments. A rollout therefore consists of one or more complete episodes.
The ``rollout_fragment_length`` setting defines the minimum number of
timesteps that will be covered in the rollout.
For example, if ``rollout_fragment_length=100`` and your episodes are always 98 timesteps long, then rollouts will happen over two complete episodes and always be 196 timesteps long: 98 < 100 -> too short, keep rollout going; 98+98 >= 100 -> good, stop rollout after 2 episodes (196 timesteps).
Note that you have to be careful when choosing ``complete_episodes`` as batch_mode: If your environment does not terminate easily, this setting could lead to enormous batch sizes.
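As a rough configuration sketch (assuming the ``PPOTrainer`` from ``ray.rllib.agents.ppo``; ``batch_mode`` and ``rollout_fragment_length`` are common config keys shared by RLlib Trainers), the behavior described above could be requested like this:

.. code-block:: python

    from ray.rllib.agents.ppo import PPOTrainer

    trainer = PPOTrainer(
        env="CartPole-v0",
        config={
            # Only collect full episodes per rollout ...
            "batch_mode": "complete_episodes",
            # ... and keep appending episodes until at least 100 timesteps
            # have been collected.
            "rollout_fragment_length": 100,
        })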
**multiagent.count_steps_by [env_steps|agent_steps]**:
Within the Trainer's ``multiagent`` config dict, you can set the unit by which RLlib will count a) rollout fragment lengths as well as b) the size of the final train batch (see below). The two supported values are:
*env_steps (default)*:
Each call to ``[Env].step()`` is counted as one. It does not
matter how many individual agents step simultaneously in this call
(not all existing agents in the environment may step at the same time).
*agent_steps*:
In a multi-agent environment, count each individual agent's step
as one. For example, if N agents are in an environment and all these N agents
always step at the same time, a single env step corresponds to N agent steps.
Note that in the single-agent case, ``env_steps`` and ``agent_steps`` are the same thing.
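A minimal sketch of switching this unit to ``agent_steps`` (only the ``count_steps_by`` key is the subject here; the rest of the multiagent setup is omitted):

.. code-block:: python

    config = {
        "rollout_fragment_length": 100,
        "multiagent": {
            # Count every individual agent step as one timestep, so a rollout
            # fragment is complete once the *summed* agent steps reach 100.
            "count_steps_by": "agent_steps",  # default: "env_steps"
            # (your usual "policies" / "policy_mapping_fn" entries go here)
        },
    }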
**horizon [int]**:
Some environments are limited by default in the number of maximum timesteps
an episode can last. This limit is called the "horizon" of an episode.
For example, for CartPole-v0, the maximum number of steps per episode is 200 by default.
You can overwrite this setting, however, by using the ``horizon`` config.
If provided, RLlib will first try to increase the environment's built-in horizon
setting (e.g. OpenAI Gym envs have a ``spec.max_episode_steps`` property) in case the
user-provided horizon is larger than this env-specific setting. In either case, no episode
is allowed to exceed the given ``horizon`` number of timesteps (RLlib will
artificially terminate an episode if this limit is hit).
**soft_horizon [bool]**:
False by default. If set to True, a) the environment will not be reset when
``horizon`` is reached and b) no ``done=True`` will be set
in the trajectory data sent to the postprocessors and to training (``done`` will remain
False at the horizon).
**no_done_at_end [bool]**:
Never set ``done=True`` at the end of an episode or when any
artificial horizon is reached.
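Put together, a hedged sketch of these episode-length related settings (CartPole-v0 with its built-in 200-step limit is used as the example env):

.. code-block:: python

    import gym

    # CartPole-v0 ships with a built-in 200-step limit.
    assert gym.make("CartPole-v0").spec.max_episode_steps == 200

    config = {
        # Let episodes run for up to 500 timesteps instead.
        "horizon": 500,
        # Do not reset the env at the horizon and keep done=False in the
        # collected data (episodes appear "infinite" to the learner).
        "soft_horizon": True,
        # If True, done=True would never be set, not even at a real
        # terminal state.
        "no_done_at_end": False,
    }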
To trigger a single rollout, RLlib calls ``RolloutWorker.sample()``, which returns
a SampleBatch or MultiAgentBatch object representing all the data collected during that
rollout. These batches are then usually further concatenated (from the ``num_workers``
parallelized RolloutWorkers) to form a final train batch. The size of that train batch is determined
by the ``train_batch_size`` config parameter. Train batches are usually sent to the Policy's
``learn_on_batch`` method, which handles loss and gradient calculations, as well as the optimizer step.
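The following simplified sketch shows roughly what happens during one such sampling-plus-training iteration. The ``workers`` list and the direct ``policy`` handle are hypothetical stand-ins (real Trainers drive this via their execution logic); ``RolloutWorker.sample()`` and ``Policy.learn_on_batch()`` are the calls mentioned above, while ``SampleBatch.concat_samples()`` is assumed here as the concatenation helper:

.. code-block:: python

    from ray.rllib.policy.sample_batch import SampleBatch

    def sample_and_train_once(workers, policy, train_batch_size):
        """One (simplified) sampling + training iteration.

        `workers`: list of RolloutWorker objects (hypothetical handle).
        `policy`: the Policy to be updated (hypothetical handle).
        """
        batches, collected, i = [], 0, 0
        while collected < train_batch_size:
            # Each call triggers one rollout and returns a SampleBatch
            # (or MultiAgentBatch) with the collected data.
            batch = workers[i % len(workers)].sample()
            batches.append(batch)
            collected += batch.count  # number of timesteps in this rollout
            i += 1
        # Concatenate all rollouts into the final train batch.
        train_batch = SampleBatch.concat_samples(batches)
        # Loss and gradient calculations plus one optimizer step.
        return policy.learn_on_batch(train_batch)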
RLlib's default ``SampleCollector`` class is the ``SimpleListCollector``, which appends single timestep data (e.g. actions)
to lists, then builds SampleBatches from these and sends them to the downstream processing functions.
It thereby tries to avoid collecting duplicate data separately (OBS and NEXT_OBS use the same underlying list).
If you want to implement your own collection logic and data structures, you can subclass ``SampleCollector``
and specify that new class under the Trainer's ``sample_collector`` config key.
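A hedged sketch of plugging in a custom collector (inheriting from the default ``SimpleListCollector`` so all standard behavior is kept; the shown import path may differ slightly between RLlib versions):

.. code-block:: python

    from ray.rllib.evaluation.collectors.simple_list_collector import \
        SimpleListCollector

    class MySampleCollector(SimpleListCollector):
        """Custom collection logic would go here, by overriding the
        SampleCollector methods relevant to your use case."""
        pass

    config = {
        # Tell the Trainer to build rollouts with the custom collector class.
        "sample_collector": MySampleCollector,
    }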
Let's now look at how the Policy's Model lets the RolloutWorker and its SampleCollector
know what data in the ongoing episode/trajectory to use for the different required method calls
during rollouts. In particular, these method calls are:
``Policy.compute_actions_from_input_dict()`` to compute actions to be taken in an episode;
``Policy.postprocess_trajectory()``, which is called after an episode ends or a rollout hits its
``rollout_fragment_length`` limit (in ``batch_mode=truncated_episodes``); and ``Policy.learn_on_batch()``,
which is called with a "train_batch" to improve the policy.
Trajectory View API
-------------------
The trajectory view API allows custom models to define what parts of the trajectory they
require in order to execute the forward pass. For example, in the simplest case, a model might
only look at the latest observation. However, an RNN- or attention-based model could look
at previous states emitted by the model, concatenate previously seen rewards with the current observation,
or require the entire range of the n most recent observations.
The trajectory view API lets models define these requirements and lets RLlib gather the required
data for the forward pass in an efficient way.
Since the following methods all call into the model class, they are all indirectly using the trajectory view API.
It is important to note that the API is only accessible to the user via the model classes
(see below on how to set up trajectory view requirements for a custom model).
In particular, the methods receiving inputs that depend on a Model's trajectory view rules are:
a) ``Policy.compute_actions_from_input_dict()``
b) ``Policy.postprocess_trajectory()`` and
c) ``Policy.learn_on_batch()`` (and, subsequently, the Policy's loss function).
The input data to these methods can stem from either the environment (observations, rewards, and env infos),
the model itself (previously computed actions, internal state outputs, action-probs, etc.),
or the Sampler (e.g. agent index, env ID, episode ID, timestep, etc.).
All data has an associated time axis, which is 0-based, meaning that the first action taken, the
first reward received in an episode, and the first observation (directly after a reset)
all have t=0.
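To make this time-axis convention concrete, here is a tiny, hand-built ``SampleBatch`` (all values are made up) in which index 0 of every column refers to t=0:

.. code-block:: python

    from ray.rllib.policy.sample_batch import SampleBatch

    # Index 0 in each column corresponds to t=0: the first (post-reset)
    # observation, the first action taken, and the first reward received.
    batch = SampleBatch({
        SampleBatch.OBS: [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # o_0, o_1, o_2
        SampleBatch.ACTIONS: [0, 1, 0],                         # a_0, a_1, a_2
        SampleBatch.REWARDS: [1.0, 1.0, 1.0],                   # r_0, r_1, r_2
        SampleBatch.DONES: [False, False, True],
    })
    print(batch.count)  # -> 3 timesteps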
The idea is to allow more flexibility and standardization in how a model defines required
"views" on the ongoing trajectory (during action computations/inference), past episodes (training
on a batch), or even trajectories of other agents in the same episode, some of which
may even use a different policy.
Such a "view requirements" formalism is helpful when having to support more complex model
setups like RNNs, attention nets, observation image framestacking (e.g. for Atari),
and building multi-agent communication channels.
The way to define the set of rules that determine which data the Model sees is through a
"view requirements dict", residing in the ``Policy.model.view_requirements``
property.
View requirements dicts map strings (column names), such as "obs" or "actions", to
a ``ViewRequirement`` object, which defines the exact conditions by which this column
should be populated with data.
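For illustration, here is a hand-written (hypothetical) excerpt of such a dict. The ``ViewRequirement`` constructor arguments shown (``data_col``, ``shift``, ``space``) are the most commonly used ones, and a shift string like ``"-3:0"`` requests a range of timesteps (here: the last four observations, e.g. for frame-stacking):

.. code-block:: python

    from gym.spaces import Box
    from ray.rllib.policy.view_requirement import ViewRequirement

    obs_space = Box(-1.0, 1.0, shape=(4,))

    view_requirements = {
        # The plain, current observation at each timestep t.
        "obs": ViewRequirement(space=obs_space),
        # The previous action (shift=-1: look one timestep back).
        "prev_actions": ViewRequirement(data_col="actions", shift=-1),
        # The last four observations (t-3 through t), e.g. for frame-stacking.
        "prev_n_obs": ViewRequirement(
            data_col="obs", shift="-3:0", space=obs_space),
    }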
View Requirement Dictionaries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
View requirements are stored within the ``view_requirements`` property of the ``ModelV2``