.. TODO: Deprecate this page (moved here from the old index (rllib in 60s) page).

.. _rllib-core-concepts:

RLlib Core Concepts
===================

In this section, we'll cover three key concepts in RLlib: Policies, Samples, and Trainers.

Policies
--------

`Policies `__ are a core concept in RLlib. In a nutshell, policies are
Python classes that define how an agent acts in an environment.
`Rollout workers `__ query the policy to determine agent actions.
In a `gym `__ environment, there is a single agent and policy.
In `vector envs `__, policy inference is for multiple agents at once,
and in `multi-agent `__, there may be multiple policies,
each controlling one or more agents:

.. image:: ../multi-flat.svg

Policies can be implemented using `any framework `__.
However, for TensorFlow and PyTorch, RLlib has
`build_tf_policy `__ and `build_torch_policy `__ helper functions
that let you define a trainable policy with a functional-style API, for example:

.. code-block:: python

    import tensorflow as tf
    from ray.rllib.policy.tf_policy_template import build_tf_policy


    def policy_gradient_loss(policy, model, dist_class, train_batch):
        # Compute action logits for the batch and wrap them in an action distribution.
        logits, _ = model.from_batch(train_batch)
        action_dist = dist_class(logits, model)
        # Vanilla policy gradient loss: negative log-prob of the taken actions,
        # weighted by the observed rewards.
        return -tf.reduce_mean(
            action_dist.logp(train_batch["actions"]) * train_batch["rewards"])


    MyTFPolicy = build_tf_policy(
        name="MyTFPolicy",
        loss_fn=policy_gradient_loss)

Sample Batches
--------------

Whether running in a single process or a `large cluster `__, all data interchange in
RLlib is in the form of `sample batches `__. Sample batches encode one or more fragments
of a trajectory. Typically, RLlib collects batches of size ``rollout_fragment_length`` from
rollout workers, and concatenates one or more of these batches into a batch of size
``train_batch_size`` that is the input to SGD.

A typical sample batch looks something like the following when summarized. Since all values
are kept in arrays, this allows for efficient encoding and transmission across the network:

.. code-block:: python

    { 'action_logp': np.ndarray((200,), dtype=float32, min=-0.701, max=-0.685, mean=-0.694),
      'actions': np.ndarray((200,), dtype=int64, min=0.0, max=1.0, mean=0.495),
      'dones': np.ndarray((200,), dtype=bool, min=0.0, max=1.0, mean=0.055),
      'infos': np.ndarray((200,), dtype=object, head={}),
      'new_obs': np.ndarray((200, 4), dtype=float32, min=-2.46, max=2.259, mean=0.018),
      'obs': np.ndarray((200, 4), dtype=float32, min=-2.46, max=2.259, mean=0.016),
      'rewards': np.ndarray((200,), dtype=float32, min=1.0, max=1.0, mean=1.0),
      't': np.ndarray((200,), dtype=int64, min=0.0, max=34.0, mean=9.14)}

In `multi-agent mode `__, sample batches are collected separately for each individual policy.

Training
--------

Policies each define a ``learn_on_batch()`` method that improves the policy given a sample
batch of input. For TF and Torch policies, this is implemented using a `loss function` that
takes as input sample batch tensors and outputs a scalar loss. Here are a few example loss
functions:

- Simple `policy gradient loss `__
- Simple `Q-function loss `__
- Importance-weighted `APPO surrogate loss `__

RLlib `Trainer classes `__ coordinate the distributed workflow of running rollouts and
optimizing policies. Trainer classes leverage parallel iterators to implement the desired
computation pattern. The following figure shows *synchronous sampling*, the simplest of
`these patterns `__:

.. figure:: ../a2c-arch.svg

    Synchronous Sampling (e.g., A2C, PG, PPO)

RLlib uses `Ray actors `__ to scale training from a single core to many thousands of cores in a cluster.
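In practice, putting these pieces together usually means constructing a Trainer and calling
``train()`` in a loop. The following is a minimal sketch, assuming the ``PPOTrainer`` class,
the ``CartPole-v0`` environment, and arbitrary config values (none of which this page
prescribes):

.. code-block:: python

    import ray
    from ray.rllib.agents.ppo import PPOTrainer

    ray.init()

    # Each rollout worker runs as a Ray actor and collects sample batches of size
    # ``rollout_fragment_length``; the Trainer concatenates them into
    # ``train_batch_size``-sized batches for SGD.
    trainer = PPOTrainer(
        env="CartPole-v0",
        config={
            "num_workers": 2,
            "rollout_fragment_length": 200,
            "train_batch_size": 4000,
        })

    for _ in range(3):
        result = trainer.train()  # one round of sampling + policy optimization
        print(result["episode_reward_mean"])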
You can `configure the parallelism `__ used for training by changing the ``num_workers`` parameter. Check out our `scaling guide `__ for more details.
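For instance, sticking with the hypothetical PPO setup sketched above, one common way to
scale the same experiment out is to launch it through Ray Tune and simply raise
``num_workers`` (the stopping criterion and worker count below are arbitrary):

.. code-block:: python

    from ray import tune

    # Sketch: same algorithm and environment as above, but with many more rollout
    # worker actors, so sampling is spread across the CPUs available to the cluster.
    tune.run(
        "PPO",
        stop={"episode_reward_mean": 150},
        config={
            "env": "CartPole-v0",
            "num_workers": 32,  # scale sampling parallelism up or down here
        },
    )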