# Offline reinforcement learning with Ray AIR
In this example, we'll train a reinforcement learning agent using offline training.

Offline training means that the data from the environment (and the actions performed by the agent) have been stored on disk. In contrast, online training samples experiences live by interacting with the environment.
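
To make the distinction concrete, here is a minimal, dependency-free sketch. The `CoinFlipEnv` class and all helper names are hypothetical and are not part of Ray or RLlib. The first function gathers experiences by stepping a live environment, and the last one only reads stored experiences back from disk.

```python
import json
import random
import tempfile
from pathlib import Path


class CoinFlipEnv:
    """Hypothetical toy environment: guess the next coin flip; episodes last 5 steps."""

    def reset(self):
        self.t = 0
        return 0  # dummy initial observation

    def step(self, action):
        coin = random.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        return coin, reward, self.t >= 5  # observation, reward, done


def collect_online(env, num_episodes):
    """Online: every transition comes from interacting with the live environment."""
    transitions = []
    for _ in range(num_episodes):
        obs, done = env.reset(), False
        while not done:
            action = random.randint(0, 1)  # random behavior policy
            next_obs, reward, done = env.step(action)
            transitions.append(
                {"obs": obs, "action": action, "reward": reward, "done": done}
            )
            obs = next_obs
    return transitions


def save_offline(transitions, path):
    """Store one JSON record per line."""
    Path(path).write_text("\n".join(json.dumps(t) for t in transitions))


def load_offline(path):
    """Offline: transitions are read back from disk; no environment is needed."""
    return [json.loads(line) for line in Path(path).read_text().splitlines()]


with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "experiences.jsonl"
    online = collect_online(CoinFlipEnv(), num_episodes=3)
    save_offline(online, log_file)
    offline = load_offline(log_file)

print(f"Loaded {len(offline)} stored transitions without touching the environment")
```

An offline learner only ever sees what `load_offline` returns, so it cannot try out actions that the behavior policy never logged.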

Let's start with installing our dependencies:

In [1]:
!pip install -qU "ray[rllib]" gym

Now we can run some imports:

In [2]:
import gym
import numpy as np
import ray
from ray.air import Checkpoint
from ray.air.config import RunConfig, ScalingConfig
from ray.air.result import Result
from ray.rllib.agents.marwil import BCTrainer
from ray.train.rl.rl_predictor import RLPredictor
from ray.train.rl.rl_trainer import RLTrainer
from ray.tune.tuner import Tuner



We will be training on offline data. This means we have full agent trajectories stored somewhere on disk and want to train on these past experiences.

Usually, this data would come from external systems or a database of historical data. For this example, we'll generate some offline data ourselves and store it using RLlib's `output_config`.

In [3]:
def generate_offline_data(path: str):
    print(f"Generating offline data for training at {path}")
    trainer = RLTrainer(
        algorithm="PPO",
        run_config=RunConfig(stop={"timesteps_total": 5000}),
        config={
            "env": "CartPole-v0",
            "output": "dataset",
            "output_config": {
                "format": "json",
                "path": path,
                "max_num_samples_per_file": 1,
            },
            "batch_mode": "complete_episodes",
        },
    )
    trainer.fit()
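
After generating data, it can help to check what was written before training on it. The helper below is a sketch under two assumptions: that the writer produces `*.json` files directly under `path`, and that each non-empty line holds one JSON object. The name `summarize_json_dir` is hypothetical, and the records in the demo are made-up stand-ins because the real column names depend on the RLlib version.

```python
import json
import tempfile
from pathlib import Path


def summarize_json_dir(path):
    """Count files and records under `path` and collect the top-level keys seen."""
    num_files, num_records, keys = 0, 0, set()
    for file in sorted(Path(path).glob("*.json")):
        num_files += 1
        with open(file) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)  # assumes one JSON object per line
                num_records += 1
                keys.update(record.keys())
    return num_files, num_records, sorted(keys)


with tempfile.TemporaryDirectory() as tmp:
    # Made-up stand-in records, only used to exercise the helper.
    Path(tmp, "part-0.json").write_text(
        '{"obs": [0.1], "actions": 1}\n{"obs": [0.2], "actions": 0}\n'
    )
    Path(tmp, "part-1.json").write_text(
        '{"obs": [0.3], "actions": 1, "rewards": 1.0}\n'
    )
    num_files, num_records, keys = summarize_json_dir(tmp)

print(num_files, num_records, keys)
```

On real data, the call would be `summarize_json_dir("/tmp/out")` once `generate_offline_data` has finished.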

Here we define the training function. It creates an `RLTrainer` using the behavioral cloning algorithm (`BCTrainer`) and trains it on the offline data stored in `path`. The `CartPole-v0` environment is only used to evaluate the policy after each training iteration.

In [4]:
def train_rl_bc_offline(path: str, num_workers: int, use_gpu: bool = False) -> Result:
    print("Starting offline training")
    dataset = ray.data.read_json(
        path, parallelism=num_workers, ray_remote_args={"num_cpus": 1}
    )

    trainer = RLTrainer(
        run_config=RunConfig(stop={"training_iteration": 5}),
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        datasets={"train": dataset},
        algorithm=BCTrainer,
        config={
            "env": "CartPole-v0",
            "framework": "tf",
            "evaluation_num_workers": 1,
            "evaluation_interval": 1,
            "evaluation_config": {"input": "sampler"},
        },
    )

    # Todo (krfricke/xwjiang): Enable checkpoint config in RunConfig
    # result = trainer.fit()
    tuner = Tuner(
        trainer,
        _tuner_kwargs={"checkpoint_at_end": True},
    )
    result = tuner.fit()[0]
    return result
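
`BCTrainer` performs behavioral cloning: the policy is fit with supervised learning to reproduce the actions recorded in the offline data. The sketch below illustrates that idea with a one-parameter logistic policy in plain Python. It is not RLlib's implementation; the dataset, the stand-in expert rule, and all names are hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical logged data: (observation, expert action) pairs.
# The stand-in expert pushes right (1) when the pole angle is positive.
dataset = []
for _ in range(200):
    angle = random.uniform(-1.0, 1.0)
    dataset.append((angle, 1 if angle > 0 else 0))


def prob_right(w, b, obs):
    """Policy: probability of choosing action 1 given the observation."""
    return 1.0 / (1.0 + math.exp(-(w * obs + b)))


# Gradient ascent on the average log-likelihood of the logged actions.
w, b, lr = 0.0, 0.0, 0.5
for _ in range(200):
    grad_w = grad_b = 0.0
    for obs, action in dataset:
        err = action - prob_right(w, b, obs)  # d(log-likelihood)/d(logit)
        grad_w += err * obs
        grad_b += err
    w += lr * grad_w / len(dataset)
    b += lr * grad_b / len(dataset)

# Fraction of logged steps where the greedy cloned action equals the expert's.
accuracy = sum(
    (prob_right(w, b, obs) > 0.5) == bool(action) for obs, action in dataset
) / len(dataset)
print(f"Cloned policy matches the expert on {accuracy:.0%} of logged steps")
```

Note that the loop never uses rewards. A cloned policy can therefore only be as good as the policy that produced the data.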

Once we have trained our RL policy, we want to evaluate it on a fresh environment. For this, we define a utility function:

In [5]:
def evaluate_using_checkpoint(checkpoint: Checkpoint, num_episodes: int) -> list:
    """Roll out the checkpointed policy and return the total reward of each episode."""
    predictor = RLPredictor.from_checkpoint(checkpoint)

    env = gym.make("CartPole-v0")

    rewards = []
    for _ in range(num_episodes):
        obs = env.reset()
        reward = 0.0
        done = False
        while not done:
            # The predictor expects a batch, so wrap the single observation.
            action = predictor.predict(np.array([obs]))
            obs, r, done, _ = env.step(action[0])
            reward += r
        rewards.append(reward)

    return rewards
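
A list of per-episode rewards is easier to judge with a few summary statistics, especially when only a handful of episodes are run. The helper below is a small sketch; `summarize_rewards` is a hypothetical name and not part of Ray. The sample values are the ten evaluation episode rewards from the first training result printed later in this notebook.

```python
import statistics


def summarize_rewards(rewards):
    """Return basic statistics for a list of per-episode total rewards."""
    return {
        "episodes": len(rewards),
        "mean": statistics.mean(rewards),
        "min": min(rewards),
        "max": max(rewards),
        # The sample standard deviation needs at least two episodes.
        "stdev": statistics.stdev(rewards) if len(rewards) > 1 else 0.0,
    }


# Evaluation episode rewards copied from the training log in this notebook.
logged_rewards = [30.0, 10.0, 18.0, 54.0, 31.0, 14.0, 18.0, 16.0, 11.0, 23.0]
summary = summarize_rewards(logged_rewards)
print(summary)
```

The spread between the minimum and maximum shows how much single episodes vary, so an average over very few episodes should be read with caution.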

Let's put it all together. First, we initialize Ray and create the offline data:

In [6]:
ray.init(num_cpus=8)

path = "/tmp/out"
generate_offline_data(path)

2022-05-20 11:57:39,477	INFO services.py:1483 -- View the Ray dashboard at http://127.0.0.1:8265


Generating offline data for training at /tmp/out


| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| AIRPPOTrainer_ab506_00000 | TERMINATED | 127.0.0.1:28838 | 2 | 11.5833 | 8665 | 46.31 | 147 | 11 | 46.31 |


[2m[33m(raylet)[0m 2022-05-20 11:57:42,730	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63962 --object-store-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=64061 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:64346 --redis-password=5241590000000000 --startup-token=8 --runtime-env-hash=-2010331134
[2m[36m(AIRPPOTrainer pid=28838)[0m 2022-05-20 11:57:51,947	INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution spe

Write Progress:   0%|          | 0/1 [00:01<?, ?it/s]
Write Progress:   0%|          | 0/1 [00:02<?, ?it/s]
Write Progress: 100%|██████████| 1/1 [00:02<00:00,  2.11s/it]
Repartition: 100%|██████████| 1/1 [00:00<00:00, 149.48it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 113.58it/s]
Repartition: 100%|██████████| 1/1 [00:00<00:00, 148.52it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 227.01it/s]
Repartition: 100%|██████████| 1/1 [00:00<00:00, 194.43it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 263.51it/s]
Repartition: 100%|██████████|

Result for AIRPPOTrainer_ab506_00000:
  agent_timesteps_total: 4305
  counters:
    num_agent_steps_sampled: 4305
    num_agent_steps_trained: 4305
    num_env_steps_sampled: 4305
    num_env_steps_trained: 4305
  custom_metrics: {}
  date: 2022-05-20_11-58-09
  done: false
  episode_len_mean: 21.633165829145728
  episode_media: {}
  episode_reward_max: 83.0
  episode_reward_mean: 21.633165829145728
  episode_reward_min: 9.0
  episodes_this_iter: 199
  episodes_total: 199
  experiment_id: d6ab9eba2e4e488384aa2e958fab71c8
  hostname: Kais-MacBook-Pro.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6652079820632935
          entropy_coeff: 0.0
          kl: 0.027841072529554367
          model: {}
          policy_loss: -0.042915552854537964
          total_loss: 9.028203010559082
          vf_explained_var: -0.05767782777547836
     

Repartition: 100%|██████████| 1/1 [00:00<00:00, 188.97it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 236.59it/s]
Repartition: 100%|██████████| 1/1 [00:00<00:00, 178.06it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 315.36it/s]
Repartition: 100%|██████████| 1/1 [00:00<00:00, 203.67it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 255.77it/s]
Repartition: 100%|██████████| 1/1 [00:00<00:00, 207.51it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 185.77it/s]
Repartition: 100%|██████████| 1/1 [00:00<00:00, 177.55it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 277.47it/s]
Repartition: 100%|██████████| 1/1 [00:00<00:00, 202.14it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 242.84it/s]
Repartition: 100%|██████████| 1/1 [00:00<00:00, 193.57it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 246.67it/s]
Repartition: 100%|██████████| 1/1 [00:00<00:00, 201.46it/s]
Write Progress: 100%|██████████| 1/1 [00:00<00:00, 281.16it/s]
Repartition: 100

Result for AIRPPOTrainer_ab506_00000:
  agent_timesteps_total: 8665
  counters:
    num_agent_steps_sampled: 8665
    num_agent_steps_trained: 8665
    num_env_steps_sampled: 8665
    num_env_steps_trained: 8665
  custom_metrics: {}
  date: 2022-05-20_11-58-13
  done: true
  episode_len_mean: 46.31
  episode_media: {}
  episode_reward_max: 147.0
  episode_reward_mean: 46.31
  episode_reward_min: 11.0
  episodes_this_iter: 88
  episodes_total: 287
  experiment_id: d6ab9eba2e4e488384aa2e958fab71c8
  hostname: Kais-MacBook-Pro.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 0.6104190349578857
          entropy_coeff: 0.0
          kl: 0.015321698971092701
          model: {}
          policy_loss: -0.025790905579924583
          total_loss: 9.480770111083984
          vf_explained_var: -0.029562775045633316
          vf_loss: 9.501963615

2022-05-20 11:58:13,583	INFO tune.py:753 -- Total run time: 32.49 seconds (31.86 seconds for the tuning loop).


Then, we run training:

In [7]:
result = train_rl_bc_offline(path=path, num_workers=2, use_gpu=False)

Starting offline training


| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| AIRBCTrainer_bef2c_00000 | TERMINATED | 127.0.0.1:28876 | 5 | 9.28 | 2297 | | | | |


[2m[33m(raylet)[0m 2022-05-20 11:58:14,957	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63962 --object-store-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=64061 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:64346 --redis-password=5241590000000000 --startup-token=15 --runtime-env-hash=-2010331134
[2m[36m(AIRBCTrainer pid=28876)[0m 2022-05-20 11:58:21,973	INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution spe

[2m[36m(RolloutWorker pid=28883)[0m DatasetReader  2  has  57  samples.
[2m[36m(RolloutWorker pid=28882)[0m DatasetReader  1  has  57  samples.


Stage 0:   0%|          | 0/1 [00:00<?, ?it/s]
[2m[33m(raylet)[0m 2022-05-20 11:58:31,224	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63962 --object-store-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-20_11-57-36_849562_28764/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=64061 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:64346 --redis-password=5241590000000000 --startup-token=18 --runtime-env-hash=-2010331134


Result for AIRBCTrainer_bef2c_00000:
  agent_timesteps_total: 445
  counters:
    num_agent_steps_sampled: 445
    num_agent_steps_trained: 2000
    num_env_steps_sampled: 445
    num_env_steps_trained: 2000
  custom_metrics: {}
  date: 2022-05-20_11-58-38
  done: false
  episode_len_mean: .nan
  episode_media: {}
  episode_reward_max: .nan
  episode_reward_mean: .nan
  episode_reward_min: .nan
  episodes_this_iter: 0
  episodes_total: 0
  evaluation:
    custom_metrics: {}
    episode_len_mean: 22.5
    episode_media: {}
    episode_reward_max: 54.0
    episode_reward_mean: 22.5
    episode_reward_min: 10.0
    episodes_this_iter: 10
    hist_stats:
      episode_lengths:
      - 30
      - 10
      - 18
      - 54
      - 31
      - 14
      - 18
      - 16
      - 11
      - 23
      episode_reward:
      - 30.0
      - 10.0
      - 18.0
      - 54.0
      - 31.0
      - 14.0
      - 18.0
      - 16.0
      - 11.0
      - 23.0
    off_policy_estimator: {}
    policy_reward_max: {}
 

2022-05-20 11:58:40,413	INFO tune.py:753 -- Total run time: 26.38 seconds (25.84 seconds for the tuning loop).
Read progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11.78it/s]


And then, using the obtained checkpoint, we evaluate the policy on a fresh environment:

In [8]:
num_eval_episodes = 3

rewards = evaluate_using_checkpoint(result.checkpoint, num_episodes=num_eval_episodes)
print(f"Average reward over {num_eval_episodes} episodes: {np.mean(rewards)}")

2022-05-20 11:58:40,636	INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
2022-05-20 11:58:40,638	INFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
Read: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.58it/s]
Repartition: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  6.84it/s]


[2m[36m(RolloutWorker pid=28906)[0m DatasetReader  1  has  57  samples.
[2m[36m(RolloutWorker pid=28907)[0m DatasetReader  2  has  57  samples.


2022-05-20 11:58:50,042	INFO trainable.py:589 -- Restored on 127.0.0.1 from checkpoint: /Users/kai/ray_results/AIRBCTrainer_2022-05-20_11-58-14/AIRBCTrainer_bef2c_00000_0_2022-05-20_11-58-14/checkpoint_000005/checkpoint-5
2022-05-20 11:58:50,043	INFO trainable.py:597 -- Current state after restoring: {'_iteration': 5, '_timesteps_total': None, '_time_total': 9.279996871948242, '_episodes_total': 0}


Average reward over 3 episodes: 41.333333333333336


