# Online reinforcement learning with Ray AIR
In this example, we'll train a reinforcement learning agent using online training.

Online training means that data is sampled from the environment while the algorithm is running. In contrast, offline training uses data that was collected earlier and stored on disk.

Let's start by installing our dependencies:
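
To make the distinction concrete, here is a minimal sketch that does not use Ray or gym at all. `ToyEnv`, `collect_online`, and `load_offline` are hypothetical names invented for this illustration: the online path queries the environment step by step, while the offline path only replays transitions that were recorded earlier.

```python
import random
from typing import Callable, List, Tuple

Transition = Tuple[int, int, float]  # (state, action, reward)


class ToyEnv:
    """Hypothetical 1-D walk: reward 1.0 for moving right, 0.0 otherwise."""

    def __init__(self) -> None:
        self.state = 0

    def step(self, action: int) -> Tuple[int, float]:
        self.state += 1 if action == 1 else -1
        return self.state, float(action == 1)


def collect_online(
    env: ToyEnv, policy: Callable[[int], int], steps: int
) -> List[Transition]:
    """Online: interact with the environment while the algorithm runs."""
    batch = []
    state = env.state
    for _ in range(steps):
        action = policy(state)
        next_state, reward = env.step(action)
        batch.append((state, action, reward))
        state = next_state
    return batch


def load_offline(stored: List[Transition]) -> List[Transition]:
    """Offline: reuse previously recorded transitions, no environment access."""
    return list(stored)


rng = random.Random(0)
online_batch = collect_online(ToyEnv(), lambda s: rng.randint(0, 1), steps=5)
# Stands in for a dataset that would normally be read from disk.
offline_batch = load_offline(online_batch)
```

In the rest of this example, RLlib takes care of the online sampling for us, so we never write such a loop by hand during training.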

In [1]:
!pip install -qU "ray[rllib]" gym

Now we can run some imports:

In [2]:
import gym
import numpy as np
import ray

from ray.air import Checkpoint
from ray.air.config import RunConfig, ScalingConfig
from ray.air.result import Result
from ray.train.rl.rl_predictor import RLPredictor
from ray.train.rl.rl_trainer import RLTrainer
from ray.tune.tuner import Tuner



Here we define the training function. It will create an `RLTrainer` using the `PPO` algorithm and kick off training on the `CartPole-v0` environment:

In [3]:
def train_rl_ppo_online(num_workers: int, use_gpu: bool = False) -> Result:
    print("Starting online training")
    trainer = RLTrainer(
        run_config=RunConfig(stop={"training_iteration": 5}),
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
        algorithm="PPO",
        config={
            "env": "CartPole-v0",
            "framework": "tf",
        },
    )
    # Todo (krfricke/xwjiang): Enable checkpoint config in RunConfig
    # result = trainer.fit()
    tuner = Tuner(
        trainer,
        _tuner_kwargs={"checkpoint_at_end": True},
    )
    result = tuner.fit()[0]
    return result

Once we have trained our RL policy, we want to evaluate it on a fresh environment. For this, we define a utility function:

In [4]:
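def evaluate_using_checkpoint(checkpoint: Checkpoint, num_episodes: int) -> list:
    """Run the checkpointed policy and return the total reward of each episode."""
    # Restore the trained policy from the checkpoint.
    predictor = RLPredictor.from_checkpoint(checkpoint)

    env = gym.make("CartPole-v0")

    rewards = []
    for _ in range(num_episodes):
        obs = env.reset()
        reward = 0.0
        done = False
        while not done:
            # The predictor works on batches, so wrap the single observation.
            action = predictor.predict(np.array([obs]))
            obs, r, done, _ = env.step(action[0])
            reward += r
        rewards.append(reward)

    return rewards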
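
The evaluation loop above relies on the older gym API, in which `env.reset()` returns only the observation and `env.step()` returns four values. From gym 0.26 on, `reset()` returns `(obs, info)` and `step()` returns `(obs, reward, terminated, truncated, info)`. If you run this with a newer gym, a small compatibility layer along the following lines may help. This is only a sketch: `unpack_reset` and `unpack_step` are hypothetical helper names, not part of gym or Ray.

```python
from typing import Any, Tuple


def unpack_reset(reset_result: Any) -> Any:
    """Return only the observation from the result of env.reset()."""
    # Newer API: a 2-tuple (obs, info) where info is a dict.
    # Assumption: an old-style observation is never itself such a tuple.
    if (
        isinstance(reset_result, tuple)
        and len(reset_result) == 2
        and isinstance(reset_result[1], dict)
    ):
        return reset_result[0]
    # Older API: the observation itself.
    return reset_result


def unpack_step(step_result: Tuple) -> Tuple[Any, float, bool]:
    """Return (obs, reward, done) from the result of env.step()."""
    if len(step_result) == 5:
        # Newer API: the episode ends when it is terminated or truncated.
        obs, reward, terminated, truncated, _info = step_result
        return obs, float(reward), bool(terminated or truncated)
    # Older API: a single done flag.
    obs, reward, done, _info = step_result
    return obs, float(reward), bool(done)
```

Inside the loop, this would be used as `obs = unpack_reset(env.reset())` and `obs, r, done = unpack_step(env.step(action[0]))`.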

Let's put it all together. First, we run training:

In [5]:
result = train_rl_ppo_online(num_workers=2, use_gpu=False)



Starting online training


2022-05-19 13:54:19,326	INFO services.py:1483 -- View the Ray dashboard at http://127.0.0.1:8267


| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| AIRPPOTrainer_cd8d6_00000 | TERMINATED | 127.0.0.1:14174 | 5 | 16.7029 | 20000 | 124.79 | 200 | 9 | 124.79 |


[2m[33m(raylet)[0m 2022-05-19 13:54:23,061	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63729 --object-store-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63909 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65260 --redis-password=5241590000000000 --startup-token=16 --runtime-env-hash=-2010331134
[2m[36m(AIRPPOTrainer pid=14174)[0m 2022-05-19 13:54:30,749	INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution sp

Result for AIRPPOTrainer_cd8d6_00000:
  agent_timesteps_total: 4000
  counters:
    num_agent_steps_sampled: 4000
    num_agent_steps_trained: 4000
    num_env_steps_sampled: 4000
    num_env_steps_trained: 4000
  custom_metrics: {}
  date: 2022-05-19_13-54-44
  done: false
  episode_len_mean: 22.11731843575419
  episode_media: {}
  episode_reward_max: 87.0
  episode_reward_mean: 22.11731843575419
  episode_reward_min: 8.0
  episodes_this_iter: 179
  episodes_total: 179
  experiment_id: 158c57d8b6e142ad85b393db57c8bdff
  hostname: Kais-MacBook-Pro.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.20000000298023224
          cur_lr: 4.999999873689376e-05
          entropy: 0.6653298139572144
          entropy_coeff: 0.0
          kl: 0.02798665314912796
          model: {}
          policy_loss: -0.0422092080116272
          total_loss: 8.986403465270996
          vf_explained_var: -0.06533512473106384
          

Result for AIRPPOTrainer_cd8d6_00000:
  agent_timesteps_total: 20000
  counters:
    num_agent_steps_sampled: 20000
    num_agent_steps_trained: 20000
    num_env_steps_sampled: 20000
    num_env_steps_trained: 20000
  custom_metrics: {}
  date: 2022-05-19_13-54-57
  done: true
  episode_len_mean: 124.79
  episode_media: {}
  episode_reward_max: 200.0
  episode_reward_mean: 124.79
  episode_reward_min: 9.0
  episodes_this_iter: 20
  episodes_total: 354
  experiment_id: 158c57d8b6e142ad85b393db57c8bdff
  hostname: Kais-MacBook-Pro.local
  info:
    learner:
      default_policy:
        custom_metrics: {}
        learner_stats:
          cur_kl_coeff: 0.30000001192092896
          cur_lr: 4.999999873689376e-05
          entropy: 0.5436986684799194
          entropy_coeff: 0.0
          kl: 0.0034858626313507557
          model: {}
          policy_loss: -0.012989979237318039
          total_loss: 9.49295425415039
          vf_explained_var: 0.025460055097937584
          vf_loss: 9.5048

2022-05-19 13:54:58,548	INFO tune.py:753 -- Total run time: 36.92 seconds (35.95 seconds for the tuning loop).
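
The per-iteration output above is verbose. If you only care about the headline numbers, a small helper like the following can condense them. This is a sketch: `summarize_metrics` is a hypothetical name, and it assumes the metrics are available as a dict with the same keys that appear in the log above (for example via `result.metrics`).

```python
from typing import Any, Mapping


def summarize_metrics(metrics: Mapping[str, Any]) -> str:
    """Condense the reward statistics of one result into a single line."""
    mean = metrics["episode_reward_mean"]
    low = metrics["episode_reward_min"]
    high = metrics["episode_reward_max"]
    episodes = metrics["episodes_total"]
    return (
        f"reward mean={mean:.2f} (min={low:.1f}, max={high:.1f}) "
        f"after {episodes} total episodes"
    )


# Values copied from the final result printed above.
final_metrics = {
    "episode_reward_mean": 124.79,
    "episode_reward_min": 9.0,
    "episode_reward_max": 200.0,
    "episodes_total": 354,
}
summary = summarize_metrics(final_metrics)
print(summary)
```

Comparing the first and the last result this way shows the mean episode reward rising from about 22 to about 125 over the five training iterations.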


Then, using the resulting checkpoint, we evaluate the policy on a fresh environment:

In [6]:
num_eval_episodes = 3

rewards = evaluate_using_checkpoint(result.checkpoint, num_episodes=num_eval_episodes)
print(f"Average reward over {num_eval_episodes} episodes: {np.mean(rewards)}")

2022-05-19 13:54:58,589	INFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
2022-05-19 13:54:58,591	INFO ppo.py:361 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
2022-05-19 13:54:58,591	INFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2022-05-19 13:55:08,021	INFO trainable.py:589 -- Restored on 127.0.0.1 from checkpoint: /Users/kai/ray_results/AIRPPOTrainer_2022-05-19_13-54-16/AIRPPOTrainer_cd8d6_00000_0_2022-05-19_13-54-22/checkpoint_000005/checkpoint-5
2022-05-19 13:55:08,021	INFO trainable.py:597 -- Current state after restoring: {'_iteration': 5, '_timesteps_total': N

Average reward over 3 episodes: 200.0
