{
"cells": [
{
"cell_type": "markdown",
"id": "57fe8246",
"metadata": {},
"source": [
"# Offline reinforcement learning with Ray AIR\n",
"In this example, we'll train a reinforcement learning agent using offline training.\n",
"\n",
"Offline training means that the data from the environment (and the actions performed by the agent) have been stored on disk. In contrast, online training samples experiences live by interacting with the environment."
]
},
{
"cell_type": "markdown",
"id": "edc8d8ac",
"metadata": {},
"source": [
"Let's start by installing our dependencies:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "0ef2e884",
"metadata": {},
"outputs": [],
"source": [
"!pip install -qU \"ray[rllib]\" gym"
]
},
{
"cell_type": "markdown",
"id": "503b1b55",
"metadata": {},
"source": [
"Now we can run some imports:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "db0a45ff",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-05-20 11:57:36,802\tWARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!\n",
"2022-05-20 11:57:36,815\tWARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.agents.marwil` has been deprecated. Use `ray.rllib.algorithms.marwil` instead. This will raise an error in the future!\n"
]
}
],
"source": [
"import gym\n",
"\n",
"import numpy as np\n",
"import ray\n",
"from ray.air import Checkpoint\n",
"from ray.air.config import RunConfig\n",
"from ray.air.result import Result\n",
"from ray.rllib.agents.marwil import BCTrainer\n",
"from ray.train.rl.rl_predictor import RLPredictor\n",
"from ray.train.rl.rl_trainer import RLTrainer\n",
"from ray.tune.tuner import Tuner"
]
},
{
"cell_type": "markdown",
"id": "184fe936",
"metadata": {},
"source": [
"We will be training on offline data - this means we have full agent trajectories stored somewhere on disk and want to train on these past experiences.\n",
"\n",
"In practice, this data would come from external systems or a database of historical data. For this example, however, we'll generate some offline data ourselves and store it using RLlib's `output_config`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "5aeed761",
"metadata": {},
"outputs": [],
"source": [
"def generate_offline_data(path: str):\n",
"    print(f\"Generating offline data for training at {path}\")\n",
"    trainer = RLTrainer(\n",
"        algorithm=\"PPO\",\n",
"        # Stop after 5000 sampled environment timesteps.\n",
"        run_config=RunConfig(stop={\"timesteps_total\": 5000}),\n",
"        config={\n",
"            \"env\": \"CartPole-v0\",\n",
"            # Write the sampled experiences to disk as JSON files.\n",
"            \"output\": \"dataset\",\n",
"            \"output_config\": {\n",
"                \"format\": \"json\",\n",
"                \"path\": path,\n",
"                \"max_num_samples_per_file\": 1,\n",
"            },\n",
"            # Only write out complete episodes, never fragments.\n",
"            \"batch_mode\": \"complete_episodes\",\n",
"        },\n",
"    )\n",
"    trainer.fit()"
]
},
{
"cell_type": "markdown",
"id": "8bca906c",
"metadata": {},
"source": [
"Here we define the training function. It creates an `RLTrainer` using the `BC` (behavior cloning) algorithm and trains it on the offline data read from `path`. The `CartPole-v0` environment is only used for periodic evaluation, which samples live episodes."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "f5071ce0",
"metadata": {},
"outputs": [],
"source": [
"def train_rl_bc_offline(path: str, num_workers: int, use_gpu: bool = False) -> Result:\n",
"    print(\"Starting offline training\")\n",
"    # Read the stored experiences back in as a distributed Ray Dataset.\n",
"    dataset = ray.data.read_json(\n",
"        path, parallelism=num_workers, ray_remote_args={\"num_cpus\": 1}\n",
"    )\n",
"\n",
"    trainer = RLTrainer(\n",
"        run_config=RunConfig(stop={\"training_iteration\": 5}),\n",
"        scaling_config={\n",
"            \"num_workers\": num_workers,\n",
"            \"use_gpu\": use_gpu,\n",
"        },\n",
"        datasets={\"train\": dataset},\n",
"        algorithm=BCTrainer,\n",
"        config={\n",
"            \"env\": \"CartPole-v0\",\n",
"            \"framework\": \"tf\",\n",
"            # Evaluate on a live environment, since the offline data\n",
"            # contains no rollouts from the policy being trained.\n",
"            \"evaluation_num_workers\": 1,\n",
"            \"evaluation_interval\": 1,\n",
"            \"evaluation_config\": {\"input\": \"sampler\"},\n",
"        },\n",
"    )\n",
"\n",
"    # Todo (krfricke/xwjiang): Enable checkpoint config in RunConfig\n",
"    # result = trainer.fit()\n",
"    tuner = Tuner(\n",
"        trainer,\n",
"        _tuner_kwargs={\"checkpoint_at_end\": True},\n",
"    )\n",
"    result = tuner.fit()[0]\n",
"    return result"
]
},
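{
"cell_type": "markdown",
"id": "1b2f9a0c",
"metadata": {},
"source": [
"As general background (independent of RLlib's implementation details): behavior cloning treats offline RL as supervised learning. It maximizes the log-likelihood of the dataset actions under the policy,\n",
"\n",
"$$\\max_{\\theta} \\; \\mathbb{E}_{(s, a) \\sim \\mathcal{D}} \\big[ \\log \\pi_{\\theta}(a \\mid s) \\big]$$\n",
"\n",
"so the learned policy can at best match the behavior that generated the data - here, the partially trained PPO policy from above."
]
},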
{
"cell_type": "markdown",
"id": "d935cdee",
"metadata": {},
"source": [
"Once we have trained our RL policy, we want to evaluate it on a fresh environment. For this, we define another utility function:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "2628f3b0",
"metadata": {},
"outputs": [],
"source": [
"def evaluate_using_checkpoint(checkpoint: Checkpoint, num_episodes: int) -> list:\n",
"    predictor = RLPredictor.from_checkpoint(checkpoint)\n",
"\n",
"    env = gym.make(\"CartPole-v0\")\n",
"\n",
"    rewards = []\n",
"    for i in range(num_episodes):\n",
"        obs = env.reset()\n",
"        reward = 0.0\n",
"        done = False\n",
"        while not done:\n",
"            # The predictor expects a batch of observations and\n",
"            # returns a batch of actions.\n",
"            action = predictor.predict([obs])\n",
"            obs, r, done, _ = env.step(action[0])\n",
"            reward += r\n",
"        rewards.append(reward)\n",
"\n",
"    return rewards"
]
},
{
"cell_type": "markdown",
"id": "84f4bebe",
"metadata": {},
"source": [
"Let's put it all together. First, we initialize Ray and create the offline data:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "cae1337e",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-05-20 11:57:39,477\tINFO services.py:1483 -- View the Ray dashboard at \u001B[1m\u001B[32mhttp://127.0.0.1:8265\u001B[39m\u001B[22m\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Generating offline data for training at /tmp/out\n"
]
},
{
"data": {
"text/html": [
"== Status ==<br>Current time: 2022-05-20 11:58:13 (running for 00:00:31.89)<br>Memory usage on this node: 10.0/16.0 GiB<br>Using FIFO scheduling algorithm.<br>Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/4.13 GiB heap, 0.0/2.0 GiB objects<br>Result logdir: /Users/kai/ray_results/AIRPPOTrainer_2022-05-20_11-57-41<br>Number of trials: 1/1 (1 TERMINATED)<br><table>\n",
"<thead>\n",
"<tr><th>Trial name </th><th>status </th><th>loc </th><th style=\"text-align: right;\"> iter</th><th style=\"text-align: right;\"> total time (s)</th><th style=\"text-align: right;\"> ts</th><th style=\"text-align: right;\"> reward</th><th style=\"text-align: right;\"> episode_reward_max</th><th style=\"text-align: right;\"> episode_reward_min</th><th style=\"text-align: right;\"> episode_len_mean</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>AIRPPOTrainer_ab506_00000</td><td>TERMINATED</td><td>127.0.0.1:28838</td><td style=\"text-align: right;\"> 2</td><td style=\"text-align: right;\"> 11.5833</td><td style=\"text-align: right;\">8665</td><td style=\"text-align: right;\"> 46.31</td><td style=\"text-align: right;\"> 147</td><td style=\"text-align: right;\"> 11</td><td style=\"text-align: right;\"> 46.31</td></tr>\n",
"</tbody>\n",
"</table><br><br>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-05-20 11:58:13,583\tINFO tune.py:753 -- Total run time: 32.49 seconds (31.86 seconds for the tuning loop).\n"
]
}
],
"source": [
"ray.init(num_cpus=8)\n",
"\n",
"path = \"/tmp/out\"\n",
"generate_offline_data(path)"
]
},
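{
"cell_type": "markdown",
"id": "2a3b4c5d",
"metadata": {},
"source": [
"Before training on this data, we can optionally sanity-check it. The following cell is a minimal sketch: it reads the JSON experience files back with `ray.data.read_json` (the same call our training function uses below) and counts the stored records."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6f708192",
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: read the generated JSON experience files\n",
"# back in as a Ray Dataset and count the stored records.\n",
"dataset = ray.data.read_json(path)\n",
"print(f\"Read {dataset.count()} records from {path}\")"
]
},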
{
"cell_type": "markdown",
"id": "c7534d5c",
"metadata": {},
"source": [
"Then, we run training:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "f7aa671e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Starting offline training\n"
]
},
{
"data": {
"text/html": [
"== Status ==<br>Current time: 2022-05-20 11:58:39 (running for 00:00:25.89)<br>Memory usage on this node: 9.8/16.0 GiB<br>Using FIFO scheduling algorithm.<br>Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/4.13 GiB heap, 0.0/2.0 GiB objects<br>Result logdir: /Users/kai/ray_results/AIRBCTrainer_2022-05-20_11-58-14<br>Number of trials: 1/1 (1 TERMINATED)<br><table>\n",
"<thead>\n",
"<tr><th>Trial name </th><th>status </th><th>loc </th><th style=\"text-align: right;\"> iter</th><th style=\"text-align: right;\"> total time (s)</th><th style=\"text-align: right;\"> ts</th><th style=\"text-align: right;\"> reward</th><th style=\"text-align: right;\"> episode_reward_max</th><th style=\"text-align: right;\"> episode_reward_min</th><th style=\"text-align: right;\"> episode_len_mean</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>AIRBCTrainer_bef2c_00000</td><td>TERMINATED</td><td>127.0.0.1:28876</td><td style=\"text-align: right;\"> 5</td><td style=\"text-align: right;\"> 9.28</td><td style=\"text-align: right;\">2297</td><td style=\"text-align: right;\"> nan</td><td style=\"text-align: right;\"> nan</td><td style=\"text-align: right;\"> nan</td><td style=\"text-align: right;\"> nan</td></tr>\n",
"</tbody>\n",
"</table><br><br>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001B[2m\u001B[36m(RolloutWorker pid=28883)\u001B[0m DatasetReader 2 has 57 samples.\n",
"\u001B[2m\u001B[36m(RolloutWorker pid=28882)\u001B[0m DatasetReader 1 has 57 samples.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-05-20 11:58:40,413\tINFO tune.py:753 -- Total run time: 26.38 seconds (25.84 seconds for the tuning loop).\n"
]
}
],
"source": [
"result = train_rl_bc_offline(path=path, num_workers=2, use_gpu=False)"
]
},
{
"cell_type": "markdown",
"id": "71d7f318",
"metadata": {},
"source": [
"And then, using the obtained checkpoint, we evaluate the policy on a fresh environment:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "53e412cc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001B[2m\u001B[36m(RolloutWorker pid=28906)\u001B[0m DatasetReader 1 has 57 samples.\n",
"\u001B[2m\u001B[36m(RolloutWorker pid=28907)\u001B[0m DatasetReader 2 has 57 samples.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-05-20 11:58:50,042\tINFO trainable.py:589 -- Restored on 127.0.0.1 from checkpoint: /Users/kai/ray_results/AIRBCTrainer_2022-05-20_11-58-14/AIRBCTrainer_bef2c_00000_0_2022-05-20_11-58-14/checkpoint_000005/checkpoint-5\n",
"2022-05-20 11:58:50,043\tINFO trainable.py:597 -- Current state after restoring: {'_iteration': 5, '_timesteps_total': None, '_time_total': 9.279996871948242, '_episodes_total': 0}\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average reward over 3 episodes: 41.333333333333336\n"
]
}
],
"source": [
"num_eval_episodes = 3\n",
"\n",
"rewards = evaluate_using_checkpoint(result.checkpoint, num_episodes=num_eval_episodes)\n",
"print(f\"Average reward over {num_eval_episodes} episodes: {np.mean(rewards)}\")"
]
},
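{
"cell_type": "markdown",
"id": "4e5f6a7b",
"metadata": {},
"source": [
"Finally, we can shut down Ray. This is an optional cleanup step, assuming nothing else in the session still needs the Ray runtime:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d9e0f1a",
"metadata": {},
"outputs": [],
"source": [
"# Disconnect from and stop the Ray instance started by ray.init() above.\n",
"ray.shutdown()"
]
}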
],
"metadata": {
"jupytext": {
"cell_metadata_filter": "-all",
"main_language": "python",
"notebook_metadata_filter": "-all"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}