mirror of
https://github.com/vale981/ray
synced 2025-03-06 18:41:40 -05:00
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3471e19a",
   "metadata": {},
   "source": [
    "# Online reinforcement learning with Ray AIR\n",
    "In this example, we'll train a reinforcement learning agent using online training.\n",
    "\n",
    "Online training means that data from the environment is sampled while the algorithm is running. In contrast, offline training uses data that was collected and stored on disk beforehand."
   ]
  },
|
|
  {
   "cell_type": "markdown",
   "id": "f5083f08",
   "metadata": {},
   "source": [
    "Let's start by installing our dependencies:"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "01f914d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -qU \"ray[rllib]\" gym"
   ]
  },
|
|
  {
   "cell_type": "markdown",
   "id": "980cea70",
   "metadata": {},
   "source": [
    "Now we can run some imports:"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "db0a45ff",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-05-19 13:54:16,520\tWARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!\n",
      "2022-05-19 13:54:16,531\tWARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.agents.marwil` has been deprecated. Use `ray.rllib.algorithms.marwil` instead. This will raise an error in the future!\n"
     ]
    }
   ],
   "source": [
    "import argparse\n",
    "import gym\n",
    "import os\n",
    "\n",
    "import numpy as np\n",
    "import ray\n",
    "from ray.air import Checkpoint\n",
    "from ray.air.config import RunConfig\n",
    "from ray.air.predictors.integrations.rl.rl_predictor import RLPredictor\n",
    "from ray.air.train.integrations.rl.rl_trainer import RLTrainer\n",
    "from ray.air.result import Result\n",
    "from ray.rllib.agents.marwil import BCTrainer\n",
    "from ray.tune.tuner import Tuner"
   ]
  },
|
|
  {
   "cell_type": "markdown",
   "id": "a13db7e4",
   "metadata": {},
   "source": [
    "Here we define the training function. It will create an `RLTrainer` using the `PPO` algorithm and kick off training on the `CartPole-v0` environment:"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "87fca4b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def train_rl_ppo_online(num_workers: int, use_gpu: bool = False) -> Result:\n",
    "    print(\"Starting online training\")\n",
    "    trainer = RLTrainer(\n",
    "        run_config=RunConfig(stop={\"training_iteration\": 5}),\n",
    "        scaling_config={\n",
    "            \"num_workers\": num_workers,\n",
    "            \"use_gpu\": use_gpu,\n",
    "        },\n",
    "        algorithm=\"PPO\",\n",
    "        config={\n",
    "            \"env\": \"CartPole-v0\",\n",
    "            \"framework\": \"tf\",\n",
    "        },\n",
    "    )\n",
    "    # Todo (krfricke/xwjiang): Enable checkpoint config in RunConfig\n",
    "    # result = trainer.fit()\n",
    "    tuner = Tuner(\n",
    "        trainer,\n",
    "        _tuner_kwargs={\"checkpoint_at_end\": True},\n",
    "    )\n",
    "    result = tuner.fit()[0]\n",
    "    return result"
   ]
  },
|
|
  {
   "cell_type": "markdown",
   "id": "f7a5d5c2",
   "metadata": {},
   "source": [
    "Once we have trained our RL policy, we want to evaluate it on a fresh environment. For this, we also define a utility function:"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "2628f3b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate_using_checkpoint(checkpoint: Checkpoint, num_episodes: int) -> list:\n",
    "    predictor = RLPredictor.from_checkpoint(checkpoint)\n",
    "\n",
    "    env = gym.make(\"CartPole-v0\")\n",
    "\n",
    "    rewards = []\n",
    "    for _ in range(num_episodes):\n",
    "        obs = env.reset()\n",
    "        reward = 0.0\n",
    "        done = False\n",
    "        while not done:\n",
    "            action = predictor.predict([obs])\n",
    "            obs, r, done, _ = env.step(action[0])\n",
    "            reward += r\n",
    "        rewards.append(reward)\n",
    "\n",
    "    return rewards"
   ]
  },
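  {
   "cell_type": "markdown",
   "id": "rollout-sketch-md",
   "metadata": {},
   "source": [
    "As a minimal, illustrative sketch of the evaluation loop above: the same control flow, run against tiny stand-in objects so it can be executed without a trained checkpoint. `_StubEnv`, `_StubPredictor`, and `rollout` are hypothetical stand-ins for illustration only, not Ray or Gym APIs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "rollout-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only: `_StubEnv` and `_StubPredictor` are\n",
    "# hypothetical stand-ins, not real Ray/Gym objects.\n",
    "class _StubEnv:\n",
    "    def reset(self):\n",
    "        self._t = 0\n",
    "        return 0.0\n",
    "\n",
    "    def step(self, action):\n",
    "        self._t += 1\n",
    "        # Reward 1.0 per step; episodes end after 3 steps.\n",
    "        return 0.0, 1.0, self._t >= 3, {}\n",
    "\n",
    "\n",
    "class _StubPredictor:\n",
    "    def predict(self, obs_batch):\n",
    "        return [0 for _ in obs_batch]\n",
    "\n",
    "\n",
    "def rollout(predictor, env, num_episodes: int) -> list:\n",
    "    # Same shape as evaluate_using_checkpoint: accumulate per-episode reward.\n",
    "    rewards = []\n",
    "    for _ in range(num_episodes):\n",
    "        obs = env.reset()\n",
    "        total, done = 0.0, False\n",
    "        while not done:\n",
    "            action = predictor.predict([obs])\n",
    "            obs, r, done, _ = env.step(action[0])\n",
    "            total += r\n",
    "        rewards.append(total)\n",
    "    return rewards\n",
    "\n",
    "\n",
    "assert rollout(_StubPredictor(), _StubEnv(), 2) == [3.0, 3.0]"
   ]
  },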
|
|
  {
   "cell_type": "markdown",
   "id": "d226d6aa",
   "metadata": {},
   "source": [
    "Let's put it all together. First, we run training:"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "cae1337e",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-05-19 13:54:16,582\tWARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.agents.dqn.dqn.DEFAULT_CONFIG` has been deprecated. Use `ray.rllib.agents.dqn.dqn.DQNConfig(...)` instead. This will raise an error in the future!\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Starting online training\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-05-19 13:54:19,326\tINFO services.py:1483 -- View the Ray dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8267\u001b[39m\u001b[22m\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "== Status ==<br>Current time: 2022-05-19 13:54:57 (running for 00:00:35.99)<br>Memory usage on this node: 9.6/16.0 GiB<br>Using FIFO scheduling algorithm.<br>Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/4.54 GiB heap, 0.0/2.0 GiB objects<br>Result logdir: /Users/kai/ray_results/AIRPPOTrainer_2022-05-19_13-54-16<br>Number of trials: 1/1 (1 TERMINATED)<br><table>\n",
       "<thead>\n",
       "<tr><th>Trial name </th><th>status </th><th>loc </th><th style=\"text-align: right;\"> iter</th><th style=\"text-align: right;\"> total time (s)</th><th style=\"text-align: right;\"> ts</th><th style=\"text-align: right;\"> reward</th><th style=\"text-align: right;\"> episode_reward_max</th><th style=\"text-align: right;\"> episode_reward_min</th><th style=\"text-align: right;\"> episode_len_mean</th></tr>\n",
       "</thead>\n",
       "<tbody>\n",
       "<tr><td>AIRPPOTrainer_cd8d6_00000</td><td>TERMINATED</td><td>127.0.0.1:14174</td><td style=\"text-align: right;\"> 5</td><td style=\"text-align: right;\"> 16.7029</td><td style=\"text-align: right;\">20000</td><td style=\"text-align: right;\"> 124.79</td><td style=\"text-align: right;\"> 200</td><td style=\"text-align: right;\"> 9</td><td style=\"text-align: right;\"> 124.79</td></tr>\n",
       "</tbody>\n",
       "</table><br><br>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[2m\u001b[33m(raylet)\u001b[0m 2022-05-19 13:54:23,061\tINFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63729 --object-store-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63909 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65260 --redis-password=5241590000000000 --startup-token=16 --runtime-env-hash=-2010331134\n",
      "\u001b[2m\u001b[36m(pid=14174)\u001b[0m 2022-05-19 13:54:30,271\tWARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!\n",
      "\u001b[2m\u001b[36m(AIRPPOTrainer pid=14174)\u001b[0m 2022-05-19 13:54:30,749\tINFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.\n",
      "\u001b[2m\u001b[36m(AIRPPOTrainer pid=14174)\u001b[0m 2022-05-19 13:54:30,750\tINFO ppo.py:361 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.\n",
      "\u001b[2m\u001b[36m(AIRPPOTrainer pid=14174)\u001b[0m 2022-05-19 13:54:30,750\tINFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.\n",
      "\u001b[2m\u001b[33m(raylet)\u001b[0m 2022-05-19 13:54:31,857\tINFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63729 --object-store-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63909 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65260 --redis-password=5241590000000000 --startup-token=17 --runtime-env-hash=-2010331134\n",
      "\u001b[2m\u001b[33m(raylet)\u001b[0m 2022-05-19 13:54:31,857\tINFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=63729 --object-store-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_13-54-16_649144_14093/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=63909 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:65260 --redis-password=5241590000000000 --startup-token=18 --runtime-env-hash=-2010331134\n",
      "\u001b[2m\u001b[36m(RolloutWorker pid=14179)\u001b[0m 2022-05-19 13:54:39,442\tWARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!\n",
      "\u001b[2m\u001b[36m(RolloutWorker pid=14180)\u001b[0m 2022-05-19 13:54:39,492\tWARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!\n",
      "\u001b[2m\u001b[36m(AIRPPOTrainer pid=14174)\u001b[0m 2022-05-19 13:54:40,836\tINFO trainable.py:163 -- Trainable.setup took 10.087 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.\n",
      "\u001b[2m\u001b[36m(AIRPPOTrainer pid=14174)\u001b[0m 2022-05-19 13:54:40,836\tWARNING util.py:65 -- Install gputil for GPU system monitoring.\n",
      "\u001b[2m\u001b[36m(AIRPPOTrainer pid=14174)\u001b[0m 2022-05-19 13:54:42,569\tWARNING deprecation.py:47 -- DeprecationWarning: `slice` has been deprecated. Use `SampleBatch[start:stop]` instead. This will raise an error in the future!\n"
     ]
    },
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Result for AIRPPOTrainer_cd8d6_00000:\n",
|
|
" agent_timesteps_total: 4000\n",
|
|
" counters:\n",
|
|
" num_agent_steps_sampled: 4000\n",
|
|
" num_agent_steps_trained: 4000\n",
|
|
" num_env_steps_sampled: 4000\n",
|
|
" num_env_steps_trained: 4000\n",
|
|
" custom_metrics: {}\n",
|
|
" date: 2022-05-19_13-54-44\n",
|
|
" done: false\n",
|
|
" episode_len_mean: 22.11731843575419\n",
|
|
" episode_media: {}\n",
|
|
" episode_reward_max: 87.0\n",
|
|
" episode_reward_mean: 22.11731843575419\n",
|
|
" episode_reward_min: 8.0\n",
|
|
" episodes_this_iter: 179\n",
|
|
" episodes_total: 179\n",
|
|
" experiment_id: 158c57d8b6e142ad85b393db57c8bdff\n",
|
|
" hostname: Kais-MacBook-Pro.local\n",
|
|
" info:\n",
|
|
" learner:\n",
|
|
" default_policy:\n",
|
|
" custom_metrics: {}\n",
|
|
" learner_stats:\n",
|
|
" cur_kl_coeff: 0.20000000298023224\n",
|
|
" cur_lr: 4.999999873689376e-05\n",
|
|
" entropy: 0.6653298139572144\n",
|
|
" entropy_coeff: 0.0\n",
|
|
" kl: 0.02798665314912796\n",
|
|
" model: {}\n",
|
|
" policy_loss: -0.0422092080116272\n",
|
|
" total_loss: 8.986403465270996\n",
|
|
" vf_explained_var: -0.06533512473106384\n",
|
|
" vf_loss: 9.023015022277832\n",
|
|
" num_agent_steps_trained: 128.0\n",
|
|
" num_agent_steps_sampled: 4000\n",
|
|
" num_agent_steps_trained: 4000\n",
|
|
" num_env_steps_sampled: 4000\n",
|
|
" num_env_steps_trained: 4000\n",
|
|
" iterations_since_restore: 1\n",
|
|
" node_ip: 127.0.0.1\n",
|
|
" num_agent_steps_sampled: 4000\n",
|
|
" num_agent_steps_trained: 4000\n",
|
|
" num_env_steps_sampled: 4000\n",
|
|
" num_env_steps_sampled_this_iter: 4000\n",
|
|
" num_env_steps_trained: 4000\n",
|
|
" num_env_steps_trained_this_iter: 4000\n",
|
|
" num_healthy_workers: 2\n",
|
|
" off_policy_estimator: {}\n",
|
|
" perf:\n",
|
|
" cpu_util_percent: 24.849999999999998\n",
|
|
" ram_util_percent: 61.199999999999996\n",
|
|
" pid: 14174\n",
|
|
" policy_reward_max: {}\n",
|
|
" policy_reward_mean: {}\n",
|
|
" policy_reward_min: {}\n",
|
|
" sampler_perf:\n",
|
|
" mean_action_processing_ms: 0.06886580197141673\n",
|
|
" mean_env_render_ms: 0.0\n",
|
|
" mean_env_wait_ms: 0.05465748139159193\n",
|
|
" mean_inference_ms: 0.6132523881103351\n",
|
|
" mean_raw_obs_processing_ms: 0.10609273714105154\n",
|
|
" sampler_results:\n",
|
|
" custom_metrics: {}\n",
|
|
" episode_len_mean: 22.11731843575419\n",
|
|
" episode_media: {}\n",
|
|
" episode_reward_max: 87.0\n",
|
|
" episode_reward_mean: 22.11731843575419\n",
|
|
" episode_reward_min: 8.0\n",
|
|
" episodes_this_iter: 179\n",
|
|
" hist_stats:\n",
|
|
" episode_lengths:\n",
|
|
" - 28\n",
|
|
" - 9\n",
|
|
" - 12\n",
|
|
" - 23\n",
|
|
" - 13\n",
|
|
" - 21\n",
|
|
" - 15\n",
|
|
" - 16\n",
|
|
" - 19\n",
|
|
" - 44\n",
|
|
" - 14\n",
|
|
" - 19\n",
|
|
" - 19\n",
|
|
" - 17\n",
|
|
" - 17\n",
|
|
" - 12\n",
|
|
" - 9\n",
|
|
" - 48\n",
|
|
" - 43\n",
|
|
" - 15\n",
|
|
" - 21\n",
|
|
" - 25\n",
|
|
" - 16\n",
|
|
" - 14\n",
|
|
" - 22\n",
|
|
" - 21\n",
|
|
" - 24\n",
|
|
" - 53\n",
|
|
" - 21\n",
|
|
" - 16\n",
|
|
" - 17\n",
|
|
" - 14\n",
|
|
" - 20\n",
|
|
" - 22\n",
|
|
" - 18\n",
|
|
" - 17\n",
|
|
" - 14\n",
|
|
" - 11\n",
|
|
" - 46\n",
|
|
" - 12\n",
|
|
" - 18\n",
|
|
" - 21\n",
|
|
" - 13\n",
|
|
" - 58\n",
|
|
" - 10\n",
|
|
" - 20\n",
|
|
" - 14\n",
|
|
" - 25\n",
|
|
" - 22\n",
|
|
" - 33\n",
|
|
" - 23\n",
|
|
" - 10\n",
|
|
" - 25\n",
|
|
" - 11\n",
|
|
" - 32\n",
|
|
" - 48\n",
|
|
" - 12\n",
|
|
" - 12\n",
|
|
" - 10\n",
|
|
" - 24\n",
|
|
" - 15\n",
|
|
" - 28\n",
|
|
" - 14\n",
|
|
" - 16\n",
|
|
" - 14\n",
|
|
" - 21\n",
|
|
" - 12\n",
|
|
" - 13\n",
|
|
" - 8\n",
|
|
" - 12\n",
|
|
" - 13\n",
|
|
" - 10\n",
|
|
" - 10\n",
|
|
" - 14\n",
|
|
" - 30\n",
|
|
" - 16\n",
|
|
" - 23\n",
|
|
" - 47\n",
|
|
" - 14\n",
|
|
" - 22\n",
|
|
" - 11\n",
|
|
" - 18\n",
|
|
" - 12\n",
|
|
" - 21\n",
|
|
" - 21\n",
|
|
" - 20\n",
|
|
" - 18\n",
|
|
" - 29\n",
|
|
" - 18\n",
|
|
" - 24\n",
|
|
" - 50\n",
|
|
" - 87\n",
|
|
" - 21\n",
|
|
" - 41\n",
|
|
" - 21\n",
|
|
" - 34\n",
|
|
" - 47\n",
|
|
" - 20\n",
|
|
" - 26\n",
|
|
" - 14\n",
|
|
" - 9\n",
|
|
" - 24\n",
|
|
" - 16\n",
|
|
" - 18\n",
|
|
" - 44\n",
|
|
" - 28\n",
|
|
" - 37\n",
|
|
" - 10\n",
|
|
" - 19\n",
|
|
" - 11\n",
|
|
" - 56\n",
|
|
" - 11\n",
|
|
" - 28\n",
|
|
" - 16\n",
|
|
" - 14\n",
|
|
" - 19\n",
|
|
" - 23\n",
|
|
" - 11\n",
|
|
" - 22\n",
|
|
" - 63\n",
|
|
" - 22\n",
|
|
" - 13\n",
|
|
" - 29\n",
|
|
" - 11\n",
|
|
" - 64\n",
|
|
" - 44\n",
|
|
" - 45\n",
|
|
" - 38\n",
|
|
" - 17\n",
|
|
" - 18\n",
|
|
" - 21\n",
|
|
" - 13\n",
|
|
" - 12\n",
|
|
" - 13\n",
|
|
" - 10\n",
|
|
" - 17\n",
|
|
" - 14\n",
|
|
" - 16\n",
|
|
" - 10\n",
|
|
" - 19\n",
|
|
" - 25\n",
|
|
" - 15\n",
|
|
" - 50\n",
|
|
" - 13\n",
|
|
" - 10\n",
|
|
" - 15\n",
|
|
" - 12\n",
|
|
" - 15\n",
|
|
" - 11\n",
|
|
" - 14\n",
|
|
" - 17\n",
|
|
" - 17\n",
|
|
" - 14\n",
|
|
" - 49\n",
|
|
" - 18\n",
|
|
" - 13\n",
|
|
" - 28\n",
|
|
" - 31\n",
|
|
" - 19\n",
|
|
" - 26\n",
|
|
" - 31\n",
|
|
" - 29\n",
|
|
" - 21\n",
|
|
" - 23\n",
|
|
" - 17\n",
|
|
" - 23\n",
|
|
" - 32\n",
|
|
" - 35\n",
|
|
" - 10\n",
|
|
" - 11\n",
|
|
" - 30\n",
|
|
" - 21\n",
|
|
" - 16\n",
|
|
" - 15\n",
|
|
" - 23\n",
|
|
" - 40\n",
|
|
" - 24\n",
|
|
" - 24\n",
|
|
" - 14\n",
|
|
" episode_reward:\n",
|
|
" - 28.0\n",
|
|
" - 9.0\n",
|
|
" - 12.0\n",
|
|
" - 23.0\n",
|
|
" - 13.0\n",
|
|
" - 21.0\n",
|
|
" - 15.0\n",
|
|
" - 16.0\n",
|
|
" - 19.0\n",
|
|
" - 44.0\n",
|
|
" - 14.0\n",
|
|
" - 19.0\n",
|
|
" - 19.0\n",
|
|
" - 17.0\n",
|
|
" - 17.0\n",
|
|
" - 12.0\n",
|
|
" - 9.0\n",
|
|
" - 48.0\n",
|
|
" - 43.0\n",
|
|
" - 15.0\n",
|
|
" - 21.0\n",
|
|
" - 25.0\n",
|
|
" - 16.0\n",
|
|
" - 14.0\n",
|
|
" - 22.0\n",
|
|
" - 21.0\n",
|
|
" - 24.0\n",
|
|
" - 53.0\n",
|
|
" - 21.0\n",
|
|
" - 16.0\n",
|
|
" - 17.0\n",
|
|
" - 14.0\n",
|
|
" - 20.0\n",
|
|
" - 22.0\n",
|
|
" - 18.0\n",
|
|
" - 17.0\n",
|
|
" - 14.0\n",
|
|
" - 11.0\n",
|
|
" - 46.0\n",
|
|
" - 12.0\n",
|
|
" - 18.0\n",
|
|
" - 21.0\n",
|
|
" - 13.0\n",
|
|
" - 58.0\n",
|
|
" - 10.0\n",
|
|
" - 20.0\n",
|
|
" - 14.0\n",
|
|
" - 25.0\n",
|
|
" - 22.0\n",
|
|
" - 33.0\n",
|
|
" - 23.0\n",
|
|
" - 10.0\n",
|
|
" - 25.0\n",
|
|
" - 11.0\n",
|
|
" - 32.0\n",
|
|
" - 48.0\n",
|
|
" - 12.0\n",
|
|
" - 12.0\n",
|
|
" - 10.0\n",
|
|
" - 24.0\n",
|
|
" - 15.0\n",
|
|
" - 28.0\n",
|
|
" - 14.0\n",
|
|
" - 16.0\n",
|
|
" - 14.0\n",
|
|
" - 21.0\n",
|
|
" - 12.0\n",
|
|
" - 13.0\n",
|
|
" - 8.0\n",
|
|
" - 12.0\n",
|
|
" - 13.0\n",
|
|
" - 10.0\n",
|
|
" - 10.0\n",
|
|
" - 14.0\n",
|
|
" - 30.0\n",
|
|
" - 16.0\n",
|
|
" - 23.0\n",
|
|
" - 47.0\n",
|
|
" - 14.0\n",
|
|
" - 22.0\n",
|
|
" - 11.0\n",
|
|
" - 18.0\n",
|
|
" - 12.0\n",
|
|
" - 21.0\n",
|
|
" - 21.0\n",
|
|
" - 20.0\n",
|
|
" - 18.0\n",
|
|
" - 29.0\n",
|
|
" - 18.0\n",
|
|
" - 24.0\n",
|
|
" - 50.0\n",
|
|
" - 87.0\n",
|
|
" - 21.0\n",
|
|
" - 41.0\n",
|
|
" - 21.0\n",
|
|
" - 34.0\n",
|
|
" - 47.0\n",
|
|
" - 20.0\n",
|
|
" - 26.0\n",
|
|
" - 14.0\n",
|
|
" - 9.0\n",
|
|
" - 24.0\n",
|
|
" - 16.0\n",
|
|
" - 18.0\n",
|
|
" - 44.0\n",
|
|
" - 28.0\n",
|
|
" - 37.0\n",
|
|
" - 10.0\n",
|
|
" - 19.0\n",
|
|
" - 11.0\n",
|
|
" - 56.0\n",
|
|
" - 11.0\n",
|
|
" - 28.0\n",
|
|
" - 16.0\n",
|
|
" - 14.0\n",
|
|
" - 19.0\n",
|
|
" - 23.0\n",
|
|
" - 11.0\n",
|
|
" - 22.0\n",
|
|
" - 63.0\n",
|
|
" - 22.0\n",
|
|
" - 13.0\n",
|
|
" - 29.0\n",
|
|
" - 11.0\n",
|
|
" - 64.0\n",
|
|
" - 44.0\n",
|
|
" - 45.0\n",
|
|
" - 38.0\n",
|
|
" - 17.0\n",
|
|
" - 18.0\n",
|
|
" - 21.0\n",
|
|
" - 13.0\n",
|
|
" - 12.0\n",
|
|
" - 13.0\n",
|
|
" - 10.0\n",
|
|
" - 17.0\n",
|
|
" - 14.0\n",
|
|
" - 16.0\n",
|
|
" - 10.0\n",
|
|
" - 19.0\n",
|
|
" - 25.0\n",
|
|
" - 15.0\n",
|
|
" - 50.0\n",
|
|
" - 13.0\n",
|
|
" - 10.0\n",
|
|
" - 15.0\n",
|
|
" - 12.0\n",
|
|
" - 15.0\n",
|
|
" - 11.0\n",
|
|
" - 14.0\n",
|
|
" - 17.0\n",
|
|
" - 17.0\n",
|
|
" - 14.0\n",
|
|
" - 49.0\n",
|
|
" - 18.0\n",
|
|
" - 13.0\n",
|
|
" - 28.0\n",
|
|
" - 31.0\n",
|
|
" - 19.0\n",
|
|
" - 26.0\n",
|
|
" - 31.0\n",
|
|
" - 29.0\n",
|
|
" - 21.0\n",
|
|
" - 23.0\n",
|
|
" - 17.0\n",
|
|
" - 23.0\n",
|
|
" - 32.0\n",
|
|
" - 35.0\n",
|
|
" - 10.0\n",
|
|
" - 11.0\n",
|
|
" - 30.0\n",
|
|
" - 21.0\n",
|
|
" - 16.0\n",
|
|
" - 15.0\n",
|
|
" - 23.0\n",
|
|
" - 40.0\n",
|
|
" - 24.0\n",
|
|
" - 24.0\n",
|
|
" - 14.0\n",
|
|
" off_policy_estimator: {}\n",
|
|
" policy_reward_max: {}\n",
|
|
" policy_reward_mean: {}\n",
|
|
" policy_reward_min: {}\n",
|
|
" sampler_perf:\n",
|
|
" mean_action_processing_ms: 0.06886580197141673\n",
|
|
" mean_env_render_ms: 0.0\n",
|
|
" mean_env_wait_ms: 0.05465748139159193\n",
|
|
" mean_inference_ms: 0.6132523881103351\n",
|
|
" mean_raw_obs_processing_ms: 0.10609273714105154\n",
|
|
" time_since_restore: 3.7304069995880127\n",
|
|
" time_this_iter_s: 3.7304069995880127\n",
|
|
" time_total_s: 3.7304069995880127\n",
|
|
" timers:\n",
|
|
" learn_throughput: 2006.2\n",
|
|
" learn_time_ms: 1993.819\n",
|
|
" load_throughput: 24708712.813\n",
|
|
" load_time_ms: 0.162\n",
|
|
" training_iteration_time_ms: 3726.731\n",
|
|
" update_time_ms: 1.95\n",
|
|
" timestamp: 1652964884\n",
|
|
" timesteps_since_restore: 0\n",
|
|
" timesteps_total: 4000\n",
|
|
" training_iteration: 1\n",
|
|
" trial_id: cd8d6_00000\n",
|
|
" warmup_time: 10.095139741897583\n",
|
|
" \n",
|
|
"Result for AIRPPOTrainer_cd8d6_00000:\n",
|
|
" agent_timesteps_total: 12000\n",
|
|
" counters:\n",
|
|
" num_agent_steps_sampled: 12000\n",
|
|
" num_agent_steps_trained: 12000\n",
|
|
" num_env_steps_sampled: 12000\n",
|
|
" num_env_steps_trained: 12000\n",
|
|
" custom_metrics: {}\n",
|
|
" date: 2022-05-19_13-54-51\n",
|
|
" done: false\n",
|
|
" episode_len_mean: 65.15\n",
|
|
" episode_media: {}\n",
|
|
" episode_reward_max: 200.0\n",
|
|
" episode_reward_mean: 65.15\n",
|
|
" episode_reward_min: 9.0\n",
|
|
" episodes_this_iter: 44\n",
|
|
" episodes_total: 311\n",
|
|
" experiment_id: 158c57d8b6e142ad85b393db57c8bdff\n",
|
|
" hostname: Kais-MacBook-Pro.local\n",
|
|
" info:\n",
|
|
" learner:\n",
|
|
" default_policy:\n",
|
|
" custom_metrics: {}\n",
|
|
" learner_stats:\n",
|
|
" cur_kl_coeff: 0.30000001192092896\n",
|
|
" cur_lr: 4.999999873689376e-05\n",
|
|
" entropy: 0.5750519633293152\n",
|
|
" entropy_coeff: 0.0\n",
|
|
" kl: 0.012749233283102512\n",
|
|
" model: {}\n",
|
|
" policy_loss: -0.026830431073904037\n",
|
|
" total_loss: 9.414541244506836\n",
|
|
" vf_explained_var: 0.046859823167324066\n",
|
|
" vf_loss: 9.43754768371582\n",
|
|
" num_agent_steps_trained: 128.0\n",
|
|
" num_agent_steps_sampled: 12000\n",
|
|
" num_agent_steps_trained: 12000\n",
|
|
" num_env_steps_sampled: 12000\n",
|
|
" num_env_steps_trained: 12000\n",
|
|
" iterations_since_restore: 3\n",
|
|
" node_ip: 127.0.0.1\n",
|
|
" num_agent_steps_sampled: 12000\n",
|
|
" num_agent_steps_trained: 12000\n",
|
|
" num_env_steps_sampled: 12000\n",
|
|
" num_env_steps_sampled_this_iter: 4000\n",
|
|
" num_env_steps_trained: 12000\n",
|
|
" num_env_steps_trained_this_iter: 4000\n",
|
|
" num_healthy_workers: 2\n",
|
|
" off_policy_estimator: {}\n",
|
|
" perf:\n",
|
|
" cpu_util_percent: 20.9\n",
|
|
" ram_util_percent: 61.379999999999995\n",
|
|
" pid: 14174\n",
|
|
" policy_reward_max: {}\n",
|
|
" policy_reward_mean: {}\n",
|
|
" policy_reward_min: {}\n",
|
|
" sampler_perf:\n",
|
|
" mean_action_processing_ms: 0.06834399059626647\n",
|
|
" mean_env_render_ms: 0.0\n",
|
|
" mean_env_wait_ms: 0.05423359203664157\n",
|
|
" mean_inference_ms: 0.5997818239241897\n",
|
|
" mean_raw_obs_processing_ms: 0.0982917359628421\n",
|
|
" sampler_results:\n",
|
|
" custom_metrics: {}\n",
|
|
" episode_len_mean: 65.15\n",
|
|
" episode_media: {}\n",
|
|
" episode_reward_max: 200.0\n",
|
|
" episode_reward_mean: 65.15\n",
|
|
" episode_reward_min: 9.0\n",
|
|
" episodes_this_iter: 44\n",
|
|
" hist_stats:\n",
|
|
" episode_lengths:\n",
|
|
" - 34\n",
|
|
" - 37\n",
|
|
" - 38\n",
|
|
" - 23\n",
|
|
" - 29\n",
|
|
" - 56\n",
|
|
" - 38\n",
|
|
" - 13\n",
|
|
" - 10\n",
|
|
" - 18\n",
|
|
" - 40\n",
|
|
" - 23\n",
|
|
" - 46\n",
|
|
" - 84\n",
|
|
" - 29\n",
|
|
" - 44\n",
|
|
" - 54\n",
|
|
" - 32\n",
|
|
" - 30\n",
|
|
" - 100\n",
|
|
" - 28\n",
|
|
" - 67\n",
|
|
" - 47\n",
|
|
" - 40\n",
|
|
" - 74\n",
|
|
" - 133\n",
|
|
" - 32\n",
|
|
" - 28\n",
|
|
" - 86\n",
|
|
" - 133\n",
|
|
" - 46\n",
|
|
" - 60\n",
|
|
" - 17\n",
|
|
" - 43\n",
|
|
" - 12\n",
|
|
" - 51\n",
|
|
" - 57\n",
|
|
" - 70\n",
|
|
" - 54\n",
|
|
" - 73\n",
|
|
" - 16\n",
|
|
" - 29\n",
|
|
" - 113\n",
|
|
" - 45\n",
|
|
" - 31\n",
|
|
" - 44\n",
|
|
" - 103\n",
|
|
" - 62\n",
|
|
" - 72\n",
|
|
" - 20\n",
|
|
" - 15\n",
|
|
" - 35\n",
|
|
" - 12\n",
|
|
" - 9\n",
|
|
" - 24\n",
|
|
" - 10\n",
|
|
" - 102\n",
|
|
" - 93\n",
|
|
" - 73\n",
|
|
" - 27\n",
|
|
" - 52\n",
|
|
" - 144\n",
|
|
" - 19\n",
|
|
" - 140\n",
|
|
" - 91\n",
|
|
" - 133\n",
|
|
" - 147\n",
|
|
" - 140\n",
|
|
" - 90\n",
|
|
" - 14\n",
|
|
" - 73\n",
|
|
" - 71\n",
|
|
" - 200\n",
|
|
" - 55\n",
|
|
" - 184\n",
|
|
" - 103\n",
|
|
" - 196\n",
|
|
" - 168\n",
|
|
" - 177\n",
|
|
" - 38\n",
|
|
" - 33\n",
|
|
" - 50\n",
|
|
" - 149\n",
|
|
" - 67\n",
|
|
" - 87\n",
|
|
" - 25\n",
|
|
" - 134\n",
|
|
" - 42\n",
|
|
" - 26\n",
|
|
" - 24\n",
|
|
" - 121\n",
|
|
" - 61\n",
|
|
" - 109\n",
|
|
" - 19\n",
|
|
" - 200\n",
|
|
" - 60\n",
|
|
" - 40\n",
|
|
" - 51\n",
|
|
" - 88\n",
|
|
" - 30\n",
|
|
" episode_reward:\n",
|
|
" - 34.0\n",
|
|
" - 37.0\n",
|
|
" - 38.0\n",
|
|
" - 23.0\n",
|
|
" - 29.0\n",
|
|
" - 56.0\n",
|
|
" - 38.0\n",
|
|
" - 13.0\n",
|
|
" - 10.0\n",
|
|
" - 18.0\n",
|
|
" - 40.0\n",
|
|
" - 23.0\n",
|
|
" - 46.0\n",
|
|
" - 84.0\n",
|
|
" - 29.0\n",
|
|
" - 44.0\n",
|
|
" - 54.0\n",
|
|
" - 32.0\n",
|
|
" - 30.0\n",
|
|
" - 100.0\n",
|
|
" - 28.0\n",
|
|
" - 67.0\n",
|
|
" - 47.0\n",
|
|
" - 40.0\n",
|
|
" - 74.0\n",
|
|
" - 133.0\n",
|
|
" - 32.0\n",
|
|
" - 28.0\n",
|
|
" - 86.0\n",
|
|
" - 133.0\n",
|
|
" - 46.0\n",
|
|
" - 60.0\n",
|
|
" - 17.0\n",
|
|
" - 43.0\n",
|
|
" - 12.0\n",
|
|
" - 51.0\n",
|
|
" - 57.0\n",
|
|
" - 70.0\n",
|
|
" - 54.0\n",
|
|
" - 73.0\n",
|
|
" - 16.0\n",
|
|
" - 29.0\n",
|
|
" - 113.0\n",
|
|
" - 45.0\n",
|
|
" - 31.0\n",
|
|
" - 44.0\n",
|
|
" - 103.0\n",
|
|
" - 62.0\n",
|
|
" - 72.0\n",
|
|
" - 20.0\n",
|
|
" - 15.0\n",
|
|
" - 35.0\n",
|
|
" - 12.0\n",
|
|
" - 9.0\n",
|
|
" - 24.0\n",
|
|
" - 10.0\n",
|
|
" - 102.0\n",
|
|
" - 93.0\n",
|
|
" - 73.0\n",
|
|
" - 27.0\n",
|
|
" - 52.0\n",
|
|
" - 144.0\n",
|
|
" - 19.0\n",
|
|
" - 140.0\n",
|
|
" - 91.0\n",
|
|
" - 133.0\n",
|
|
" - 147.0\n",
|
|
" - 140.0\n",
|
|
" - 90.0\n",
|
|
" - 14.0\n",
|
|
" - 73.0\n",
|
|
" - 71.0\n",
|
|
" - 200.0\n",
|
|
" - 55.0\n",
|
|
" - 184.0\n",
|
|
" - 103.0\n",
|
|
" - 196.0\n",
|
|
" - 168.0\n",
|
|
" - 177.0\n",
|
|
" - 38.0\n",
|
|
" - 33.0\n",
|
|
" - 50.0\n",
|
|
" - 149.0\n",
|
|
" - 67.0\n",
|
|
" - 87.0\n",
|
|
" - 25.0\n",
|
|
" - 134.0\n",
|
|
" - 42.0\n",
|
|
" - 26.0\n",
|
|
" - 24.0\n",
|
|
" - 121.0\n",
|
|
" - 61.0\n",
|
|
" - 109.0\n",
|
|
" - 19.0\n",
|
|
" - 200.0\n",
|
|
" - 60.0\n",
|
|
" - 40.0\n",
|
|
" - 51.0\n",
|
|
" - 88.0\n",
|
|
" - 30.0\n",
|
|
" off_policy_estimator: {}\n",
|
|
" policy_reward_max: {}\n",
|
|
" policy_reward_mean: {}\n",
|
|
" policy_reward_min: {}\n",
|
|
" sampler_perf:\n",
|
|
" mean_action_processing_ms: 0.06834399059626647\n",
|
|
" mean_env_render_ms: 0.0\n",
|
|
" mean_env_wait_ms: 0.05423359203664157\n",
|
|
" mean_inference_ms: 0.5997818239241897\n",
|
|
" mean_raw_obs_processing_ms: 0.0982917359628421\n",
|
|
" time_since_restore: 10.289561986923218\n",
|
|
" time_this_iter_s: 3.3495230674743652\n",
|
|
" time_total_s: 10.289561986923218\n",
|
|
" timers:\n",
|
|
" learn_throughput: 2276.977\n",
|
|
" learn_time_ms: 1756.715\n",
|
|
" load_throughput: 20798201.653\n",
|
|
" load_time_ms: 0.192\n",
|
|
" training_iteration_time_ms: 3425.704\n",
|
|
" update_time_ms: 1.814\n",
|
|
" timestamp: 1652964891\n",
|
|
" timesteps_since_restore: 0\n",
|
|
" timesteps_total: 12000\n",
|
|
" training_iteration: 3\n",
|
|
" trial_id: cd8d6_00000\n",
|
|
" warmup_time: 10.095139741897583\n",
|
|
" \n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Result for AIRPPOTrainer_cd8d6_00000:\n",
|
|
" agent_timesteps_total: 20000\n",
|
|
" counters:\n",
|
|
" num_agent_steps_sampled: 20000\n",
|
|
" num_agent_steps_trained: 20000\n",
|
|
" num_env_steps_sampled: 20000\n",
|
|
" num_env_steps_trained: 20000\n",
|
|
" custom_metrics: {}\n",
|
|
" date: 2022-05-19_13-54-57\n",
|
|
" done: true\n",
|
|
" episode_len_mean: 124.79\n",
|
|
" episode_media: {}\n",
|
|
" episode_reward_max: 200.0\n",
|
|
" episode_reward_mean: 124.79\n",
|
|
" episode_reward_min: 9.0\n",
|
|
" episodes_this_iter: 20\n",
|
|
" episodes_total: 354\n",
|
|
" experiment_id: 158c57d8b6e142ad85b393db57c8bdff\n",
|
|
" hostname: Kais-MacBook-Pro.local\n",
|
|
" info:\n",
|
|
" learner:\n",
|
|
" default_policy:\n",
|
|
" custom_metrics: {}\n",
|
|
" learner_stats:\n",
|
|
" cur_kl_coeff: 0.30000001192092896\n",
|
|
" cur_lr: 4.999999873689376e-05\n",
|
|
" entropy: 0.5436986684799194\n",
|
|
" entropy_coeff: 0.0\n",
|
|
" kl: 0.0034858626313507557\n",
|
|
" model: {}\n",
|
|
" policy_loss: -0.012989979237318039\n",
|
|
" total_loss: 9.49295425415039\n",
|
|
" vf_explained_var: 0.025460055097937584\n",
|
|
" vf_loss: 9.504897117614746\n",
|
|
" num_agent_steps_trained: 128.0\n",
|
|
" num_agent_steps_sampled: 20000\n",
|
|
" num_agent_steps_trained: 20000\n",
|
|
" num_env_steps_sampled: 20000\n",
|
|
" num_env_steps_trained: 20000\n",
|
|
" iterations_since_restore: 5\n",
|
|
" node_ip: 127.0.0.1\n",
|
|
" num_agent_steps_sampled: 20000\n",
|
|
" num_agent_steps_trained: 20000\n",
|
|
" num_env_steps_sampled: 20000\n",
|
|
" num_env_steps_sampled_this_iter: 4000\n",
|
|
" num_env_steps_trained: 20000\n",
|
|
" num_env_steps_trained_this_iter: 4000\n",
|
|
" num_healthy_workers: 2\n",
|
|
" off_policy_estimator: {}\n",
|
|
" perf:\n",
|
|
" cpu_util_percent: 24.599999999999998\n",
|
|
" ram_util_percent: 59.775\n",
|
|
" pid: 14174\n",
|
|
" policy_reward_max: {}\n",
|
|
" policy_reward_mean: {}\n",
|
|
" policy_reward_min: {}\n",
|
|
" sampler_perf:\n",
|
|
" mean_action_processing_ms: 0.06817872750804764\n",
|
|
" mean_env_render_ms: 0.0\n",
|
|
" mean_env_wait_ms: 0.05424549075766555\n",
|
|
" mean_inference_ms: 0.5976919122059019\n",
|
|
" mean_raw_obs_processing_ms: 0.09603803519062176\n",
|
|
" sampler_results:\n",
|
|
" custom_metrics: {}\n",
|
|
" episode_len_mean: 124.79\n",
|
|
" episode_media: {}\n",
" episode_reward_max: 200.0\n",
" episode_reward_mean: 124.79\n",
" episode_reward_min: 9.0\n",
" episodes_this_iter: 20\n",
" hist_stats:\n",
" episode_lengths:\n",
" - 45\n",
" - 31\n",
" - 44\n",
" - 103\n",
" - 62\n",
" - 72\n",
" - 20\n",
" - 15\n",
" - 35\n",
" - 12\n",
" - 9\n",
" - 24\n",
" - 10\n",
" - 102\n",
" - 93\n",
" - 73\n",
" - 27\n",
" - 52\n",
" - 144\n",
" - 19\n",
" - 140\n",
" - 91\n",
" - 133\n",
" - 147\n",
" - 140\n",
" - 90\n",
" - 14\n",
" - 73\n",
" - 71\n",
" - 200\n",
" - 55\n",
" - 184\n",
" - 103\n",
" - 196\n",
" - 168\n",
" - 177\n",
" - 38\n",
" - 33\n",
" - 50\n",
" - 149\n",
" - 67\n",
" - 87\n",
" - 25\n",
" - 134\n",
" - 42\n",
" - 26\n",
" - 24\n",
" - 121\n",
" - 61\n",
" - 109\n",
" - 19\n",
" - 200\n",
" - 60\n",
" - 40\n",
" - 51\n",
" - 88\n",
" - 30\n",
" - 200\n",
" - 186\n",
" - 200\n",
" - 182\n",
" - 196\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 43\n",
" - 200\n",
" - 109\n",
" - 156\n",
" - 200\n",
" - 183\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 107\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 89\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" - 200\n",
" episode_reward:\n",
" - 45.0\n",
" - 31.0\n",
" - 44.0\n",
" - 103.0\n",
" - 62.0\n",
" - 72.0\n",
" - 20.0\n",
" - 15.0\n",
" - 35.0\n",
" - 12.0\n",
" - 9.0\n",
" - 24.0\n",
" - 10.0\n",
" - 102.0\n",
" - 93.0\n",
" - 73.0\n",
" - 27.0\n",
" - 52.0\n",
" - 144.0\n",
" - 19.0\n",
" - 140.0\n",
" - 91.0\n",
" - 133.0\n",
" - 147.0\n",
" - 140.0\n",
" - 90.0\n",
" - 14.0\n",
" - 73.0\n",
" - 71.0\n",
" - 200.0\n",
" - 55.0\n",
" - 184.0\n",
" - 103.0\n",
" - 196.0\n",
" - 168.0\n",
" - 177.0\n",
" - 38.0\n",
" - 33.0\n",
" - 50.0\n",
" - 149.0\n",
" - 67.0\n",
" - 87.0\n",
" - 25.0\n",
" - 134.0\n",
" - 42.0\n",
" - 26.0\n",
" - 24.0\n",
" - 121.0\n",
" - 61.0\n",
" - 109.0\n",
" - 19.0\n",
" - 200.0\n",
" - 60.0\n",
" - 40.0\n",
" - 51.0\n",
" - 88.0\n",
" - 30.0\n",
" - 200.0\n",
" - 186.0\n",
" - 200.0\n",
" - 182.0\n",
" - 196.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 43.0\n",
" - 200.0\n",
" - 109.0\n",
" - 156.0\n",
" - 200.0\n",
" - 183.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 107.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 89.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" - 200.0\n",
" off_policy_estimator: {}\n",
" policy_reward_max: {}\n",
" policy_reward_mean: {}\n",
" policy_reward_min: {}\n",
" sampler_perf:\n",
" mean_action_processing_ms: 0.06817872750804764\n",
" mean_env_render_ms: 0.0\n",
" mean_env_wait_ms: 0.05424549075766555\n",
" mean_inference_ms: 0.5976919122059019\n",
" mean_raw_obs_processing_ms: 0.09603803519062176\n",
" time_since_restore: 16.702913284301758\n",
" time_this_iter_s: 3.1872010231018066\n",
" time_total_s: 16.702913284301758\n",
" timers:\n",
" learn_throughput: 2378.661\n",
" learn_time_ms: 1681.619\n",
" load_throughput: 16503261.853\n",
" load_time_ms: 0.242\n",
" training_iteration_time_ms: 3336.7\n",
" update_time_ms: 1.759\n",
" timestamp: 1652964897\n",
" timesteps_since_restore: 0\n",
" timesteps_total: 20000\n",
" training_iteration: 5\n",
" trial_id: cd8d6_00000\n",
" warmup_time: 10.095139741897583\n",
" \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-05-19 13:54:58,548\tINFO tune.py:753 -- Total run time: 36.92 seconds (35.95 seconds for the tuning loop).\n"
]
}
],
"source": [
"result = train_rl_ppo_online(num_workers=2, use_gpu=False)"
]
},
{
"cell_type": "markdown",
"id": "6714a3d6",
"metadata": {},
"source": [
"And then, using the obtained checkpoint, we evaluate the policy on a fresh environment:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "b73bfa0f",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-05-19 13:54:58,589\tINFO trainer.py:1728 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.\n",
"2022-05-19 13:54:58,590\tWARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!\n",
"2022-05-19 13:54:58,591\tINFO ppo.py:361 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.\n",
"2022-05-19 13:54:58,591\tINFO trainer.py:328 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.\n",
"\u001b[2m\u001b[36m(RolloutWorker pid=14191)\u001b[0m 2022-05-19 13:55:06,622\tWARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!\n",
"\u001b[2m\u001b[36m(RolloutWorker pid=14192)\u001b[0m 2022-05-19 13:55:06,622\tWARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.execution.buffers` has been deprecated. Use `ray.rllib.utils.replay_buffers` instead. This will raise an error in the future!\n",
"2022-05-19 13:55:07,968\tWARNING util.py:65 -- Install gputil for GPU system monitoring.\n",
"2022-05-19 13:55:08,021\tINFO trainable.py:589 -- Restored on 127.0.0.1 from checkpoint: /Users/kai/ray_results/AIRPPOTrainer_2022-05-19_13-54-16/AIRPPOTrainer_cd8d6_00000_0_2022-05-19_13-54-22/checkpoint_000005/checkpoint-5\n",
"2022-05-19 13:55:08,021\tINFO trainable.py:597 -- Current state after restoring: {'_iteration': 5, '_timesteps_total': None, '_time_total': 16.702913284301758, '_episodes_total': 354}\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average reward over 3 episodes: 200.0\n"
]
}
],
"source": [
"num_eval_episodes = 3\n",
"\n",
"rewards = evaluate_using_checkpoint(result.checkpoint, num_episodes=num_eval_episodes)\n",
"print(f\"Average reward over {num_eval_episodes} episodes: \" f\"{np.mean(rewards)}\")"
]
}
],
"metadata": {
"jupytext": {
"cell_metadata_filter": "-all",
"main_language": "python",
"notebook_metadata_filter": "-all"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}