ray/rllib/examples/bandit/tune_lin_ucb_train_recsim_env.py

"""Example of using LinUCB on a RecSim environment. """
import argparse
import time

from matplotlib import pyplot as plt
import pandas as pd

import ray
from ray import air, tune

# Imported for its side effect of registering RLlib's RecSim example envs
# (including the "RecSim-v1" used below).
import ray.rllib.examples.env.recommender_system_envs_with_recsim  # noqa

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--framework",
        choices=["tf2", "torch"],
        default="torch",
        help="The DL framework specifier.",
    )
    args = parser.parse_args()
    print(f"Running with the following CLI args: {args}")

    ray.init()

    config = {
        # "RecSim-v1" is a pre-registered RecSim env.
        # Alternatively, you can do:
        # `from ray.rllib.examples.env.recommender_system_envs_with_recsim import ...`
        # - LongTermSatisfactionRecSimEnv
        # - InterestExplorationRecSimEnv
        # - InterestEvolutionRecSimEnv
        # Then: "env": [the imported RecSim class]
        "env": "RecSim-v1",
        "env_config": {
            "num_candidates": 10,
            "slate_size": 1,
"convert_to_discrete_action_space": True,
"wrap_for_bandits": True,
},
"framework": args.framework,
"eager_tracing": (args.framework == "tf2"),
}
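    # For reference, a minimal sketch of the same setup via RLlib's typed
    # config API (assumes Ray >= 2.x, where `BanditLinUCBConfig` is exposed
    # under `ray.rllib.algorithms.bandit`):
    #
    #   from ray.rllib.algorithms.bandit import BanditLinUCBConfig
    #
    #   config = (
    #       BanditLinUCBConfig()
    #       .environment(env="RecSim-v1", env_config=config["env_config"])
    #       .framework(args.framework, eager_tracing=args.framework == "tf2")
    #   ).to_dict()
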
    # Each `train()` call samples (at least) `min_sample_timesteps_per_iteration`
    # env timesteps (100 by default), so 100 training iterations amount to
    # roughly 100 * 100 = 10,000 env timesteps in total.
    training_iterations = 100
    print("Running training for %s iterations" % training_iterations)

    start_time = time.time()
    tuner = tune.Tuner(
        "BanditLinUCB",
        param_space=config,
        run_config=air.RunConfig(
            stop={"training_iteration": training_iterations},
            checkpoint_config=air.CheckpointConfig(
                checkpoint_at_end=False,
            ),
        ),
        tune_config=tune.TuneConfig(
            num_samples=1,
        ),
    )
    results = tuner.fit()
    print("The trials took", time.time() - start_time, "seconds\n")

    # Aggregate the episode-reward stats of all trials over env timesteps.
    frame = pd.DataFrame()
    for result in results:
        # `DataFrame.append` was removed in pandas 2.0; use `pd.concat` instead.
        frame = pd.concat([frame, result.metrics_dataframe], ignore_index=True)
    x = frame.groupby("agent_timesteps_total")["episode_reward_mean"].aggregate(
        ["mean", "max", "min", "std"]
    )

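    # NOTE: With `num_samples=1` above, the `std` column is NaN (the std of a
    # single trial is undefined), so the shaded band below only shows up when
    # running more than one trial (e.g. `num_samples=2`).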
    plt.plot(x["mean"])
    plt.fill_between(
        x.index, x["mean"] - x["std"], x["mean"] + x["std"], color="b", alpha=0.2
    )
    plt.title("Episode reward mean")
    plt.xlabel("Agent timesteps (total)")
    plt.show()
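
    # Usage (from the repo root; path per the header of this file):
    #   python ray/rllib/examples/bandit/tune_lin_ucb_train_recsim_env.py \
    #       --framework=torch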