{
"cells": [
{
"cell_type": "markdown",
"id": "c62191af",
"metadata": {},
"source": [
"# Hyperparameter tuning with XGBoostTrainer\n",
"In this example, we will go through how you can use Ray AIR to run a distributed hyperparameter experiment to find optimal hyperparameters for an XGBoost model.\n",
"\n",
"What we'll cover:\n",
"- How to load data from an Sklearn example dataset\n",
"- How to initialize an XGBoost trainer\n",
"- How to define a search space for regular XGBoost parameters and for data preprocessors\n",
"- How to fetch the best obtained result from the tuning run\n",
"- How to fetch a dataframe to do further analysis on the results"
]
},
{
"cell_type": "markdown",
"id": "41abda7b",
"metadata": {},
"source": [
"We'll use the [Covertype dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_covtype.html#sklearn-datasets-fetch-covtype) provided from sklearn to train a multiclass classification task using XGBoost.\n",
"\n",
"In this dataset, we try to predict the forst cover type (e.g. \"lodgehole pine\") from cartographic variables, like the distance to the closest road, or the hillshade at different times of the day. The features are binary, discrete and continuous and thus well suited for a decision-tree based classification task.\n",
"\n",
"You can find more information about the dataset [on the dataset homepage](https://archive.ics.uci.edu/ml/datasets/Covertype).\n",
"\n",
"We will train XGBoost models on this dataset. Because model training performance can be influenced by hyperparameter choices, we will generate several different configurations and train them in parallel. Notably each of these trials will itself start a distributed training job to speed up training. All of this happens automatically within Ray AIR.\n",
"\n",
"First, let's make sure we have all dependencies installed:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "db506971",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33mWARNING: You are using pip version 21.3.1; however, version 22.0.4 is available.\r\n",
"You should consider upgrading via the '/Users/kai/.pyenv/versions/3.7.7/bin/python3.7 -m pip install --upgrade pip' command.\u001b[0m\r\n"
]
}
],
"source": [
"!pip install -q \"ray[all]\" sklearn"
]
},
{
"cell_type": "markdown",
"id": "ec82829d",
"metadata": {},
"source": [
"Then we can start with some imports."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "d45b4f69",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.datasets import fetch_covtype\n",
"\n",
"import ray\n",
"from ray import tune\n",
"from ray.air import RunConfig, ScalingConfig\n",
"from ray.train.xgboost import XGBoostTrainer\n",
"from ray.tune.tune_config import TuneConfig\n",
"from ray.tune.tuner import Tuner"
]
},
{
"cell_type": "markdown",
"id": "a93b242c",
"metadata": {},
"source": [
"We'll define a utility function to create a Ray Dataset from the Sklearn dataset. We expect the target column to be in the dataframe, so we'll add it to the dataframe manually."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "3875df98",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-05-13 12:31:51,444\tINFO services.py:1484 -- View the Ray dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265\u001b[39m\u001b[22m\n"
]
}
],
"source": [
"def get_training_data() -> ray.data.Dataset:\n",
" data_raw = fetch_covtype()\n",
" df = pd.DataFrame(data_raw[\"data\"], columns=data_raw[\"feature_names\"])\n",
" df[\"target\"] = data_raw[\"target\"]\n",
" return ray.data.from_pandas(df)\n",
"\n",
"\n",
"train_dataset = get_training_data()"
]
},
{
"cell_type": "markdown",
"id": "90dd6fc5",
"metadata": {},
"source": [
"Let's take a look at the schema here:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "936c9b26",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset(num_blocks=1, num_rows=581012, schema={Elevation: float64, Aspect: float64, Slope: float64, Horizontal_Distance_To_Hydrology: float64, Vertical_Distance_To_Hydrology: float64, Horizontal_Distance_To_Roadways: float64, Hillshade_9am: float64, Hillshade_Noon: float64, Hillshade_3pm: float64, Horizontal_Distance_To_Fire_Points: float64, Wilderness_Area_0: float64, Wilderness_Area_1: float64, Wilderness_Area_2: float64, Wilderness_Area_3: float64, Soil_Type_0: float64, Soil_Type_1: float64, Soil_Type_2: float64, Soil_Type_3: float64, Soil_Type_4: float64, Soil_Type_5: float64, Soil_Type_6: float64, Soil_Type_7: float64, Soil_Type_8: float64, Soil_Type_9: float64, Soil_Type_10: float64, Soil_Type_11: float64, Soil_Type_12: float64, Soil_Type_13: float64, Soil_Type_14: float64, Soil_Type_15: float64, Soil_Type_16: float64, Soil_Type_17: float64, Soil_Type_18: float64, Soil_Type_19: float64, Soil_Type_20: float64, Soil_Type_21: float64, Soil_Type_22: float64, Soil_Type_23: float64, Soil_Type_24: float64, Soil_Type_25: float64, Soil_Type_26: float64, Soil_Type_27: float64, Soil_Type_28: float64, Soil_Type_29: float64, Soil_Type_30: float64, Soil_Type_31: float64, Soil_Type_32: float64, Soil_Type_33: float64, Soil_Type_34: float64, Soil_Type_35: float64, Soil_Type_36: float64, Soil_Type_37: float64, Soil_Type_38: float64, Soil_Type_39: float64, target: int32})\n"
]
}
],
"source": [
"print(train_dataset)"
]
},
{
"cell_type": "markdown",
"id": "282a6078",
"metadata": {},
"source": [
"Since we'll be training a multiclass prediction model, we have to pass some information to XGBoost. For instance, XGBoost expects us to provide the number of classes, and multiclass-enabled evaluation metrices.\n",
"\n",
"For a good overview of commonly used hyperparameters, see [our tutorial in the docs](https://docs.ray.io/en/latest/tune/examples/tune-xgboost.html#xgboost-hyperparameters)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "60fdd48d",
"metadata": {},
"outputs": [],
"source": [
"# XGBoost specific params\n",
"params = {\n",
" \"tree_method\": \"approx\",\n",
" \"objective\": \"multi:softmax\",\n",
" \"eval_metric\": [\"mlogloss\", \"merror\"],\n",
" \"num_class\": 8,\n",
" \"min_child_weight\": 2\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "228ae052",
"metadata": {},
"source": [
"With these parameters in place, we'll create a Ray AIR `XGBoostTrainer`.\n",
"\n",
"Note a few things here. First, we pass in a `scaling_config` to configure the distributed training behavior of each individual XGBoost training job. Here, we want to distribute training across 2 workers.\n",
"\n",
"The `label_column` specifies which columns in the dataset contains the target values. `params` are the XGBoost training params defined above - we can tune these later! The `datasets` dict contains the dataset we would like to train on. Lastly, we pass the number of boosting rounds to XGBoost."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "bbece53a",
"metadata": {},
"outputs": [],
"source": [
"trainer = XGBoostTrainer(\n",
" scaling_config=ScalingConfig(num_workers=2),\n",
" label_column=\"target\",\n",
" params=params,\n",
" datasets={\"train\": train_dataset},\n",
" num_boost_round=10,\n",
")"
]
},
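{
"cell_type": "markdown",
"id": "f8a6b4c2",
"metadata": {},
"source": [
"Note that the trainer can be used on its own: calling `trainer.fit()` launches a single distributed training job with the default `params`. A minimal sketch (we skip this here and go straight to tuning):\n",
"\n",
"```python\n",
"# Runs one distributed XGBoost training job and returns a Result object.\n",
"result = trainer.fit()\n",
"print(result.metrics[\"train-mlogloss\"])\n",
"```"
]
},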
{
"cell_type": "markdown",
"id": "e436035b",
"metadata": {},
"source": [
"We can now create the Tuner with a search space to override some of the default parameters in the XGBoost trainer.\n",
"\n",
"Here, we just want to the XGBoost `max_depth` and `min_child_weights` parameters. Note that we specifically specified `min_child_weight=2` in the default XGBoost trainer - this value will be overwritten during tuning.\n",
"\n",
"We configure Tune to minimize the `train-mlogloss` metric. In random search, this doesn't affect the evaluated configurations, but it will affect our default results fetching for analysis later.\n",
"\n",
"By the way, the name `train-mlogloss` is provided by the XGBoost library - `train` is the name of the dataset and `mlogloss` is the metric we passed in the XGBoost `params` above. Trainables can report any number of results (in this case we report 2), but most search algorithms only act on one of them - here we chose the `mlogloss`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "c5d00628",
"metadata": {},
"outputs": [],
"source": [
"tuner = Tuner(\n",
" trainer,\n",
" run_config=RunConfig(verbose=1),\n",
" param_space={\n",
" \"params\": {\n",
" \"max_depth\": tune.randint(2, 8), \n",
" \"min_child_weight\": tune.randint(1, 10), \n",
" },\n",
" },\n",
" tune_config=TuneConfig(num_samples=8, metric=\"train-mlogloss\", mode=\"min\"),\n",
")"
]
},
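{
"cell_type": "markdown",
"id": "b3f1a2c4",
"metadata": {},
"source": [
"The search space is not limited to integer ranges. If you wanted to tune more parameters, e.g. the learning rate `eta`, you could extend the `param_space` - a minimal sketch of a hypothetical, richer search space (not used in this run):\n",
"\n",
"```python\n",
"param_space = {\n",
"    \"params\": {\n",
"        \"max_depth\": tune.randint(2, 8),\n",
"        \"min_child_weight\": tune.randint(1, 10),\n",
"        # Sample the learning rate on a log scale:\n",
"        \"eta\": tune.loguniform(1e-3, 3e-1),\n",
"    },\n",
"}\n",
"```"
]
},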
{
"cell_type": "markdown",
"id": "7ba9cf69",
"metadata": {},
"source": [
"Let's run the tuning. This will take a few minutes to complete."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "ab642705",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"== Status ==<br>Current time: 2022-05-13 12:35:33 (running for 00:03:37.49)<br>Memory usage on this node: 10.0/16.0 GiB<br>Using FIFO scheduling algorithm.<br>Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/6.73 GiB heap, 0.0/2.0 GiB objects<br>Current best trial: 4ab2f_00007 with train-mlogloss=0.560217 and parameters={'params': {'max_depth': 7, 'min_child_weight': 4}}<br>Result logdir: /Users/kai/ray_results/XGBoostTrainer_2022-05-13_12-31-55<br>Number of trials: 8/8 (8 TERMINATED)<br><br>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[2m\u001b[36m(GBDTTrainable pid=62456)\u001b[0m UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62456)\u001b[0m 2022-05-13 12:32:02,793\tINFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62464)\u001b[0m UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62463)\u001b[0m UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62465)\u001b[0m UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62466)\u001b[0m UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62463)\u001b[0m 2022-05-13 12:32:05,102\tINFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62466)\u001b[0m 2022-05-13 12:32:05,204\tINFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62464)\u001b[0m 2022-05-13 12:32:05,338\tINFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62465)\u001b[0m 2022-05-13 12:32:07,164\tINFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62456)\u001b[0m 2022-05-13 12:32:10,549\tINFO main.py:1025 -- [RayXGBoost] Starting XGBoost training.\n",
"\u001b[2m\u001b[36m(_RemoteRayXGBoostActor pid=62495)\u001b[0m [12:32:10] task [xgboost.ray]:6975277392 got new rank 1\n",
"\u001b[2m\u001b[36m(_RemoteRayXGBoostActor pid=62494)\u001b[0m [12:32:10] task [xgboost.ray]:4560390352 got new rank 0\n",
"\u001b[2m\u001b[36m(raylet)\u001b[0m Spilled 2173 MiB, 22 objects, write throughput 402 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62463)\u001b[0m 2022-05-13 12:32:17,848\tINFO main.py:1025 -- [RayXGBoost] Starting XGBoost training.\n",
"\u001b[2m\u001b[36m(_RemoteRayXGBoostActor pid=62523)\u001b[0m [12:32:18] task [xgboost.ray]:4441524624 got new rank 0\n",
"\u001b[2m\u001b[36m(_RemoteRayXGBoostActor pid=62524)\u001b[0m [12:32:18] task [xgboost.ray]:6890641808 got new rank 1\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62465)\u001b[0m 2022-05-13 12:32:21,253\tINFO main.py:1025 -- [RayXGBoost] Starting XGBoost training.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62466)\u001b[0m 2022-05-13 12:32:21,529\tINFO main.py:1025 -- [RayXGBoost] Starting XGBoost training.\n",
"\u001b[2m\u001b[36m(_RemoteRayXGBoostActor pid=62563)\u001b[0m [12:32:21] task [xgboost.ray]:4667801680 got new rank 1\n",
"\u001b[2m\u001b[36m(_RemoteRayXGBoostActor pid=62562)\u001b[0m [12:32:21] task [xgboost.ray]:6856360848 got new rank 0\n",
"\u001b[2m\u001b[36m(_RemoteRayXGBoostActor pid=62530)\u001b[0m [12:32:21] task [xgboost.ray]:6971527824 got new rank 0\n",
"\u001b[2m\u001b[36m(_RemoteRayXGBoostActor pid=62532)\u001b[0m [12:32:21] task [xgboost.ray]:4538321232 got new rank 1\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62464)\u001b[0m 2022-05-13 12:32:21,937\tINFO main.py:1025 -- [RayXGBoost] Starting XGBoost training.\n",
"\u001b[2m\u001b[36m(_RemoteRayXGBoostActor pid=62544)\u001b[0m [12:32:21] task [xgboost.ray]:7005661840 got new rank 1\n",
"\u001b[2m\u001b[36m(_RemoteRayXGBoostActor pid=62543)\u001b[0m [12:32:21] task [xgboost.ray]:4516088080 got new rank 0\n",
"\u001b[2m\u001b[36m(raylet)\u001b[0m Spilled 4098 MiB, 83 objects, write throughput 347 MiB/s.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62456)\u001b[0m 2022-05-13 12:32:41,289\tINFO main.py:1109 -- Training in progress (31 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62463)\u001b[0m 2022-05-13 12:32:48,617\tINFO main.py:1109 -- Training in progress (31 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62465)\u001b[0m 2022-05-13 12:32:52,110\tINFO main.py:1109 -- Training in progress (31 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62466)\u001b[0m 2022-05-13 12:32:52,448\tINFO main.py:1109 -- Training in progress (31 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62464)\u001b[0m 2022-05-13 12:32:52,692\tINFO main.py:1109 -- Training in progress (31 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62456)\u001b[0m 2022-05-13 12:33:11,960\tINFO main.py:1109 -- Training in progress (61 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62463)\u001b[0m 2022-05-13 12:33:19,076\tINFO main.py:1109 -- Training in progress (61 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62464)\u001b[0m 2022-05-13 12:33:23,409\tINFO main.py:1109 -- Training in progress (61 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62465)\u001b[0m 2022-05-13 12:33:23,420\tINFO main.py:1109 -- Training in progress (62 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62466)\u001b[0m 2022-05-13 12:33:23,541\tINFO main.py:1109 -- Training in progress (62 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62463)\u001b[0m 2022-05-13 12:33:23,693\tINFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=581,012 in 78.74 seconds (65.79 pure XGBoost training time).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62464)\u001b[0m 2022-05-13 12:33:24,802\tINFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=581,012 in 79.62 seconds (62.85 pure XGBoost training time).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62648)\u001b[0m UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62651)\u001b[0m UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62648)\u001b[0m 2022-05-13 12:33:38,788\tINFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62651)\u001b[0m 2022-05-13 12:33:38,766\tINFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62456)\u001b[0m 2022-05-13 12:33:42,168\tINFO main.py:1109 -- Training in progress (92 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62456)\u001b[0m 2022-05-13 12:33:46,177\tINFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=581,012 in 103.54 seconds (95.60 pure XGBoost training time).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62651)\u001b[0m 2022-05-13 12:33:51,825\tINFO main.py:1025 -- [RayXGBoost] Starting XGBoost training.\n",
"\u001b[2m\u001b[36m(_RemoteRayXGBoostActor pid=62670)\u001b[0m [12:33:51] task [xgboost.ray]:4623186960 got new rank 1\n",
"\u001b[2m\u001b[36m(_RemoteRayXGBoostActor pid=62669)\u001b[0m [12:33:51] task [xgboost.ray]:4707639376 got new rank 0\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62648)\u001b[0m 2022-05-13 12:33:52,036\tINFO main.py:1025 -- [RayXGBoost] Starting XGBoost training.\n",
"\u001b[2m\u001b[36m(_RemoteRayXGBoostActor pid=62672)\u001b[0m [12:33:52] task [xgboost.ray]:4530073552 got new rank 1\n",
"\u001b[2m\u001b[36m(_RemoteRayXGBoostActor pid=62671)\u001b[0m [12:33:52] task [xgboost.ray]:6824757200 got new rank 0\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62466)\u001b[0m 2022-05-13 12:33:54,229\tINFO main.py:1109 -- Training in progress (92 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62465)\u001b[0m 2022-05-13 12:33:54,355\tINFO main.py:1109 -- Training in progress (93 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62730)\u001b[0m UserWarning: Dataset 'train' has 1 blocks, which is less than the `num_workers` 2. This dataset will be automatically repartitioned to 2 blocks.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62730)\u001b[0m 2022-05-13 12:34:04,708\tINFO main.py:980 -- [RayXGBoost] Created 2 new actors (2 total actors). Waiting until actors are ready for training.\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62466)\u001b[0m 2022-05-13 12:34:11,126\tINFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=581,012 in 126.08 seconds (109.48 pure XGBoost training time).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62730)\u001b[0m 2022-05-13 12:34:15,175\tINFO main.py:1025 -- [RayXGBoost] Starting XGBoost training.\n",
"\u001b[2m\u001b[36m(_RemoteRayXGBoostActor pid=62753)\u001b[0m [12:34:15] task [xgboost.ray]:4468564048 got new rank 1\n",
"\u001b[2m\u001b[36m(_RemoteRayXGBoostActor pid=62752)\u001b[0m [12:34:15] task [xgboost.ray]:6799468304 got new rank 0\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[2m\u001b[36m(GBDTTrainable pid=62648)\u001b[0m 2022-05-13 12:34:22,167\tINFO main.py:1109 -- Training in progress (30 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62651)\u001b[0m 2022-05-13 12:34:22,147\tINFO main.py:1109 -- Training in progress (30 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62465)\u001b[0m 2022-05-13 12:34:24,646\tINFO main.py:1109 -- Training in progress (123 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62465)\u001b[0m 2022-05-13 12:34:24,745\tINFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=581,012 in 137.75 seconds (123.36 pure XGBoost training time).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62651)\u001b[0m 2022-05-13 12:34:40,173\tINFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=581,012 in 61.63 seconds (48.34 pure XGBoost training time).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62730)\u001b[0m 2022-05-13 12:34:45,745\tINFO main.py:1109 -- Training in progress (31 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62648)\u001b[0m 2022-05-13 12:34:52,543\tINFO main.py:1109 -- Training in progress (60 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62648)\u001b[0m 2022-05-13 12:35:14,888\tINFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=581,012 in 96.35 seconds (82.83 pure XGBoost training time).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62730)\u001b[0m 2022-05-13 12:35:16,197\tINFO main.py:1109 -- Training in progress (61 seconds since last restart).\n",
"\u001b[2m\u001b[36m(GBDTTrainable pid=62730)\u001b[0m 2022-05-13 12:35:33,441\tINFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=581,012 in 88.89 seconds (78.26 pure XGBoost training time).\n",
"2022-05-13 12:35:33,610\tINFO tune.py:753 -- Total run time: 218.52 seconds (217.48 seconds for the tuning loop).\n"
]
}
],
"source": [
"results = tuner.fit()"
]
},
{
"cell_type": "markdown",
"id": "2c7c444e",
"metadata": {},
"source": [
"Now that we obtained the results, we can analyze them. For instance, we can fetch the best observed result according to the configured `metric` and `mode` and print it:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "4f4e5187",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Best result error rate 0.196929\n"
]
}
],
"source": [
"# This will fetch the best result according to the `metric` and `mode` specified\n",
"# in the `TuneConfig` above:\n",
"\n",
"best_result = results.get_best_result()\n",
"\n",
"print(\"Best result error rate\", best_result.metrics[\"train-merror\"])"
]
},
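{
"cell_type": "markdown",
"id": "a7c9e1f3",
"metadata": {},
"source": [
"Besides the metrics, the returned `Result` object also carries the trial's configuration, so we can check which hyperparameters produced this score:\n",
"\n",
"```python\n",
"# The hyperparameters Tune sampled for the best trial:\n",
"print(\"Best config\", best_result.config[\"params\"])\n",
"\n",
"# The metric we optimized for:\n",
"print(\"Best mlogloss\", best_result.metrics[\"train-mlogloss\"])\n",
"```"
]
},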
{
"cell_type": "markdown",
"id": "53d71b08",
"metadata": {},
"source": [
"For more sophisticated analysis, we can get a pandas dataframe with all trial results:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "50e76e91",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['train-mlogloss', 'train-merror', 'time_this_iter_s',\n",
" 'should_checkpoint', 'done', 'timesteps_total', 'episodes_total',\n",
" 'training_iteration', 'trial_id', 'experiment_id', 'date', 'timestamp',\n",
" 'time_total_s', 'pid', 'hostname', 'node_ip', 'time_since_restore',\n",
" 'timesteps_since_restore', 'iterations_since_restore', 'warmup_time',\n",
" 'config/params/max_depth', 'config/params/min_child_weight', 'logdir'],\n",
" dtype='object')\n"
]
}
],
"source": [
"df = results.get_dataframe()\n",
"print(df.columns)"
]
},
{
"cell_type": "markdown",
"id": "b8a05668",
"metadata": {},
"source": [
"As an example, let's group the results per `min_child_weight` parameter and fetch the minimal obtained values:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "94b017b4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Min child weight 1 error 0.262468 logloss 0.69843\n",
"Min child weight 2 error 0.311035 logloss 0.79498\n",
"Min child weight 3 error 0.240916 logloss 0.651457\n",
"Min child weight 4 error 0.196929 logloss 0.560217\n",
"Min child weight 6 error 0.219665 logloss 0.608005\n",
"Min child weight 7 error 0.311035 logloss 0.794983\n",
"Min child weight 8 error 0.311035 logloss 0.794983\n"
]
}
],
"source": [
"groups = df.groupby(\"config/params/min_child_weight\")\n",
"mins = groups.min()\n",
"\n",
"for min_child_weight, row in mins.iterrows():\n",
" print(\"Min child weight\", min_child_weight, \"error\", row[\"train-merror\"], \"logloss\", row[\"train-mlogloss\"])\n"
]
},
{
"cell_type": "markdown",
"id": "8e135ee9",
"metadata": {},
"source": [
"As you can see in our example run, the min child weight of `2` showed the best prediction accuracy with `0.196929`. That's the same as `results.get_best_result()` gave us!"
]
},
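{
"cell_type": "markdown",
"id": "d2e4f6a8",
"metadata": {},
"source": [
"If you prefer a visual overview, you can plot the trial results directly from the dataframe - a minimal sketch, assuming `matplotlib` is installed:\n",
"\n",
"```python\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Scatter the final error rate of each trial against the sampled parameter.\n",
"fig, ax = plt.subplots()\n",
"ax.scatter(df[\"config/params/min_child_weight\"], df[\"train-merror\"])\n",
"ax.set_xlabel(\"min_child_weight\")\n",
"ax.set_ylabel(\"train-merror\")\n",
"plt.show()\n",
"```"
]
},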
{
"cell_type": "markdown",
"id": "2f478e4c",
"metadata": {},
"source": [
"The `results.get_dataframe()` returns the last reported results per trial. If you want to obtain the best _ever_ observed results, you can pass the `filter_metric` and `filter_mode` arguments to `results.get_dataframe()`. In our example, we'll filter the minimum _ever_ observed `train-merror` for each trial:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "afa83cf6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.262468\n",
"1 0.310307\n",
"2 0.310307\n",
"3 0.219665\n",
"4 0.240916\n",
"5 0.220801\n",
"6 0.310307\n",
"7 0.196929\n",
"Name: train-merror, dtype: float64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_min_error = results.get_dataframe(filter_metric=\"train-merror\", filter_mode=\"min\")\n",
"df_min_error[\"train-merror\"]"
]
},
{
"cell_type": "markdown",
"id": "3f4525cb",
"metadata": {},
"source": [
"The best ever observed `train-merror` is `0.196929`, the same as the minimum error in our grouped results. This is expected, as the classification error in XGBoost usually goes down over time - meaning our last results are usually the best results."
]
},
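{
"cell_type": "markdown",
"id": "e9b7c5d3",
"metadata": {},
"source": [
"The best result also references the trial's latest checkpoint via `best_result.checkpoint`. A minimal sketch of loading it for inference, assuming the `XGBoostPredictor` API of this Ray version:\n",
"\n",
"```python\n",
"from ray.train.xgboost import XGBoostPredictor\n",
"\n",
"# Load the trained booster from the best trial's checkpoint.\n",
"predictor = XGBoostPredictor.from_checkpoint(best_result.checkpoint)\n",
"```"
]
},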
{
"cell_type": "markdown",
"id": "47236b0a",
"metadata": {},
"source": [
"And that's how you analyze your hyperparameter tuning results. If you would like to have access to more analytics, please feel free to file a feature request e.g. [as a Github issue](https://github.com/ray-project/ray/issues) or on our [Discuss platform](https://discuss.ray.io/)!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}