{
"cells": [
{
"cell_type": "markdown",
"id": "0d385409",
"metadata": {},
"source": [
"# Training a model with distributed LightGBM\n",
"In this example we will train a model in Ray Air using distributed LightGBM."
]
},
{
"cell_type": "markdown",
"id": "07d92cee",
"metadata": {},
"source": [
"Let's start with installing our dependencies:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "86131abe",
"metadata": {},
"outputs": [],
"source": [
"!pip install -qU \"ray[tune]\" lightgbm_ray"
]
},
{
"cell_type": "markdown",
"id": "135fc884",
"metadata": {},
"source": [
"Then we need some imports:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "102ef1ac",
"metadata": {},
"outputs": [],
"source": [
"import argparse\n",
"import math\n",
"from typing import Tuple\n",
"\n",
"import pandas as pd\n",
"\n",
"import ray\n",
"from ray.air.batch_predictor import BatchPredictor\n",
"from ray.air.predictors.integrations.lightgbm import LightGBMPredictor\n",
"from ray.air.preprocessors.chain import Chain\n",
"from ray.air.preprocessors.encoder import Categorizer\n",
"from ray.air.train.integrations.lightgbm import LightGBMTrainer\n",
"from ray.data.dataset import Dataset\n",
"from ray.air.result import Result\n",
"from ray.air.preprocessors import StandardScaler\n",
"from sklearn.datasets import load_breast_cancer\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "markdown",
"id": "c7d102bd",
"metadata": {},
"source": [
"Next we define a function to load our train, validation, and test datasets."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f1f35cd7",
"metadata": {},
"outputs": [],
"source": [
"def prepare_data() -> Tuple[Dataset, Dataset, Dataset]:\n",
" data_raw = load_breast_cancer()\n",
" dataset_df = pd.DataFrame(data_raw[\"data\"], columns=data_raw[\"feature_names\"])\n",
" dataset_df[\"target\"] = data_raw[\"target\"]\n",
" # add a random categorical column\n",
" num_samples = len(dataset_df)\n",
" dataset_df[\"categorical_column\"] = pd.Series(\n",
" ([\"A\", \"B\"] * math.ceil(num_samples / 2))[:num_samples]\n",
" )\n",
" train_df, test_df = train_test_split(dataset_df, test_size=0.3)\n",
" train_dataset = ray.data.from_pandas(train_df)\n",
" valid_dataset = ray.data.from_pandas(test_df)\n",
" test_dataset = ray.data.from_pandas(test_df.drop(\"target\", axis=1))\n",
" return train_dataset, valid_dataset, test_dataset"
]
},
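{
"cell_type": "markdown",
"id": "7c3a9f21",
"metadata": {},
"source": [
"(Optional) As a quick sanity check, we can materialize the datasets and confirm that the row counts and schema look as expected. This is a minimal sketch that simply calls `prepare_data()` again:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e5b2d47",
"metadata": {},
"outputs": [],
"source": [
"# Quick sanity check: row counts and schema of the prepared datasets.\n",
"check_train, check_valid, check_test = prepare_data()\n",
"print(check_train.count(), check_valid.count(), check_test.count())\n",
"print(check_train.schema())"
]
},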
{
"cell_type": "markdown",
"id": "8f7afbce",
"metadata": {},
"source": [
"The following function will create a LightGBM trainer, train it, and return the result."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "fefcbc8a",
"metadata": {},
"outputs": [],
"source": [
"def train_lightgbm(num_workers: int, use_gpu: bool = False) -> Result:\n",
" train_dataset, valid_dataset, _ = prepare_data()\n",
"\n",
" # Scale some random columns, and categorify the categorical_column,\n",
" # allowing LightGBM to use its built-in categorical feature support\n",
" columns_to_scale = [\"mean radius\", \"mean texture\"]\n",
" preprocessor = Chain(\n",
" Categorizer([\"categorical_column\"]), StandardScaler(columns=columns_to_scale)\n",
" )\n",
"\n",
" # LightGBM specific params\n",
" params = {\n",
" \"objective\": \"binary\",\n",
" \"metric\": [\"binary_logloss\", \"binary_error\"],\n",
" }\n",
"\n",
" trainer = LightGBMTrainer(\n",
" scaling_config={\n",
" \"num_workers\": num_workers,\n",
" \"use_gpu\": use_gpu,\n",
" },\n",
" label_column=\"target\",\n",
" params=params,\n",
" datasets={\"train\": train_dataset, \"valid\": valid_dataset},\n",
" preprocessor=preprocessor,\n",
" num_boost_round=100,\n",
" )\n",
" result = trainer.fit()\n",
" print(result.metrics)\n",
"\n",
" return result"
]
},
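{
"cell_type": "markdown",
"id": "5d8c4e10",
"metadata": {},
"source": [
"To see what the preprocessing does in isolation, we can also fit and apply the same chain outside of the trainer. This is a minimal sketch, assuming the `fit_transform` API of Ray AIR preprocessors:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f7a6b58",
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: fit the same preprocessing chain as in train_lightgbm on\n",
"# the train set, then inspect one transformed row.\n",
"sketch_train_dataset, _, _ = prepare_data()\n",
"sketch_preprocessor = Chain(\n",
"    Categorizer([\"categorical_column\"]),\n",
"    StandardScaler(columns=[\"mean radius\", \"mean texture\"]),\n",
")\n",
"sketch_preprocessor.fit_transform(sketch_train_dataset).show(1)"
]
},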
{
"cell_type": "markdown",
"id": "04d278ae",
"metadata": {},
"source": [
"Once we have the result, we can do batch inference on the obtained model. Let's define a utility function for this."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "3f1d0c19",
"metadata": {},
"outputs": [],
"source": [
"def predict_lightgbm(result: Result):\n",
" _, _, test_dataset = prepare_data()\n",
" batch_predictor = BatchPredictor.from_checkpoint(\n",
" result.checkpoint, LightGBMPredictor\n",
" )\n",
"\n",
" predicted_labels = (\n",
" batch_predictor.predict(test_dataset)\n",
" .map_batches(lambda df: (df > 0.5).astype(int), batch_format=\"pandas\")\n",
" .to_pandas(limit=float(\"inf\"))\n",
" )\n",
" print(f\"PREDICTED LABELS\\n{predicted_labels}\")\n",
"\n",
" shap_values = batch_predictor.predict(test_dataset, pred_contrib=True).to_pandas(\n",
" limit=float(\"inf\")\n",
" )\n",
" print(f\"SHAP VALUES\\n{shap_values}\")"
]
},
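{
"cell_type": "markdown",
"id": "8b1d3c72",
"metadata": {},
"source": [
"For quick experiments on small, in-memory data, a `LightGBMPredictor` can also be used directly instead of going through a `BatchPredictor`. The following helper is a minimal sketch; like `predict_lightgbm`, it expects the `Result` of a completed training run:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a9e7f36",
"metadata": {},
"outputs": [],
"source": [
"def predict_lightgbm_local(result: Result):\n",
"    # Minimal sketch: non-distributed inference on an in-memory DataFrame.\n",
"    # The checkpoint restores both the model and the fitted preprocessor.\n",
"    _, _, test_dataset = prepare_data()\n",
"    predictor = LightGBMPredictor.from_checkpoint(result.checkpoint)\n",
"    small_batch = test_dataset.to_pandas(limit=5)\n",
"    print(predictor.predict(small_batch))"
]
},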
{
"cell_type": "markdown",
"id": "2bb0e5df",
"metadata": {},
"source": [
"Now we can run the training:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "8244ff3c",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-05-19 11:18:27,652\tINFO services.py:1483 -- View the Ray dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265\u001b[39m\u001b[22m\n"
]
},
{
"data": {
"text/html": [
"== Status ==
Current time: 2022-05-19 11:18:47 (running for 00:00:15.19)
Memory usage on this node: 10.2/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/4.86 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/kai/ray_results/LightGBMTrainer_2022-05-19_11-18-30
Number of trials: 1/1 (1 TERMINATED)
Trial name | status | loc | iter | total time (s) | train-binary_logloss | train-binary_error | valid-binary_logloss |
---|---|---|---|---|---|---|---|
LightGBMTrainer_07bf3_00000 | TERMINATED | 127.0.0.1:9219 | 100 | 10.4622 | 0.000197893 | 0 | 0.289033 |