{
"cells": [
{
"cell_type": "markdown",
"id": "0d385409",
"metadata": {},
"source": [
"# Training a model with distributed LightGBM\n",
"In this example we will train a model in Ray AIR using distributed LightGBM."
]
},
{
"cell_type": "markdown",
"id": "07d92cee",
"metadata": {},
"source": [
"Let's start with installing our dependencies:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "86131abe",
"metadata": {},
"outputs": [],
"source": [
"!pip install -qU \"ray[tune]\" lightgbm_ray"
]
},
{
"cell_type": "markdown",
"id": "135fc884",
"metadata": {},
"source": [
"Then we need some imports:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "102ef1ac",
"metadata": {},
"outputs": [],
"source": [
"from typing import Tuple\n",
"\n",
"import ray\n",
"from ray.train.batch_predictor import BatchPredictor\n",
"from ray.train.lightgbm import LightGBMPredictor\n",
"from ray.data.preprocessors.chain import Chain\n",
"from ray.data.preprocessors.encoder import Categorizer\n",
"from ray.train.lightgbm import LightGBMTrainer\n",
"from ray.data.dataset import Dataset\n",
"from ray.air.result import Result\n",
"from ray.air.util.datasets import train_test_split\n",
"from ray.data.preprocessors import StandardScaler"
]
},
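{
"cell_type": "markdown",
"id": "a0c2d3e1",
"metadata": {},
"source": [
"Ray is started automatically on first use, but if you are working with an existing multi-node cluster you may want to connect to it explicitly. A minimal sketch (`ray.init` and `ray.is_initialized` are standard Ray APIs; whether you pass an `address` depends on your setup):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1d3e4f2",
"metadata": {},
"outputs": [],
"source": [
"# Optional: start (or connect to) Ray explicitly. With no arguments,\n",
"# ray.init() starts a local single-node cluster; pass address=\"auto\"\n",
"# to attach to an already running cluster instead.\n",
"if not ray.is_initialized():\n",
"    ray.init()"
]
},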
{
"cell_type": "markdown",
"id": "c7d102bd",
"metadata": {},
"source": [
"Next we define a function to load our train, validation, and test datasets."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f1f35cd7",
"metadata": {},
"outputs": [],
"source": [
"def prepare_data() -> Tuple[Dataset, Dataset, Dataset]:\n",
" import pandas as pd\n",
" df = pd.read_csv(\"https://air-example-data.s3.us-east-2.amazonaws.com/breast_cancer_with_categorical.csv\")\n",
" dataset = ray.data.from_pandas(df)\n",
" # Optionally, read directly from s3\n",
" # dataset = ray.data.read_csv(\"s3://air-example-data/breast_cancer_with_categorical.csv\")\n",
" train_dataset, valid_dataset = train_test_split(dataset, test_size=0.3)\n",
" test_dataset = valid_dataset.map_batches(lambda df: df.drop(\"target\", axis=1), batch_format=\"pandas\")\n",
" return train_dataset, valid_dataset, test_dataset"
]
},
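{
"cell_type": "markdown",
"id": "c2e4f5a3",
"metadata": {},
"source": [
"As a quick, purely illustrative sanity check (not part of the original example), we can materialize the splits and confirm that the 70/30 split behaves as expected:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d3f5a6b4",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: count rows in each split. With test_size=0.3, the train split\n",
"# should hold roughly 70% of the rows; the test split mirrors the\n",
"# validation split with the \"target\" column dropped.\n",
"train_ds, valid_ds, test_ds = prepare_data()\n",
"print(\"train rows:\", train_ds.count())\n",
"print(\"valid rows:\", valid_ds.count())"
]
},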
{
"cell_type": "markdown",
"id": "8f7afbce",
"metadata": {},
"source": [
"The following function will create a LightGBM trainer, train it, and return the result."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "fefcbc8a",
"metadata": {},
"outputs": [],
"source": [
"def train_lightgbm(num_workers: int, use_gpu: bool = False) -> Result:\n",
" train_dataset, valid_dataset, _ = prepare_data()\n",
"\n",
" # Scale some random columns, and categorify the categorical_column,\n",
" # allowing LightGBM to use its built-in categorical feature support\n",
" columns_to_scale = [\"mean radius\", \"mean texture\"]\n",
" preprocessor = Chain(\n",
" Categorizer([\"categorical_column\"]), StandardScaler(columns=columns_to_scale)\n",
" )\n",
"\n",
" # LightGBM specific params\n",
" params = {\n",
" \"objective\": \"binary\",\n",
" \"metric\": [\"binary_logloss\", \"binary_error\"],\n",
" }\n",
"\n",
" trainer = LightGBMTrainer(\n",
" scaling_config={\n",
" \"num_workers\": num_workers,\n",
" \"use_gpu\": use_gpu,\n",
" },\n",
" label_column=\"target\",\n",
" params=params,\n",
" datasets={\"train\": train_dataset, \"valid\": valid_dataset},\n",
" preprocessor=preprocessor,\n",
" num_boost_round=100,\n",
" )\n",
" result = trainer.fit()\n",
" print(result.metrics)\n",
"\n",
" return result"
]
},
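{
"cell_type": "markdown",
"id": "e4a6b7c5",
"metadata": {},
"source": [
"The `params` dictionary is passed through to LightGBM itself, so any core LightGBM parameter can be added there. A sketch of what a slightly more tuned configuration might look like (`learning_rate` and `num_leaves` are standard LightGBM options; the values below are illustrative, not tuned for this dataset):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f5b7c8d6",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: standard LightGBM parameters that could replace `params` in\n",
"# train_lightgbm above. Values are illustrative defaults, not tuned.\n",
"tuned_params = {\n",
"    \"objective\": \"binary\",\n",
"    \"metric\": [\"binary_logloss\", \"binary_error\"],\n",
"    \"learning_rate\": 0.05,  # lower = slower but often more robust learning\n",
"    \"num_leaves\": 31,  # caps the complexity of each individual tree\n",
"}"
]
},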
{
"cell_type": "markdown",
"id": "04d278ae",
"metadata": {},
"source": [
"Once we have the result, we can do batch inference on the obtained model. Let's define a utility function for this."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "3f1d0c19",
"metadata": {},
"outputs": [],
"source": [
"def predict_lightgbm(result: Result):\n",
" _, _, test_dataset = prepare_data()\n",
" batch_predictor = BatchPredictor.from_checkpoint(\n",
" result.checkpoint, LightGBMPredictor\n",
" )\n",
"\n",
" predicted_labels = (\n",
" batch_predictor.predict(test_dataset)\n",
" .map_batches(lambda df: (df > 0.5).astype(int), batch_format=\"pandas\")\n",
" )\n",
" print(f\"PREDICTED LABELS\")\n",
" predicted_labels.show()\n",
"\n",
" shap_values = batch_predictor.predict(test_dataset, pred_contrib=True)\n",
" print(f\"SHAP VALUES\")\n",
" shap_values.show()"
]
},
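{
"cell_type": "markdown",
"id": "a6c8d9e7",
"metadata": {},
"source": [
"If you also want a single accuracy number, the thresholded predictions can be compared against the `target` column of the validation split (which the test split was derived from, so row order matches). This is a hypothetical helper, not part of the original example, and it assumes the predictor writes its output to a `predictions` column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7d9e0f8",
"metadata": {},
"outputs": [],
"source": [
"# Sketch of a hypothetical evaluation helper. Assumes the prediction\n",
"# output column is named \"predictions\".\n",
"def score_accuracy(result: Result) -> float:\n",
"    _, valid_dataset, test_dataset = prepare_data()\n",
"    batch_predictor = BatchPredictor.from_checkpoint(\n",
"        result.checkpoint, LightGBMPredictor\n",
"    )\n",
"    preds = batch_predictor.predict(test_dataset).to_pandas()\n",
"    predicted = (preds[\"predictions\"] > 0.5).astype(int)\n",
"    labels = valid_dataset.to_pandas()[\"target\"]\n",
"    return float((predicted.values == labels.values).mean())"
]
},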
{
"cell_type": "markdown",
"id": "2bb0e5df",
"metadata": {},
"source": [
"Now we can run the training:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "8244ff3c",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-06-22 17:26:41,346\tWARNING read_api.py:260 -- The number of blocks in this dataset (1) limits its parallelism to 1 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks.\n",
"Map_Batches: 100%|██████████| 1/1 [00:00<00:00, 46.26it/s]\n"
]
},
{
"data": {
"text/html": [
"== Status ==
Current time: 2022-06-22 17:26:56 (running for 00:00:14.07)
Memory usage on this node: 10.0/31.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/13.32 GiB heap, 0.0/6.66 GiB objects
Result logdir: /home/ubuntu/ray_results/LightGBMTrainer_2022-06-22_17-26-41
Number of trials: 1/1 (1 TERMINATED)
Trial name | status | loc | iter | total time (s) | train-binary_logloss | train-binary_error | valid-binary_logloss |
---|---|---|---|---|---|---|---|
LightGBMTrainer_7b049_00000 | TERMINATED | 172.31.43.110:1491578 | 100 | 10.9726 | 0.000574522 | 0 | 0.171898 |