{
"cells": [
{
"cell_type": "markdown",
"id": "c3192ac4",
"metadata": {},
"source": [
"# Training a model with Sklearn\n",
"In this example we will train a model in Ray AIR using a Sklearn classifier."
]
},
{
"cell_type": "markdown",
"id": "5a4823bf",
"metadata": {},
"source": [
"Let's start with installing our dependencies:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "88f4bb39",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"!pip install -qU \"ray[tune]\" sklearn"
]
},
{
"cell_type": "markdown",
"id": "c049c692",
"metadata": {},
"source": [
"Then we need some imports:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c02eb5cd",
"metadata": {},
"outputs": [],
"source": [
"import argparse\n",
"import math\n",
"from typing import Tuple\n",
"\n",
"import pandas as pd\n",
"\n",
"import ray\n",
"from ray.data.dataset import Dataset\n",
"from ray.train.batch_predictor import BatchPredictor\n",
"from ray.train.sklearn import SklearnPredictor\n",
"from ray.data.preprocessors import Chain, OrdinalEncoder, StandardScaler\n",
"from ray.air.result import Result\n",
"from ray.train.sklearn import SklearnTrainer\n",
"\n",
"\n",
"from sklearn.datasets import load_breast_cancer\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"try:\n",
" from cuml.ensemble import RandomForestClassifier as cuMLRandomForestClassifier\n",
"except ImportError:\n",
" cuMLRandomForestClassifier = None"
]
},
{
"cell_type": "markdown",
"id": "52e017f1",
"metadata": {},
"source": [
"Next we define a function to load our train, validation, and test datasets."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "3631ed1e",
"metadata": {},
"outputs": [],
"source": [
"def prepare_data() -> Tuple[Dataset, Dataset, Dataset]:\n",
" data_raw = load_breast_cancer()\n",
" dataset_df = pd.DataFrame(data_raw[\"data\"], columns=data_raw[\"feature_names\"])\n",
" dataset_df[\"target\"] = data_raw[\"target\"]\n",
" # add a random categorical column\n",
" num_samples = len(dataset_df)\n",
" dataset_df[\"categorical_column\"] = pd.Series(\n",
" ([\"A\", \"B\"] * math.ceil(num_samples / 2))[:num_samples]\n",
" )\n",
" train_df, test_df = train_test_split(dataset_df, test_size=0.3)\n",
" train_dataset = ray.data.from_pandas(train_df)\n",
" valid_dataset = ray.data.from_pandas(test_df)\n",
" test_dataset = ray.data.from_pandas(test_df.drop(\"target\", axis=1))\n",
" return train_dataset, valid_dataset, test_dataset"
]
},
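{
"cell_type": "markdown",
"id": "2a1f6b8c",
"metadata": {},
"source": [
"As an optional sanity check (a small sketch; the `*_ds` names below are ours), we can materialize the splits and confirm that only the test set has the label column dropped:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e4d2f17",
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check on the splits produced by prepare_data().\n",
"train_ds, valid_ds, test_ds = prepare_data()\n",
"print(train_ds.count(), valid_ds.count(), test_ds.count())\n",
"# The test split should show the same columns minus \"target\".\n",
"print(test_ds.schema())"
]
},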
{
"cell_type": "markdown",
"id": "8d6c6d17",
"metadata": {},
"source": [
"The following function will create a Sklearn trainer, train it, and return the result."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "0fd39e42",
"metadata": {},
"outputs": [],
"source": [
"def train_sklearn(num_cpus: int, use_gpu: bool = False) -> Result:\n",
" if use_gpu and not cuMLRandomForestClassifier:\n",
" raise RuntimeError(\"cuML must be installed for GPU enabled sklearn estimators.\")\n",
"\n",
" train_dataset, valid_dataset, _ = prepare_data()\n",
"\n",
" # Scale some random columns\n",
" columns_to_scale = [\"mean radius\", \"mean texture\"]\n",
" preprocessor = Chain(\n",
" OrdinalEncoder([\"categorical_column\"]), StandardScaler(columns=columns_to_scale)\n",
" )\n",
"\n",
" if use_gpu:\n",
" trainer_resources = {\"CPU\": 1, \"GPU\": 1}\n",
" estimator = cuMLRandomForestClassifier()\n",
" else:\n",
" trainer_resources = {\"CPU\": num_cpus}\n",
" estimator = RandomForestClassifier()\n",
"\n",
" trainer = SklearnTrainer(\n",
" estimator=estimator,\n",
" label_column=\"target\",\n",
" datasets={\"train\": train_dataset, \"valid\": valid_dataset},\n",
" preprocessor=preprocessor,\n",
" cv=5,\n",
" scaling_config={\n",
" \"trainer_resources\": trainer_resources,\n",
" },\n",
" )\n",
" result = trainer.fit()\n",
" print(result.metrics)\n",
"\n",
" return result"
]
},
{
"cell_type": "markdown",
"id": "7a2efb9d",
"metadata": {},
"source": [
"Once we have the result, we can do batch inference on the obtained model. Let's define a utility function for this."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "59eeadd8",
"metadata": {},
"outputs": [],
"source": [
"def predict_sklearn(result: Result, use_gpu: bool = False):\n",
" _, _, test_dataset = prepare_data()\n",
"\n",
" batch_predictor = BatchPredictor.from_checkpoint(\n",
" result.checkpoint, SklearnPredictor\n",
" )\n",
"\n",
" predicted_labels = (\n",
" batch_predictor.predict(\n",
" test_dataset,\n",
" num_gpus_per_worker=int(use_gpu),\n",
" )\n",
" .map_batches(lambda df: (df > 0.5).astype(int), batch_format=\"pandas\")\n",
" .to_pandas(limit=float(\"inf\"))\n",
" )\n",
" print(f\"PREDICTED LABELS\\n{predicted_labels}\")"
]
},
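{
"cell_type": "markdown",
"id": "5c7e9a31",
"metadata": {},
"source": [
"To make the post-processing step concrete, here is what the thresholding lambda does to a toy batch of predictions (a standalone sketch in plain pandas; `toy_batch` is our own stand-in, not real predictor output):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8b3a4c56",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# A toy stand-in for one batch of predictor output: a single\n",
"# \"predictions\" column of scores. (df > 0.5).astype(int) maps each\n",
"# score to a hard 0/1 class label.\n",
"toy_batch = pd.DataFrame({\"predictions\": [0.1, 0.7, 0.5, 0.92]})\n",
"print((toy_batch > 0.5).astype(int))"
]
},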
{
"cell_type": "markdown",
"id": "7d073994",
"metadata": {},
"source": [
"Now we can run the training:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "43f9170a",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-05-19 11:56:26,664\tINFO services.py:1483 -- View the Ray dashboard at \u001B[1m\u001B[32mhttp://127.0.0.1:8266\u001B[39m\u001B[22m\n"
]
},
{
"data": {
"text/html": [
"== Status ==
Current time: 2022-05-19 11:56:51 (running for 00:00:20.56)
Memory usage on this node: 10.1/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/4.64 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/kai/ray_results/SklearnTrainer_2022-05-19_11-56-29
Number of trials: 1/1 (1 TERMINATED)
Trial name | status | loc | iter | total time (s) | fit_time |
---|---|---|---|---|---|
SklearnTrainer_564d9_00000 | TERMINATED | 127.0.0.1:12221 | 1 | 17.1905 | 2.48662 |