# Training a model with distributed XGBoost
In this example we will train a model in Ray AIR using distributed XGBoost.

Let's start with installing our dependencies:

In [1]:
!pip install -qU "ray[tune]" xgboost_ray

Then we need some imports:

In [2]:
from typing import Tuple

import pandas as pd

import ray
from ray.train.batch_predictor import BatchPredictor
from ray.train.xgboost import XGBoostPredictor
from ray.train.xgboost import XGBoostTrainer
from ray.data.dataset import Dataset
from ray.air.result import Result
from ray.data.preprocessors import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

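Next we define a function to load our train, validation, and test datasets. Note that the validation and test datasets both come from the same held-out 30% split: the test dataset is simply the validation data with the `target` column dropped.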

In [3]:
def prepare_data() -> Tuple[Dataset, Dataset, Dataset]:
    data_raw = load_breast_cancer()
    dataset_df = pd.DataFrame(data_raw["data"], columns=data_raw["feature_names"])
    dataset_df["target"] = data_raw["target"]
    train_df, test_df = train_test_split(dataset_df, test_size=0.3)
    train_dataset = ray.data.from_pandas(train_df)
    valid_dataset = ray.data.from_pandas(test_df)
    test_dataset = ray.data.from_pandas(test_df.drop("target", axis=1))
    return train_dataset, valid_dataset, test_dataset
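
To make the split sizes concrete, here is a minimal standard-library sketch of a shuffled 70/30 split. It is not scikit-learn's implementation: `split_indices` and its `seed` parameter are hypothetical names, and the sketch assumes the test size is rounded up. Under that assumption, the 569 rows of the breast cancer dataset yield 398 training rows and 171 held-out rows, which matches the counts in the output further down.

```python
import math
import random


def split_indices(n_rows, test_fraction=0.3, seed=0):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Assumption of this sketch: the test size is rounded up.
    n_test = math.ceil(test_fraction * n_rows)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(569)
print(len(train_idx), len(test_idx))
```

Because `prepare_data()` does not fix a random seed, each call produces a different split, so the rows used for validation in training and the rows used later for prediction are not guaranteed to be the same.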

The following function will create an XGBoost trainer, train it, and return the result.

In [4]:
def train_xgboost(num_workers: int, use_gpu: bool = False) -> Result:
    train_dataset, valid_dataset, _ = prepare_data()

    # Standardize two arbitrarily chosen feature columns
    columns_to_scale = ["mean radius", "mean texture"]
    preprocessor = StandardScaler(columns=columns_to_scale)

    # XGBoost specific params
    params = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    }

    trainer = XGBoostTrainer(
        scaling_config={
            "num_workers": num_workers,
            "use_gpu": use_gpu,
        },
        label_column="target",
        params=params,
        datasets={"train": train_dataset, "valid": valid_dataset},
        preprocessor=preprocessor,
        num_boost_round=100,
    )
    result = trainer.fit()
    print(result.metrics)

    return result
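
As a rough illustration of what the `StandardScaler` preprocessor does to each listed column, the sketch below standardizes a list of numbers in plain Python. It is not Ray's implementation: `standard_scale` is a hypothetical helper, the input values are made up, and dividing by N (population standard deviation) is an assumption of this sketch.

```python
import math


def standard_scale(values):
    """Shift values to zero mean and scale them to unit standard deviation."""
    mean = sum(values) / len(values)
    # Population standard deviation (divide by N); an assumption of this sketch.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        # A constant column carries no spread, so map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up values standing in for a column such as "mean radius".
scaled = standard_scale([12.0, 14.0, 16.0, 18.0])
print(scaled)
```

Tree-based models split on feature thresholds, so scaling is usually not required for XGBoost. Here it mainly shows how a preprocessor is attached to the trainer.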

Once we have the result, we can do batch inference on the obtained model. Let's define a utility function for this.

In [5]:
def predict_xgboost(result: Result):
    _, _, test_dataset = prepare_data()

    batch_predictor = BatchPredictor.from_checkpoint(
        result.checkpoint, XGBoostPredictor
    )

    predicted_labels = (
        batch_predictor.predict(test_dataset)
        .map_batches(lambda df: (df > 0.5).astype(int), batch_format="pandas")
        .to_pandas(limit=float("inf"))
    )
    print(f"PREDICTED LABELS\n{predicted_labels}")

    shap_values = batch_predictor.predict(test_dataset, pred_contribs=True).to_pandas(
        limit=float("inf")
    )
    print(f"SHAP VALUES\n{shap_values}")
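
The second `predict` call above passes `pred_contribs=True`, which asks XGBoost for per-feature contributions instead of probabilities. Each row then holds one contribution per feature plus a final bias term, and their sum is the raw (untransformed) margin of the prediction. For the `binary:logistic` objective, applying the sigmoid to that margin recovers the predicted probability. A minimal sketch with made-up numbers, where `probability_from_contributions` is a hypothetical helper:

```python
import math


def probability_from_contributions(contributions):
    """Sum feature contributions and bias into a margin, then apply the sigmoid."""
    margin = sum(contributions)
    return 1.0 / (1.0 + math.exp(-margin))


# Hypothetical row: three feature contributions followed by the bias term.
row = [-0.5, 1.25, 0.0, 0.75]
prob = probability_from_contributions(row)
print(round(prob, 4))
```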


Now we can run the training:

In [6]:
result = train_xgboost(num_workers=2, use_gpu=False)

2022-05-19 11:44:42,413	INFO services.py:1483 -- View the Ray dashboard at http://127.0.0.1:8265


| Trial name | status | loc | iter | total time (s) | train-logloss | train-error | valid-logloss |
|---|---|---|---|---|---|---|---|
| XGBoostTrainer_b273b_00000 | TERMINATED | 127.0.0.1:11036 | 100 | 9.03935 | 0.005949 | 0 | 0.07483 |


[2m[33m(raylet)[0m 2022-05-19 11:44:47,554	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=54067 --object-store-name=/tmp/ray/session_2022-05-19_11-44-39_813259_10959/sockets/plasma_store --raylet-name=/tmp/ray/session_2022-05-19_11-44-39_813259_10959/sockets/raylet --redis-address=None --storage=None --temp-dir=/tmp/ray --metrics-agent-port=61242 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --gcs-address=127.0.0.1:61017 --redis-password=5241590000000000 --startup-token=16 --runtime-env-hash=-2010331134
[2m[33m(raylet)[0m 2022-05-19 11:44:51,603	INFO context.py:70 -- Exec'ing worker with command: exec /Users/kai/.pyenv/versions/3.7.7/bin/python3.7 /Users/kai/coding/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --node-manager-port=54067 --object-store-name=/tmp/ray/session_2022-

Result for XGBoostTrainer_b273b_00000:
  date: 2022-05-19_11-44-57
  done: false
  experiment_id: 991235d8b76649398688695ca70a08e4
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 1
  node_ip: 127.0.0.1
  pid: 11036
  should_checkpoint: true
  time_since_restore: 7.17207407951355
  time_this_iter_s: 7.17207407951355
  time_total_s: 7.17207407951355
  timestamp: 1652957097
  timesteps_since_restore: 0
  train-error: 0.020101
  train-logloss: 0.465715
  training_iteration: 1
  trial_id: b273b_00000
  valid-error: 0.052632
  valid-logloss: 0.480831
  warmup_time: 0.003935098648071289
  


[2m[36m(GBDTTrainable pid=11036)[0m 2022-05-19 11:44:59,796	INFO main.py:1519 -- [RayXGBoost] Finished XGBoost training on training data with total N=398 in 6.80 seconds (2.92 pure XGBoost training time).


Result for XGBoostTrainer_b273b_00000:
  date: 2022-05-19_11-44-59
  done: true
  experiment_id: 991235d8b76649398688695ca70a08e4
  experiment_tag: '0'
  hostname: Kais-MacBook-Pro.local
  iterations_since_restore: 100
  node_ip: 127.0.0.1
  pid: 11036
  should_checkpoint: true
  time_since_restore: 9.03934907913208
  time_this_iter_s: 0.018042802810668945
  time_total_s: 9.03934907913208
  timestamp: 1652957099
  timesteps_since_restore: 0
  train-error: 0.0
  train-logloss: 0.005949
  training_iteration: 100
  trial_id: b273b_00000
  valid-error: 0.017544
  valid-logloss: 0.07483
  warmup_time: 0.003935098648071289
  


2022-05-19 11:45:00,535	INFO tune.py:753 -- Total run time: 15.30 seconds (13.91 seconds for the tuning loop).


{'train-logloss': 0.005949, 'train-error': 0.0, 'valid-logloss': 0.07483, 'valid-error': 0.017544, 'time_this_iter_s': 0.018042802810668945, 'should_checkpoint': True, 'done': True, 'timesteps_total': None, 'episodes_total': None, 'training_iteration': 100, 'trial_id': 'b273b_00000', 'experiment_id': '991235d8b76649398688695ca70a08e4', 'date': '2022-05-19_11-44-59', 'timestamp': 1652957099, 'time_total_s': 9.03934907913208, 'pid': 11036, 'hostname': 'Kais-MacBook-Pro.local', 'node_ip': '127.0.0.1', 'config': {}, 'time_since_restore': 9.03934907913208, 'timesteps_since_restore': 0, 'iterations_since_restore': 100, 'warmup_time': 0.003935098648071289, 'experiment_tag': '0'}
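
The `logloss` and `error` entries in these metrics come from the `eval_metric` list passed to the trainer. As a rough guide to reading them, the sketch below computes both for a handful of made-up labels and probabilities. The helpers `binary_logloss` and `binary_error` are hypothetical and only approximate what XGBoost computes: log loss is the mean negative log-likelihood, and error is the fraction of rows misclassified when probabilities above 0.5 count as positive.

```python
import math


def binary_logloss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip probabilities so log() never receives exactly 0 or 1.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


def binary_error(labels, probs, threshold=0.5):
    """Fraction of rows whose thresholded prediction differs from the label."""
    wrong = sum(1 for y, p in zip(labels, probs) if (p > threshold) != bool(y))
    return wrong / len(labels)


# Made-up labels and predicted probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.1]
print(binary_logloss(labels, probs), binary_error(labels, probs))
```

Read this way, the final `valid-error` of 0.017544 corresponds to 3 misclassified rows out of the 171 validation rows.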


Then we perform inference with the trained model:

In [7]:
predict_xgboost(result)

Map Progress (1 actors 1 pending): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.96s/it]
Map_Batches: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 87.81it/s]


PREDICTED LABELS
     predictions
0              0
1              0
2              1
3              1
4              0
..           ...
166            1
167            1
168            0
169            1
170            0

[171 rows x 1 columns]
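
The labels above are produced by the `map_batches` step in `predict_xgboost`, which compares each predicted probability with 0.5. The same rule in plain Python, as a minimal sketch with made-up probabilities (`to_labels` is a hypothetical helper):

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 if it is strictly above the threshold, else 0."""
    return [int(p > threshold) for p in probabilities]


predicted = to_labels([0.03, 0.5, 0.51, 0.97])
print(predicted)
```

The comparison is strict, so a probability of exactly 0.5 is labeled 0.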


Map Progress (1 actors 1 pending): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.78s/it]

SHAP VALUES
     predictions_0  predictions_1  predictions_2  predictions_3  \
0        -0.139882      -0.748878            0.0      -1.143079   
1         0.013840      -1.053747            0.0       0.361219   
2        -0.082575       0.952107            0.0       0.396908   
3         0.016314       0.916166            0.0       0.535740   
4        -0.087534       1.317693            0.0      -0.631737   
..             ...            ...            ...            ...   
166       0.016314       1.006091            0.0       0.535740   
167       0.010002       0.948294            0.0       0.529942   
168      -0.084481       0.766085            0.0      -0.582221   
169       0.010002       0.846374            0.0       0.502846   
170      -0.108186      -1.032712            0.0      -0.737255   

     predictions_4  predictions_5  predictions_6  predictions_7  \
0         0.228545       0.074653      -0.033109      -1.680274   
1        -0.386373       0.030964      -0.026341 


