.. _xgboost-ray: XGBoost on Ray ============== This library adds a new backend for XGBoost utilizing Ray. Please note that this is an early version and both the API and the behavior can change without prior notice. Installation ------------ You can install XGBoost on Ray (``xgboost_ray``) like this: .. code-block:: bash git clone https://github.com/ray-project/xgboost_ray.git cd xgboost_ray pip install -e . Usage ----- After installation, you can import XGBoost on Ray via two ways: .. code-block:: bash import xgboost_ray # or import ray.util.xgboost ``xgboost_ray`` provides a drop-in replacement for XGBoost's ``train`` function. To pass data, instead of using ``xgb.DMatrix`` you will have to use ``ray.util.xgboost.RayDMatrix``. Here is a simplified example: .. literalinclude:: /../../python/ray/util/xgboost/simple_example.py :language: python :start-after: __xgboost_begin__ :end-before: __xgboost_end__ Data loading ------------ Data is passed to ``xgboost_ray`` via a ``RayDMatrix`` object. The ``RayDMatrix`` lazy loads data and stores it sharded in the Ray object store. The Ray XGBoost actors then access these shards to run their training on. A ``RayDMatrix`` support various data and file types, like Pandas DataFrames, Numpy Arrays, CSV files and Parquet files. Example loading multiple parquet files: .. code-block:: python import glob from ray.util.xgboost import RayDMatrix, RayFileType # We can also pass a list of files path = list(sorted(glob.glob("/data/nyc-taxi/*/*/*.parquet"))) # This argument will be passed to pd.read_parquet() columns = [ "passenger_count", "trip_distance", "pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude", "fare_amount", "extra", "mta_tax", "tip_amount", "tolls_amount", "total_amount" ] dtrain = RayDMatrix( path, label="passenger_count", # Will select this column as the label columns=columns, filetype=RayFileType.PARQUET) Hyperparameter Tuning --------------------- ``xgboost_ray`` integrates with Ray Tune (:ref:`tune-main`) to provide distributed hyperparameter tuning for your distributed XGBoost models. You can run multiple ``xgboost_ray`` training runs in parallel, each with a different hyperparameter configuration, with each individual training run parallelized. First, move your training code into a function. This function should take in a ``config`` argument which specifies the hyperparameters for the xgboost model. .. literalinclude:: /../../python/ray/util/xgboost/simple_tune.py :language: python :start-after: __train_begin__ :end-before: __train_end__ Then, you import tune and use tune's search primitives to define a hyperparameter search space. .. literalinclude:: /../../python/ray/util/xgboost/simple_tune.py :language: python :start-after: __tune_begin__ :end-before: __tune_end__ Finally, you call ``tune.run`` passing in the training function and the ``config``. Internally, tune will resolve the hyperparameter search space and invoke the training function multiple times, each with different hyperparameters. .. literalinclude:: /../../python/ray/util/xgboost/simple_tune.py :language: python :start-after: __tune_run_begin__ :end-before: __tune_run_end__ Make sure you set the ``extra_cpu`` field appropriately so tune is aware of the total number of resources each trial requires. Resources --------- By default, ``xgboost_ray`` tries to determine the number of CPUs available and distributes them evenly across actors. In the case of very large clusters or clusters with many different machine sizes, it makes sense to limit the number of CPUs per actor by setting the ``cpus_per_actor`` argument. Consider always setting this explicitly. The number of XGBoost actors always has to be set manually with the ``num_actors`` argument. More examples ------------- Fore complete end to end examples, please have a look at the `examples folder `__: * `Simple sklearn breastcancer dataset example `__ (requires `sklearn`) * `Simple sklearn breastcancer dataset example with Ray Tune `__ (requires `sklearn`) * `HIGGS classification example `__ * `[download dataset (2.6 GB)] `__ * `HIGGS classification example with Parquet `__ (uses the same dataset) * `Test data classification `__ (uses a self-generated dataset)