.. _xgboost-ray:

XGBoost on Ray
==============

This library adds a new backend for XGBoost utilizing Ray.

Please note that this is an early version and both the API and
the behavior can change without prior notice.

Installation
------------

You can install XGBoost on Ray (``xgboost_ray``) like this:

.. code-block:: bash

    git clone https://github.com/ray-project/xgboost_ray.git
    cd xgboost_ray
    pip install -e .

Usage
-----

After installation, you can import XGBoost on Ray in two ways:

.. code-block:: python

    import xgboost_ray
    # or
    import ray.util.xgboost

``xgboost_ray`` provides a drop-in replacement for XGBoost's ``train``
function. To pass data, instead of using ``xgb.DMatrix`` you will
have to use ``ray.util.xgboost.RayDMatrix``.

Here is a simplified example:

.. literalinclude:: /../../python/ray/util/xgboost/simple_example.py
   :language: python
   :start-after: __xgboost_begin__
   :end-before: __xgboost_end__
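
For illustration, a minimal training call might look like the following
sketch (it assumes ``sklearn`` is installed for the example data; the
argument names follow the ones described in this document):

.. code-block:: python

    from sklearn.datasets import load_breast_cancer
    from ray.util.xgboost import RayDMatrix, train

    # Load a small example dataset and wrap it in a RayDMatrix
    data, labels = load_breast_cancer(return_X_y=True)
    train_set = RayDMatrix(data, labels)

    evals_result = {}
    # Drop-in replacement for xgb.train: same parameters, but the data
    # is passed as a RayDMatrix and training runs on Ray actors
    bst = train(
        {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
        train_set,
        evals_result=evals_result,
        evals=[(train_set, "train")],
        num_actors=2,
        verbose_eval=False)

    bst.save_model("model.xgb")
    print("Final training error:",
          evals_result["train"]["error"][-1])
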
Data loading
------------

Data is passed to ``xgboost_ray`` via a ``RayDMatrix`` object.

The ``RayDMatrix`` lazily loads data and stores it sharded in the
Ray object store. The Ray XGBoost actors then access these
shards to run their training on.

A ``RayDMatrix`` supports various data and file types, like
Pandas DataFrames, NumPy arrays, CSV files and Parquet files.
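
For instance, a ``RayDMatrix`` can be constructed directly from an
in-memory Pandas DataFrame. This is a minimal sketch; the column
names are hypothetical and for illustration only:

.. code-block:: python

    import pandas as pd
    from ray.util.xgboost import RayDMatrix

    df = pd.DataFrame({
        "feature_1": [0.1, 0.2, 0.3, 0.4],
        "feature_2": [1.0, 0.5, 0.25, 0.125],
        "label": [0, 1, 0, 1],
    })

    # As in the Parquet example below, the label column is
    # selected by name
    dtrain = RayDMatrix(df, label="label")
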
Example loading multiple Parquet files:

.. code-block:: python

    import glob
    from ray.util.xgboost import RayDMatrix, RayFileType

    # We can also pass a list of files
    path = list(sorted(glob.glob("/data/nyc-taxi/*/*/*.parquet")))

    # This argument will be passed to pd.read_parquet()
    columns = [
        "passenger_count",
        "trip_distance", "pickup_longitude", "pickup_latitude",
        "dropoff_longitude", "dropoff_latitude",
        "fare_amount", "extra", "mta_tax", "tip_amount",
        "tolls_amount", "total_amount"
    ]

    dtrain = RayDMatrix(
        path,
        label="passenger_count",  # Will select this column as the label
        columns=columns,
        filetype=RayFileType.PARQUET)
Hyperparameter Tuning
---------------------

``xgboost_ray`` integrates with Ray Tune (:ref:`tune-main`) to provide distributed hyperparameter tuning for your
distributed XGBoost models. You can run multiple ``xgboost_ray`` training runs in parallel, each with a different
hyperparameter configuration, while each individual training run is itself parallelized.

First, move your training code into a function. This function should take in a ``config`` argument which specifies the
hyperparameters for the XGBoost model.

.. literalinclude:: /../../python/ray/util/xgboost/simple_tune.py
   :language: python
   :start-after: __train_begin__
   :end-before: __train_end__
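
A minimal sketch of such a function might look like this (assuming the
functional Ray Tune API with ``tune.report`` and ``sklearn`` for the
example data; the bundled ``simple_tune.py`` may differ in detail):

.. code-block:: python

    from ray import tune
    from sklearn.datasets import load_breast_cancer
    from ray.util.xgboost import RayDMatrix, train

    def train_model(config):
        # Tune passes the sampled hyperparameters in as ``config``
        data, labels = load_breast_cancer(return_X_y=True)
        train_set = RayDMatrix(data, labels)

        evals_result = {}
        train(
            config,
            train_set,
            evals_result=evals_result,
            evals=[(train_set, "train")],
            num_actors=2,
            verbose_eval=False)

        # Report the final training error back to Tune
        tune.report(train_error=evals_result["train"]["error"][-1])
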
Then, import Tune and use its search primitives to define a hyperparameter search space.

.. literalinclude:: /../../python/ray/util/xgboost/simple_tune.py
   :language: python
   :start-after: __tune_begin__
   :end-before: __tune_end__
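
As a sketch, such a search space could look like this (the searched
parameters and their ranges are illustrative only):

.. code-block:: python

    from ray import tune

    config = {
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        # Tune samples a different value of these for each trial
        "eta": tune.loguniform(1e-4, 1e-1),
        "max_depth": tune.randint(1, 9),
    }
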
Finally, call ``tune.run``, passing in the training function and the ``config``. Internally, Tune will resolve the
hyperparameter search space and invoke the training function multiple times, each with different hyperparameters.

.. literalinclude:: /../../python/ray/util/xgboost/simple_tune.py
   :language: python
   :start-after: __tune_run_begin__
   :end-before: __tune_run_end__

Make sure you set the ``extra_cpu`` field appropriately so Tune is aware of the total number of resources each trial
requires.
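
For example, if each trial trains with two XGBoost actors that use one
CPU each, the call could look like the following sketch (building on the
hypothetical ``train_model`` function and ``config`` from the sketches
above):

.. code-block:: python

    from ray import tune

    analysis = tune.run(
        train_model,
        config=config,
        # 1 CPU for the trial driver plus 2 CPUs reserved for the
        # XGBoost actors it spawns
        resources_per_trial={"cpu": 1, "extra_cpu": 2},
        num_samples=4)

    print("Best hyperparameters:",
          analysis.get_best_config(metric="train_error", mode="min"))
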
Resources
---------

By default, ``xgboost_ray`` tries to determine the number of CPUs
available and distributes them evenly across actors.

In the case of very large clusters or clusters with many different
machine sizes, it makes sense to limit the number of CPUs per actor
by setting the ``cpus_per_actor`` argument. Consider always
setting this explicitly.

The number of XGBoost actors always has to be set manually with
the ``num_actors`` argument.
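
Put together, a resource-aware training call might look like this
minimal sketch (assuming ``sklearn`` for the example data):

.. code-block:: python

    from sklearn.datasets import load_breast_cancer
    from ray.util.xgboost import RayDMatrix, train

    data, labels = load_breast_cancer(return_X_y=True)
    dtrain = RayDMatrix(data, labels)

    bst = train(
        {"objective": "binary:logistic"},
        dtrain,
        num_actors=4,       # always has to be set manually
        cpus_per_actor=2)   # recommended to set explicitly
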
More examples
-------------

For complete end-to-end examples, please have a look at
the `examples folder <https://github.com/ray-project/xgboost_ray/tree/master/examples/>`__:

* `Simple sklearn breast cancer dataset example <https://github.com/ray-project/xgboost_ray/tree/master/examples/simple.py>`__ (requires ``sklearn``)
* `Simple sklearn breast cancer dataset example with Ray Tune <https://github.com/ray-project/xgboost_ray/tree/master/examples/simple_tune.py>`__ (requires ``sklearn``)
* `HIGGS classification example <https://github.com/ray-project/xgboost_ray/tree/master/examples/higgs.py>`__
  (`download dataset (2.6 GB) <https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz>`__)
* `HIGGS classification example with Parquet <https://github.com/ray-project/xgboost_ray/tree/master/examples/higgs_parquet.py>`__ (uses the same dataset)
* `Test data classification <https://github.com/ray-project/xgboost_ray/tree/master/examples/train_on_test_data.py>`__ (uses a self-generated dataset)