ray/doc/source/ray-more-libs/joblib.rst
Max Pumperla f9b71a8bf6
[docs] new structure (#21776)
2022-01-21 15:42:05 -08:00


.. _ray-joblib:

Distributed Scikit-learn / Joblib
=================================

.. _`issue on GitHub`: https://github.com/ray-project/ray/issues

Ray supports running distributed `scikit-learn`_ programs by
implementing a Ray backend for `joblib`_ using `Ray Actors <actors.html>`__
instead of local processes. This makes it easy to scale existing applications
that use scikit-learn from a single node to a cluster.

.. note::

  This API is new and may be revised in future Ray releases. If you encounter
  any bugs, please file an `issue on GitHub`_.

.. _`joblib`: https://joblib.readthedocs.io
.. _`scikit-learn`: https://scikit-learn.org

Quickstart
----------

To get started, first `install Ray <installation.html>`__, then import
``register_ray`` with ``from ray.util.joblib import register_ray`` and call
``register_ray()``. This registers Ray as a joblib backend for scikit-learn to use.
Then run your original scikit-learn code inside
``with joblib.parallel_backend('ray')``. This will start a local Ray cluster.
See the `Run on a Cluster`_ section below for instructions to run on
a multi-node Ray cluster instead.

.. code-block:: python

  import numpy as np
  from sklearn.datasets import load_digits
  from sklearn.model_selection import RandomizedSearchCV
  from sklearn.svm import SVC

  digits = load_digits()

  # Hyperparameter distributions sampled by the randomized search.
  param_space = {
      'C': np.logspace(-6, 6, 30),
      'gamma': np.logspace(-8, 8, 30),
      'tol': np.logspace(-4, -1, 30),
      'class_weight': [None, 'balanced'],
  }
  model = SVC(kernel='rbf')
  search = RandomizedSearchCV(model, param_space, cv=5, n_iter=300, verbose=10)

  import joblib
  from ray.util.joblib import register_ray

  # Register Ray as a joblib backend, then fit inside the backend context.
  register_ray()
  with joblib.parallel_backend('ray'):
      search.fit(digits.data, digits.target)
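
Because Ray is registered as an ordinary joblib backend, the same pattern
should also work for plain joblib workloads outside of scikit-learn. The
following is a minimal sketch rather than an official recipe: ``square`` and
``run_on_ray`` are hypothetical names chosen for illustration, and the imports
are deferred so that the module can be loaded even where Ray is not installed.

```python
def square(x):
    """Toy task; stands in for any picklable function."""
    return x * x


def run_on_ray(values):
    """Apply ``square`` to each value through the Ray joblib backend.

    Hypothetical helper for illustration; not part of Ray or joblib.
    """
    # Imported lazily so this module loads without joblib or Ray present.
    import joblib
    from ray.util.joblib import register_ray

    register_ray()  # makes 'ray' a valid joblib backend name
    with joblib.parallel_backend('ray'):
        # n_jobs=-1 asks joblib to use all workers the backend offers.
        return joblib.Parallel(n_jobs=-1)(
            joblib.delayed(square)(v) for v in values
        )
```

With Ray and joblib installed, calling ``run_on_ray(range(5))`` would be
expected to return the squares as a list in input order, since
``joblib.Parallel`` preserves the order of its tasks.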

Run on a Cluster
----------------

This section assumes that you have a running Ray cluster. To start a Ray cluster,
please refer to the `cluster setup <cluster/index.html>`__ instructions.

To connect scikit-learn to a running Ray cluster, specify the address of the
head node by setting the ``RAY_ADDRESS`` environment variable.

You can also start Ray manually by calling ``ray.init()`` (with any of its supported
configuration options) before calling ``with joblib.parallel_backend('ray')``.

.. warning::

    If you do not set the ``RAY_ADDRESS`` environment variable and do not provide
    ``address`` in ``ray.init(address=<address>)``, then scikit-learn will run on a SINGLE node!
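
One way to avoid silently falling back to a single node is to check for a
cluster address before fitting. The sketch below is an assumption about how
such a guard could look, not part of Ray's API: ``resolve_ray_address`` and
``connect_to_cluster`` are hypothetical helper names, and only
``ray.init(address=...)`` and ``register_ray()`` come from Ray itself.

```python
import os


def resolve_ray_address(env=None):
    """Return the address stored in RAY_ADDRESS, or None if unset or empty.

    Hypothetical helper; ``env`` defaults to ``os.environ`` and may be
    replaced by a plain dict for testing.
    """
    env = os.environ if env is None else env
    address = env.get("RAY_ADDRESS", "").strip()
    return address or None


def connect_to_cluster(address=None):
    """Connect to an existing Ray cluster and register the joblib backend.

    Raises instead of quietly starting a local, single-node Ray instance.
    """
    address = address or resolve_ray_address()
    if address is None:
        raise RuntimeError(
            "No cluster address given: set RAY_ADDRESS or pass address=..."
        )

    # Imported lazily so the address check works without Ray installed.
    import ray
    from ray.util.joblib import register_ray

    ray.init(address=address)
    register_ray()
    return address
```

After ``connect_to_cluster()`` returns, code run inside
``with joblib.parallel_backend('ray')`` would use the cluster at the resolved
address. Raising an error is a deliberate choice here; a warning would suit
setups where a local fallback is acceptable.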