{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "9cf0e5ac",
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "29c1356a",
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"# flake8: noqa"
]
},
{
"cell_type": "markdown",
"id": "a97b7427",
"metadata": {},
"source": [
"# Tune's Scikit Learn Adapters\n",
"\n",
"Scikit-Learn is one of the most widely used tools in the ML community for working with data,\n",
"offering dozens of easy-to-use machine learning algorithms.\n",
"However, to achieve high performance for these algorithms, you often need to perform **model selection**.\n",
"\n",
"```{image} /images/tune-sklearn.png\n",
":align: center\n",
":width: 50%\n",
"```\n",
"\n",
"```{contents}\n",
":backlinks: none\n",
":local: true\n",
"```\n",
"\n",
"Scikit-Learn [has an existing module for model selection](https://scikit-learn.org/stable/modules/grid_search.html),\n",
"but the algorithms offered (Grid Search via ``GridSearchCV`` and Random Search via ``RandomizedSearchCV``)\n",
"are often considered inefficient.\n",
"In this tutorial, we'll cover ``tune-sklearn``, a drop-in replacement for Scikit-Learn's model selection module\n",
"with state-of-the-art optimization features such as early stopping and Bayesian Optimization.\n",
"\n",
"```{tip} \n",
"Check out the [tune-sklearn code](https://github.com/ray-project/tune-sklearn) and {ref}`documentation <tune-sklearn-docs>`.\n",
"```\n",
"\n",
"## Overview\n",
"\n",
"``tune-sklearn`` is a module that integrates Ray Tune's hyperparameter tuning and scikit-learn's Classifier API. \n",
"``tune-sklearn`` has two APIs: {ref}`TuneSearchCV <tunesearchcv-docs>`, and {ref}`TuneGridSearchCV <tunegridsearchcv-docs>`.\n",
"They are drop-in replacements for Scikit-learn's RandomizedSearchCV and GridSearchCV, so you only need to change\n",
"less than 5 lines in a standard Scikit-Learn script to use the API.\n",
"\n",
"Ray Tune's Scikit-learn APIs allows you to easily leverage Bayesian Optimization, HyperBand, and other cutting edge\n",
"tuning techniques by simply toggling a few parameters. It also supports and provides examples for many other\n",
"frameworks with Scikit-Learn wrappers such as Skorch (Pytorch), KerasClassifiers (Keras), and XGBoostClassifiers (XGBoost).\n",
"\n",
"Run ``pip install \"ray[tune]\" tune-sklearn`` to get started.\n",
"\n",
"## Walkthrough\n",
"\n",
"Let's compare Tune's Scikit-Learn APIs to the standard scikit-learn GridSearchCV. For this example, we'll be using\n",
"``TuneGridSearchCV`` with a\n",
"[SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html).\n",
"\n",
"\n",
"```{thebe-button} Activate code examples\n",
"```\n",
"\n",
"To start out, change the import statement to get tune-scikit-learns grid search cross validation interface:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5a0cc1d8",
"metadata": {},
"outputs": [],
"source": [
"# Keep this here for https://github.com/ray-project/ray/issues/11547\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"# Replace above line with:\n",
"from ray.tune.sklearn import TuneGridSearchCV"
]
},
{
"cell_type": "markdown",
"id": "3a8c2610",
"metadata": {},
"source": [
"And from there, we would proceed just like how we would in Scikit-Learns interface!\n",
"\n",
"The `SGDClassifier` has a ``partial_fit`` API, which enables it to stop fitting to the data for a certain\n",
"hyperparameter configuration.\n",
"If the estimator does not support early stopping, we would fall back to a parallel grid search."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "712b215e",
"metadata": {},
"outputs": [],
"source": [
"# Other imports\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import SGDClassifier\n",
"from sklearn.datasets import make_classification\n",
"import numpy as np\n",
"\n",
"# Create dataset\n",
"X, y = make_classification(\n",
" n_samples=11000,\n",
" n_features=1000,\n",
" n_informative=50,\n",
" n_redundant=0,\n",
" n_classes=10,\n",
" class_sep=2.5,\n",
")\n",
"x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=1000)\n",
"\n",
"# Example parameters to tune from SGDClassifier\n",
"parameter_grid = {\"alpha\": [1e-4, 1e-1, 1], \"epsilon\": [0.01, 0.1]}"
]
},
{
"cell_type": "markdown",
"id": "79870ffb",
"metadata": {},
"source": [
"As you can see, the setup here is exactly how you would do it for Scikit-Learn.\n",
"Now, let's try fitting a model."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f2541b0",
"metadata": {},
"outputs": [],
"source": [
"tune_search = TuneGridSearchCV(\n",
" SGDClassifier(), parameter_grid, early_stopping=True, max_iters=10\n",
")\n",
"\n",
"import time # Just to compare fit times\n",
"\n",
"start = time.time()\n",
"tune_search.fit(x_train, y_train)\n",
"end = time.time()\n",
"print(\"Tune GridSearch Fit Time:\", end - start)\n",
"# Tune GridSearch Fit Time: 15.436315774917603 (for an 8 core laptop)"
]
},
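{
"cell_type": "markdown",
"id": "4b1e7d2f",
"metadata": {},
"source": [
"Because ``TuneGridSearchCV`` is a drop-in replacement, the fitted search object exposes the familiar\n",
"Scikit-Learn attributes. A quick check of the result (the ``score`` call is a sketch and assumes the\n",
"estimator's default accuracy scorer):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d3c5e9a",
"metadata": {},
"outputs": [],
"source": [
"# Inspect the best hyperparameters found during the search.\n",
"print(tune_search.best_params_)\n",
"\n",
"# Evaluate the refit best estimator on the held-out test set.\n",
"print(tune_search.score(x_test, y_test))"
]
},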
{
"cell_type": "markdown",
"id": "831d6609",
"metadata": {},
"source": [
"Note the slight differences we introduced above:\n",
"\n",
" * a `early_stopping`, and\n",
" * a specification of `max_iters` parameter\n",
"\n",
"The ``early_stopping`` parameter allows us to terminate unpromising configurations. If ``early_stopping=True``,\n",
"TuneGridSearchCV will default to using Tune's ASHAScheduler.\n",
"You can pass in a custom algorithm - see {ref}`Tune's documentation on schedulers <tune-schedulers>`\n",
"here for a full list to choose from.\n",
"``max_iters`` is the maximum number of iterations a given hyperparameter set could run for;\n",
"it may run for fewer iterations if it is early stopped.\n",
"\n",
"Try running this compared to the GridSearchCV equivalent, and see the speedup for yourself!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bad624d5",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"# n_jobs=-1 enables use of all cores like Tune does\n",
"sklearn_search = GridSearchCV(SGDClassifier(), parameter_grid, n_jobs=-1)\n",
"\n",
"start = time.time()\n",
"sklearn_search.fit(x_train, y_train)\n",
"end = time.time()\n",
"print(\"Sklearn Fit Time:\", end - start)\n",
"# Sklearn Fit Time: 47.48055911064148 (for an 8 core laptop)"
]
},
{
"cell_type": "markdown",
"id": "328accb8",
"metadata": {},
"source": [
"## Using Bayesian Optimization\n",
"\n",
"In addition to the grid search interface, tune-sklearn also provides an interface,\n",
"TuneSearchCV, for sampling from **distributions of hyperparameters**.\n",
"In the following example we'll be using the\n",
"[digits dataset from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html)\n",
"\n",
"In addition, you can easily enable Bayesian optimization over the distributions in only 2 lines of code:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "21ccda8d",
"metadata": {},
"outputs": [],
"source": [
"# First run `pip install bayesian-optimization`\n",
"from ray.tune.sklearn import TuneSearchCV\n",
"from sklearn.linear_model import SGDClassifier\n",
"from sklearn import datasets\n",
"from sklearn.model_selection import train_test_split\n",
"import numpy as np\n",
"\n",
"digits = datasets.load_digits()\n",
"x = digits.data\n",
"y = digits.target\n",
"x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)\n",
"\n",
"clf = SGDClassifier()\n",
"parameter_grid = {\"alpha\": (1e-4, 1), \"epsilon\": (0.01, 0.1)}\n",
"\n",
"tune_search = TuneSearchCV(\n",
" clf,\n",
" parameter_grid,\n",
" search_optimization=\"bayesian\",\n",
" n_trials=3,\n",
" early_stopping=True,\n",
" max_iters=10,\n",
")\n",
"tune_search.fit(x_train, y_train)\n",
"print(tune_search.best_params_)\n",
"# {'alpha': 0.37460266483547777, 'epsilon': 0.09556428757689246}"
]
},
{
"cell_type": "markdown",
"id": "0fb1dc0d",
"metadata": {},
"source": [
"As you can see, its very simple to integrate tune-sklearn into existing code.\n",
"Distributed execution is also easy - you can simply run ``ray.init(address=\"auto\")`` before\n",
"TuneSearchCV to connect to the Ray cluster and parallelize tuning across multiple nodes,\n",
"as you would in any other Ray Tune script.\n",
"\n",
"## More Scikit-Learn Examples\n",
"\n",
"See the [ray-project/tune-sklearn examples](https://github.com/ray-project/tune-sklearn/tree/master/examples)\n",
"for a comprehensive list of examples leveraging Tune's sklearn interface.\n",
"Check out more detailed examples and get started with tune-sklearn!\n",
"\n",
"* [Skorch with tune-sklearn](https://github.com/ray-project/tune-sklearn/blob/master/examples/torch_nn.py)\n",
"* [Scikit-Learn Pipelines with tune-sklearn](https://github.com/ray-project/tune-sklearn/blob/master/examples/sklearn_pipeline.py)\n",
"* [XGBoost with tune-sklearn](https://github.com/ray-project/tune-sklearn/blob/master/examples/xgbclassifier.py)\n",
"* [KerasClassifier with tune-sklearn](https://github.com/ray-project/tune-sklearn/blob/master/examples/keras_example.py)\n",
"* [LightGBM with tune-sklearn](https://github.com/ray-project/tune-sklearn/blob/master/examples/lgbm.py)\n",
"\n",
"## Further Reading\n",
"\n",
"If you're using scikit-learn for other tasks, take a look at Rays {ref}`replacement for joblib <ray-joblib>`,\n",
"which allows users to parallelize scikit learn jobs over multiple nodes."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"orphan": true
},
"nbformat": 4,
"nbformat_minor": 5
}