{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "9cf0e5ac",
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "29c1356a",
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"# flake8: noqa"
]
},
{
"cell_type": "markdown",
"id": "a97b7427",
"metadata": {},
"source": [
"# Tune's Scikit Learn Adapters\n",
"\n",
"Scikit-Learn is one of the most widely used tools in the ML community for working with data,\n",
"offering dozens of easy-to-use machine learning algorithms.\n",
"However, to achieve high performance for these algorithms, you often need to perform **model selection**.\n",
"\n",
"```{image} /images/tune-sklearn.png\n",
":align: center\n",
":width: 50%\n",
"```\n",
"\n",
"```{contents}\n",
":backlinks: none\n",
":local: true\n",
"```\n",
"\n",
"Scikit-Learn [has an existing module for model selection](https://scikit-learn.org/stable/modules/grid_search.html),\n",
"but the algorithms offered (Grid Search via ``GridSearchCV`` and Random Search via ``RandomizedSearchCV``)\n",
"are often considered inefficient.\n",
"In this tutorial, we'll cover ``tune-sklearn``, a drop-in replacement for Scikit-Learn's model selection module\n",
"with state-of-the-art optimization features such as early stopping and Bayesian Optimization.\n",
"\n",
"```{tip} \n",
"Check out the [tune-sklearn code](https://github.com/ray-project/tune-sklearn) and {ref}`documentation <tune-sklearn-docs>`.\n",
"```\n",
"\n",
"## Overview\n",
"\n",
"``tune-sklearn`` is a module that integrates Ray Tune's hyperparameter tuning and scikit-learn's Classifier API. \n",
"``tune-sklearn`` has two APIs: {ref}`TuneSearchCV <tunesearchcv-docs>`, and {ref}`TuneGridSearchCV <tunegridsearchcv-docs>`.\n",
"They are drop-in replacements for Scikit-learn's RandomizedSearchCV and GridSearchCV, so you only need to change\n",
"less than 5 lines in a standard Scikit-Learn script to use the API.\n",
"\n",
"Ray Tune's Scikit-learn APIs allows you to easily leverage Bayesian Optimization, HyperBand, and other cutting edge\n",
"tuning techniques by simply toggling a few parameters. It also supports and provides examples for many other\n",
"frameworks with Scikit-Learn wrappers such as Skorch (Pytorch), KerasClassifiers (Keras), and XGBoostClassifiers (XGBoost).\n",
"\n",
"Run ``pip install \"ray[tune]\" tune-sklearn`` to get started.\n",
"\n",
"## Walkthrough\n",
"\n",
"Let's compare Tune's Scikit-Learn APIs to the standard scikit-learn GridSearchCV. For this example, we'll be using\n",
"``TuneGridSearchCV`` with a\n",
"[SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html).\n",
"\n",
"\n",
"```{thebe-button} Activate code examples\n",
"```\n",
"\n",
"To start out, change the import statement to get tune-scikit-learns grid search cross validation interface:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5a0cc1d8",
"metadata": {},
"outputs": [],
"source": [
"# Keep this here for https://github.com/ray-project/ray/issues/11547\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"# Replace above line with:\n",
"from ray.tune.sklearn import TuneGridSearchCV"
]
},
{
"cell_type": "markdown",
"id": "3a8c2610",
"metadata": {},
"source": [
"And from there, we would proceed just like how we would in Scikit-Learns interface!\n",
"\n",
"The `SGDClassifier` has a ``partial_fit`` API, which enables it to stop fitting to the data for a certain\n",
"hyperparameter configuration.\n",
"If the estimator does not support early stopping, we would fall back to a parallel grid search."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "712b215e",
"metadata": {},
"outputs": [],
"source": [
"# Other imports\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import SGDClassifier\n",
"from sklearn.datasets import make_classification\n",
"import numpy as np\n",
"\n",
"# Create dataset\n",
"X, y = make_classification(\n",
" n_samples=11000,\n",
" n_features=1000,\n",
" n_informative=50,\n",
" n_redundant=0,\n",
" n_classes=10,\n",
" class_sep=2.5,\n",
")\n",
"x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=1000)\n",
"\n",
"# Example parameters to tune from SGDClassifier\n",
"parameter_grid = {\"alpha\": [1e-4, 1e-1, 1], \"epsilon\": [0.01, 0.1]}"
]
},
{
"cell_type": "markdown",
"id": "79870ffb",
"metadata": {},
"source": [
"As you can see, the setup here is exactly how you would do it for Scikit-Learn.\n",
"Now, let's try fitting a model."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f2541b0",
"metadata": {},
"outputs": [],
"source": [
"tune_search = TuneGridSearchCV(\n",
" SGDClassifier(), parameter_grid, early_stopping=True, max_iters=10\n",
")\n",
"\n",
"import time # Just to compare fit times\n",
"\n",
"start = time.time()\n",
"tune_search.fit(x_train, y_train)\n",
"end = time.time()\n",
"print(\"Tune GridSearch Fit Time:\", end - start)\n",
"# Tune GridSearch Fit Time: 15.436315774917603 (for an 8 core laptop)"
]
},
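{
"cell_type": "markdown",
"id": "4b1e7d2f",
"metadata": {},
"source": [
"Because ``TuneGridSearchCV`` is a drop-in replacement, the fitted search object exposes the familiar\n",
"Scikit-Learn attributes. A quick check of the result (the ``score`` call is a sketch and assumes the\n",
"estimator's default accuracy scorer):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d3c5e9a",
"metadata": {},
"outputs": [],
"source": [
"# Inspect the best hyperparameters found during the search.\n",
"print(tune_search.best_params_)\n",
"\n",
"# Evaluate the refit best estimator on the held-out test set.\n",
"print(tune_search.score(x_test, y_test))"
]
},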
{
"cell_type": "markdown",
"id": "831d6609",
"metadata": {},
"source": [
"Note the slight differences we introduced above:\n",
"\n",
" * a `early_stopping`, and\n",
" * a specification of `max_iters` parameter\n",
"\n",
"The ``early_stopping`` parameter allows us to terminate unpromising configurations. If ``early_stopping=True``,\n",
"TuneGridSearchCV will default to using Tune's ASHAScheduler.\n",
"You can pass in a custom algorithm - see {ref}`Tune's documentation on schedulers <tune-schedulers>`\n",
"here for a full list to choose from.\n",
"``max_iters`` is the maximum number of iterations a given hyperparameter set could run for;\n",
"it may run for fewer iterations if it is early stopped.\n",
"\n",
"Try running this compared to the GridSearchCV equivalent, and see the speedup for yourself!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bad624d5",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"# n_jobs=-1 enables use of all cores like Tune does\n",
"sklearn_search = GridSearchCV(SGDClassifier(), parameter_grid, n_jobs=-1)\n",
"\n",
"start = time.time()\n",
"sklearn_search.fit(x_train, y_train)\n",
"end = time.time()\n",
"print(\"Sklearn Fit Time:\", end - start)\n",
"# Sklearn Fit Time: 47.48055911064148 (for an 8 core laptop)"
]
},
{
"cell_type": "markdown",
"id": "328accb8",
"metadata": {},
"source": [
"## Using Bayesian Optimization\n",
"\n",
"In addition to the grid search interface, tune-sklearn also provides an interface,\n",
"TuneSearchCV, for sampling from **distributions of hyperparameters**.\n",
"In the following example we'll be using the\n",
"[digits dataset from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html)\n",
"\n",
"In addition, you can easily enable Bayesian optimization over the distributions in only 2 lines of code:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "21ccda8d",
"metadata": {},
"outputs": [],
"source": [
"# First run `pip install bayesian-optimization`\n",
"from ray.tune.sklearn import TuneSearchCV\n",
"from sklearn.linear_model import SGDClassifier\n",
"from sklearn import datasets\n",
"from sklearn.model_selection import train_test_split\n",
"import numpy as np\n",
"\n",
"digits = datasets.load_digits()\n",
"x = digits.data\n",
"y = digits.target\n",
"x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)\n",
"\n",
"clf = SGDClassifier()\n",
"parameter_grid = {\"alpha\": (1e-4, 1), \"epsilon\": (0.01, 0.1)}\n",
"\n",
"tune_search = TuneSearchCV(\n",
" clf,\n",
" parameter_grid,\n",
" search_optimization=\"bayesian\",\n",
" n_trials=3,\n",
" early_stopping=True,\n",
" max_iters=10,\n",
")\n",
"tune_search.fit(x_train, y_train)\n",
"print(tune_search.best_params_)\n",
"# {'alpha': 0.37460266483547777, 'epsilon': 0.09556428757689246}"
]
},
{
"cell_type": "markdown",
"id": "0fb1dc0d",
"metadata": {},
"source": [
"As you can see, its very simple to integrate tune-sklearn into existing code.\n",
"Distributed execution is also easy - you can simply run ``ray.init(address=\"auto\")`` before\n",
"TuneSearchCV to connect to the Ray cluster and parallelize tuning across multiple nodes,\n",
"as you would in any other Ray Tune script.\n",
"\n",
"## More Scikit-Learn Examples\n",
"\n",
"See the [ray-project/tune-sklearn examples](https://github.com/ray-project/tune-sklearn/tree/master/examples)\n",
"for a comprehensive list of examples leveraging Tune's sklearn interface.\n",
"Check out more detailed examples and get started with tune-sklearn!\n",
"\n",
"* [Skorch with tune-sklearn](https://github.com/ray-project/tune-sklearn/blob/master/examples/torch_nn.py)\n",
"* [Scikit-Learn Pipelines with tune-sklearn](https://github.com/ray-project/tune-sklearn/blob/master/examples/sklearn_pipeline.py)\n",
"* [XGBoost with tune-sklearn](https://github.com/ray-project/tune-sklearn/blob/master/examples/xgbclassifier.py)\n",
"* [KerasClassifier with tune-sklearn](https://github.com/ray-project/tune-sklearn/blob/master/examples/keras_example.py)\n",
"* [LightGBM with tune-sklearn](https://github.com/ray-project/tune-sklearn/blob/master/examples/lgbm.py)\n",
"\n",
"## Further Reading\n",
"\n",
"If you're using scikit-learn for other tasks, take a look at Rays {ref}`replacement for joblib <ray-joblib>`,\n",
"which allows users to parallelize scikit learn jobs over multiple nodes."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"orphan": true
},
"nbformat": 4,
"nbformat_minor": 5
}