{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "ff4fd80e", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "id": "24024bba", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "# flake8: noqa" ] }, { "cell_type": "markdown", "id": "0694daf8", "metadata": {}, "source": [ "# Tune's Scikit Learn Adapters\n", "\n", "Scikit-Learn is one of the most widely used tools in the ML community for working with data,\n", "offering dozens of easy-to-use machine learning algorithms.\n", "However, to achieve high performance for these algorithms, you often need to perform **model selection**.\n", "\n", "```{image} /images/tune-sklearn.png\n", ":align: center\n", ":width: 50%\n", "```\n", "\n", "Scikit-Learn [has an existing module for model selection](https://scikit-learn.org/stable/modules/grid_search.html),\n", "but the algorithms offered (Grid Search via``GridSearchCV`` and Random Search via``RandomizedSearchCV``)\n", "are often considered inefficient.\n", "In this tutorial, we'll cover ``tune-sklearn``, a drop-in replacement for Scikit-Learn's model selection module\n", "with state-of-the-art optimization features such as early stopping and Bayesian Optimization.\n", "\n", "```{tip} \n", "Check out the [tune-sklearn code](https://github.com/ray-project/tune-sklearn) and {ref}`documentation `.\n", "```\n", "\n", "## Overview\n", "\n", "``tune-sklearn`` is a module that integrates Ray Tune's hyperparameter tuning and scikit-learn's Classifier API. \n", "``tune-sklearn`` has two APIs: {ref}`TuneSearchCV `, and {ref}`TuneGridSearchCV `.\n", "They are drop-in replacements for Scikit-learn's RandomizedSearchCV and GridSearchCV, so you only need to change\n", "less than 5 lines in a standard Scikit-Learn script to use the API.\n", "\n", "Ray Tune's Scikit-learn APIs allows you to easily leverage Bayesian Optimization, HyperBand, and other cutting edge\n", "tuning techniques by simply toggling a few parameters. It also supports and provides examples for many other\n", "frameworks with Scikit-Learn wrappers such as Skorch (Pytorch), KerasClassifiers (Keras), and XGBoostClassifiers (XGBoost).\n", "\n", "Run ``pip install \"ray[tune]\" tune-sklearn`` to get started.\n", "\n", "## Walkthrough\n", "\n", "Let's compare Tune's Scikit-Learn APIs to the standard scikit-learn GridSearchCV. For this example, we'll be using\n", "``TuneGridSearchCV`` with a [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html).\n", "\n", "To start out, change the import statement to get tune-scikit-learn’s grid search cross validation interface:" ] }, { "cell_type": "code", "execution_count": null, "id": "7f6b2190", "metadata": {}, "outputs": [], "source": [ "# Keep this here for https://github.com/ray-project/ray/issues/11547\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "# Replace above line with:\n", "from ray.tune.sklearn import TuneGridSearchCV" ] }, { "cell_type": "markdown", "id": "ab2e6677", "metadata": {}, "source": [ "And from there, we would proceed just like how we would in Scikit-Learn’s interface!\n", "\n", "The `SGDClassifier` has a ``partial_fit`` API, which enables it to stop fitting to the data for a certain hyperparameter configuration.\n", "If the estimator does not support early stopping, we would fall back to a parallel grid search." 
] }, { "cell_type": "code", "execution_count": null, "id": "30320f82", "metadata": {}, "outputs": [], "source": [ "# Other imports\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import SGDClassifier\n", "from sklearn.datasets import make_classification\n", "import numpy as np\n", "\n", "# Create dataset\n", "X, y = make_classification(\n", " n_samples=11000,\n", " n_features=1000,\n", " n_informative=50,\n", " n_redundant=0,\n", " n_classes=10,\n", " class_sep=2.5,\n", ")\n", "x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=1000)\n", "\n", "# Example parameters to tune from SGDClassifier\n", "parameter_grid = {\"alpha\": [1e-4, 1e-1, 1], \"epsilon\": [0.01, 0.1]}" ] }, { "cell_type": "markdown", "id": "abc84e87", "metadata": {}, "source": [ "As you can see, the setup here is exactly how you would do it for Scikit-Learn.\n", "Now, let's try fitting a model." ] }, { "cell_type": "code", "execution_count": null, "id": "4f81facf", "metadata": {}, "outputs": [], "source": [ "tune_search = TuneGridSearchCV(\n", " SGDClassifier(), parameter_grid, early_stopping=True, max_iters=10\n", ")\n", "\n", "import time # Just to compare fit times\n", "\n", "start = time.time()\n", "tune_search.fit(x_train, y_train)\n", "end = time.time()\n", "print(\"Tune GridSearch Fit Time:\", end - start)\n", "# Tune GridSearch Fit Time: 15.436315774917603 (for an 8 core laptop)" ] }, { "cell_type": "markdown", "id": "4c675139", "metadata": {}, "source": [ "Note the slight differences we introduced above:\n", "\n", " * a `early_stopping`, and\n", " * a specification of `max_iters` parameter\n", "\n", "The ``early_stopping`` parameter allows us to terminate unpromising configurations. If ``early_stopping=True``,\n", "TuneGridSearchCV will default to using Tune's ASHAScheduler.\n", "You can pass in a custom algorithm - see {ref}`Tune's documentation on schedulers ` here for a full list to choose from.\n", "``max_iters`` is the maximum number of iterations a given hyperparameter set could run for;\n", "it may run for fewer iterations if it is early stopped.\n", "\n", "Try running this compared to the GridSearchCV equivalent, and see the speedup for yourself!" 
] }, { "cell_type": "code", "execution_count": null, "id": "d43b1ed3", "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "# n_jobs=-1 enables use of all cores like Tune does\n", "sklearn_search = GridSearchCV(SGDClassifier(), parameter_grid, n_jobs=-1)\n", "\n", "start = time.time()\n", "sklearn_search.fit(x_train, y_train)\n", "end = time.time()\n", "print(\"Sklearn Fit Time:\", end - start)\n", "# Sklearn Fit Time: 47.48055911064148 (for an 8 core laptop)" ] }, { "cell_type": "markdown", "id": "f31efdbf", "metadata": {}, "source": [ "## Using Bayesian Optimization\n", "\n", "In addition to the grid search interface, tune-sklearn also provides an interface,\n", "TuneSearchCV, for sampling from **distributions of hyperparameters**.\n", "In the following example we'll be using the [digits dataset from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html)\n", "\n", "In addition, you can easily enable Bayesian optimization over the distributions in only 2 lines of code:" ] }, { "cell_type": "code", "execution_count": null, "id": "5016e858", "metadata": {}, "outputs": [], "source": [ "# First run `pip install bayesian-optimization`\n", "from ray.tune.sklearn import TuneSearchCV\n", "from sklearn.linear_model import SGDClassifier\n", "from sklearn import datasets\n", "from sklearn.model_selection import train_test_split\n", "import numpy as np\n", "\n", "digits = datasets.load_digits()\n", "x = digits.data\n", "y = digits.target\n", "x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)\n", "\n", "clf = SGDClassifier()\n", "parameter_grid = {\"alpha\": (1e-4, 1), \"epsilon\": (0.01, 0.1)}\n", "\n", "tune_search = TuneSearchCV(\n", " clf,\n", " parameter_grid,\n", " search_optimization=\"bayesian\",\n", " n_trials=3,\n", " early_stopping=True,\n", " max_iters=10,\n", ")\n", "tune_search.fit(x_train, y_train)\n", "print(tune_search.best_params_)\n", "# {'alpha': 0.37460266483547777, 'epsilon': 0.09556428757689246}" ] }, { "cell_type": "markdown", "id": "ac30adce", "metadata": {}, "source": [ "As you can see, it’s very simple to integrate tune-sklearn into existing code.\n", "Distributed execution is also easy - you can simply run ``ray.init(address=\"auto\")`` before\n", "TuneSearchCV to connect to the Ray cluster and parallelize tuning across multiple nodes, as you would in any other Ray Tune script.\n", "\n", "## Code Examples\n", "\n", "Check out more detailed examples and get started with tune-sklearn!\n", "\n", "* [Skorch with tune-sklearn](https://github.com/ray-project/tune-sklearn/blob/master/examples/torch_nn.py>)\n", "* [Scikit-Learn Pipelines with tune-sklearn](https://github.com/ray-project/tune-sklearn/blob/master/examples/sklearn_pipeline.py>)\n", "* [XGBoost with tune-sklearn](https://github.com/ray-project/tune-sklearn/blob/master/examples/xgbclassifier.py>)\n", "* [KerasClassifier with tune-sklearn](https://github.com/ray-project/tune-sklearn/blob/master/examples/keras_example.py>)\n", "* [LightGBM with tune-sklearn](https://github.com/ray-project/tune-sklearn/blob/master/examples/lgbm.py>)\n", "\n", "## Further Reading\n", "\n", "If you're using scikit-learn for other tasks, take a look at Ray’s {ref}`replacement for joblib `,\n", "which allows users to parallelize scikit learn jobs over multiple nodes." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }