{
"cells": [
{
"cell_type": "markdown",
"id": "aa6af4d3",
"metadata": {},
"source": [
"# Using PyTorch Lightning with Tune\n",
"\n",
"(tune-pytorch-lightning-ref)=\n",
"\n",
"PyTorch Lightning is a framework which brings structure into training PyTorch models. It\n",
"aims to avoid boilerplate code, so you don't have to write the same training\n",
"loops all over again when building a new model.\n",
"\n",
"```{image} /images/pytorch_lightning_full.png\n",
":align: center\n",
"```\n",
"\n",
"The main abstraction of PyTorch Lightning is the `LightningModule` class, which\n",
"should be extended by your application. There is [a great post on how to transfer your models from vanilla PyTorch to Lightning](https://towardsdatascience.com/from-pytorch-to-pytorch-lightning-a-gentle-introduction-b371b7caaf09).\n",
"\n",
"The class structure of PyTorch Lightning makes it very easy to define and tune model\n",
"parameters. This tutorial will show you how to use Tune to find the best set of\n",
"parameters for your application, using the example of training an MNIST classifier. Notably,\n",
"the `LightningModule` does not have to be altered at all for this - so you can\n",
"use it plug and play with your existing models, assuming their parameters are configurable!\n",
"\n",
":::{note}\n",
"To run this example, you will need to install the following:\n",
"\n",
"```bash\n",
"$ pip install \"ray[tune]\" torch torchvision pytorch-lightning\n",
"```\n",
":::\n",
"\n",
":::{tip}\n",
"If you want distributed PyTorch Lightning training on Ray in addition to hyperparameter tuning with Tune,\n",
"check out the [Ray Lightning Library](https://github.com/ray-project/ray_lightning).\n",
":::\n",
"\n",
"```{contents}\n",
":backlinks: none\n",
":local: true\n",
"```\n",
"\n",
"## PyTorch Lightning classifier for MNIST\n",
"\n",
"Let's first start with the basic PyTorch Lightning implementation of an MNIST classifier.\n",
"This classifier does not include any tuning code at this point.\n",
"\n",
"Our example builds on the MNIST example from the [blog post we talked about\n",
"earlier](https://towardsdatascience.com/from-pytorch-to-pytorch-lightning-a-gentle-introduction-b371b7caaf09).\n",
"\n",
"First, we run some imports:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e6e77570",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"import os\n",
"\n",
"import torch\n",
"import pytorch_lightning as pl\n",
"from filelock import FileLock\n",
"from torch.utils.data import DataLoader, random_split\n",
"from torch.nn import functional as F\n",
"from torchvision.datasets import MNIST\n",
"from torchvision import transforms"
]
},
{
"cell_type": "markdown",
"id": "3c442e73",
"metadata": {},
"source": [
"And then there is the Lightning model adapted from the blog post.\n",
"Note that we left out the test set validation and made the model parameters\n",
"configurable through a `config` dict that is passed on initialization.\n",
"Also, we specify a `data_dir` where the MNIST data will be stored. Note that\n",
"we use a `FileLock` for downloading data so that the dataset is only downloaded\n",
"once per node.\n",
"Lastly, we added a new metric, the validation accuracy, to the logs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "48b20f48",
"metadata": {},
"outputs": [],
"source": [
"class LightningMNISTClassifier(pl.LightningModule):\n",
"    \"\"\"\n",
"    This has been adapted from\n",
"    https://towardsdatascience.com/from-pytorch-to-pytorch-lightning-a-gentle-introduction-b371b7caaf09\n",
"    \"\"\"\n",
"\n",
"    def __init__(self, config, data_dir=None):\n",
"        super(LightningMNISTClassifier, self).__init__()\n",
"\n",
"        self.data_dir = data_dir or os.getcwd()\n",
"\n",
"        self.layer_1_size = config[\"layer_1_size\"]\n",
"        self.layer_2_size = config[\"layer_2_size\"]\n",
"        self.lr = config[\"lr\"]\n",
"        self.batch_size = config[\"batch_size\"]\n",
"\n",
"        # mnist images are (1, 28, 28) (channels, width, height)\n",
"        self.layer_1 = torch.nn.Linear(28 * 28, self.layer_1_size)\n",
"        self.layer_2 = torch.nn.Linear(self.layer_1_size, self.layer_2_size)\n",
"        self.layer_3 = torch.nn.Linear(self.layer_2_size, 10)\n",
"\n",
"    def forward(self, x):\n",
"        batch_size, channels, width, height = x.size()\n",
"        x = x.view(batch_size, -1)\n",
"\n",
"        x = self.layer_1(x)\n",
"        x = torch.relu(x)\n",
"\n",
"        x = self.layer_2(x)\n",
"        x = torch.relu(x)\n",
"\n",
"        x = self.layer_3(x)\n",
"        x = torch.log_softmax(x, dim=1)\n",
"\n",
"        return x\n",
"\n",
"    def cross_entropy_loss(self, logits, labels):\n",
"        return F.nll_loss(logits, labels)\n",
"\n",
"    def accuracy(self, logits, labels):\n",
"        _, predicted = torch.max(logits.data, 1)\n",
"        correct = (predicted == labels).sum().item()\n",
"        accuracy = correct / len(labels)\n",
"        return torch.tensor(accuracy)\n",
"\n",
"    def training_step(self, train_batch, batch_idx):\n",
"        x, y = train_batch\n",
"        logits = self.forward(x)\n",
"        loss = self.cross_entropy_loss(logits, y)\n",
"        accuracy = self.accuracy(logits, y)\n",
"\n",
"        self.log(\"ptl/train_loss\", loss)\n",
"        self.log(\"ptl/train_accuracy\", accuracy)\n",
"        return loss\n",
"\n",
"    def validation_step(self, val_batch, batch_idx):\n",
"        x, y = val_batch\n",
"        logits = self.forward(x)\n",
"        loss = self.cross_entropy_loss(logits, y)\n",
"        accuracy = self.accuracy(logits, y)\n",
"        return {\"val_loss\": loss, \"val_accuracy\": accuracy}\n",
"\n",
"    def validation_epoch_end(self, outputs):\n",
"        avg_loss = torch.stack([x[\"val_loss\"] for x in outputs]).mean()\n",
"        avg_acc = torch.stack([x[\"val_accuracy\"] for x in outputs]).mean()\n",
"        self.log(\"ptl/val_loss\", avg_loss)\n",
"        self.log(\"ptl/val_accuracy\", avg_acc)\n",
"\n",
"    @staticmethod\n",
"    def download_data(data_dir):\n",
"        transform = transforms.Compose([\n",
"            transforms.ToTensor(),\n",
"            transforms.Normalize((0.1307, ), (0.3081, ))\n",
"        ])\n",
"        with FileLock(os.path.expanduser(\"~/.data.lock\")):\n",
"            return MNIST(data_dir, train=True, download=True, transform=transform)\n",
"\n",
"    def prepare_data(self):\n",
"        mnist_train = self.download_data(self.data_dir)\n",
"\n",
"        self.mnist_train, self.mnist_val = random_split(\n",
"            mnist_train, [55000, 5000])\n",
"\n",
"    def train_dataloader(self):\n",
"        return DataLoader(self.mnist_train, batch_size=int(self.batch_size))\n",
"\n",
"    def val_dataloader(self):\n",
"        return DataLoader(self.mnist_val, batch_size=int(self.batch_size))\n",
"\n",
"    def configure_optimizers(self):\n",
"        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)\n",
"        return optimizer\n",
"\n",
"\n",
"def train_mnist(config):\n",
"    model = LightningMNISTClassifier(config)\n",
"    trainer = pl.Trainer(max_epochs=10, enable_progress_bar=False)\n",
"\n",
"    trainer.fit(model)"
]
},
{
"cell_type": "markdown",
"id": "da1c3632",
"metadata": {},
"source": [
"And that's it! You can now run `train_mnist(config)` to train the classifier, e.g.\n",
"like so:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "86df3d39",
"metadata": {},
"outputs": [],
"source": [
"def train_mnist_no_tune():\n",
"    config = {\n",
"        \"layer_1_size\": 128,\n",
"        \"layer_2_size\": 256,\n",
"        \"lr\": 1e-3,\n",
"        \"batch_size\": 64\n",
"    }\n",
"    train_mnist(config)"
]
},
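{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, calling the wrapper once trains the classifier with this fixed configuration\n",
"(no tuning involved yet):\n",
"\n",
"```python\n",
"# Runs a single 10-epoch training with the hard-coded hyperparameters above.\n",
"train_mnist_no_tune()\n",
"```"
]
},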
{
"cell_type": "markdown",
"id": "edcc0991",
"metadata": {},
"source": [
"## Tuning the model parameters\n",
"\n",
"The parameters above should already give you a good accuracy of over 90%. However,\n",
"we might improve on this simply by changing some of the hyperparameters. For instance,\n",
"perhaps we would get an even higher accuracy if we used a larger batch size.\n",
"\n",
"Instead of guessing the parameter values, let's use Tune to systematically try out\n",
"parameter combinations and find the best performing set.\n",
"\n",
"First, we need some additional imports:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "34faeb3b",
"metadata": {},
"outputs": [],
"source": [
"from pytorch_lightning.loggers import TensorBoardLogger\n",
"from ray import tune\n",
"from ray.tune import CLIReporter\n",
"from ray.tune.schedulers import ASHAScheduler, PopulationBasedTraining\n",
"from ray.tune.integration.pytorch_lightning import TuneReportCallback, \\\n",
"    TuneReportCheckpointCallback"
]
},
{
"cell_type": "markdown",
"id": "f65b9c5f",
"metadata": {},
"source": [
"### Talking to Tune with a PyTorch Lightning callback\n",
"\n",
"PyTorch Lightning introduced [Callbacks](https://pytorch-lightning.readthedocs.io/en/latest/extensions/callbacks.html)\n",
"that can be used to plug custom functions into the training loop. This way the original\n",
"`LightningModule` does not have to be altered at all. Also, we could use the same\n",
"callback for multiple modules.\n",
"\n",
"Ray Tune comes with ready-to-use PyTorch Lightning callbacks. To report metrics\n",
"back to Tune after each validation epoch, we will use the `TuneReportCallback`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4bab80bc",
"metadata": {},
"outputs": [],
"source": [
"TuneReportCallback(\n",
"    {\n",
"        \"loss\": \"ptl/val_loss\",\n",
"        \"mean_accuracy\": \"ptl/val_accuracy\"\n",
"    },\n",
"    on=\"validation_end\")"
]
},
{
"cell_type": "markdown",
"id": "286a1070",
"metadata": {},
"source": [
"This callback will take the `val_loss` and `val_accuracy` values\n",
"from the PyTorch Lightning trainer and report them to Tune as the `loss`\n",
"and `mean_accuracy`, respectively.\n",
"\n",
"### Adding the Tune training function\n",
"\n",
"Then we specify our training function. Note that we added the `data_dir` as a\n",
"parameter here so that each training run does not have to download the full MNIST\n",
"dataset. Instead, we want to access a shared data location.\n",
"\n",
"We are also able to specify the number of epochs to train each model, and the number\n",
"of GPUs we want to use for training. We also create a TensorBoard logger that writes\n",
"logfiles directly into Tune's root trial directory - if we didn't do that, PyTorch\n",
"Lightning would create subdirectories, and each trial would thus show up twice in\n",
"TensorBoard: once for Tune's logs and once for PyTorch Lightning's logs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "74e7d1c2",
"metadata": {},
"outputs": [],
"source": [
"def train_mnist_tune(config, num_epochs=10, num_gpus=0, data_dir=\"~/data\"):\n",
"    data_dir = os.path.expanduser(data_dir)\n",
"    model = LightningMNISTClassifier(config, data_dir)\n",
"    trainer = pl.Trainer(\n",
"        max_epochs=num_epochs,\n",
"        # If fractional GPUs passed in, convert to int.\n",
"        gpus=math.ceil(num_gpus),\n",
"        logger=TensorBoardLogger(\n",
"            save_dir=tune.get_trial_dir(), name=\"\", version=\".\"),\n",
"        enable_progress_bar=False,\n",
"        callbacks=[\n",
"            TuneReportCallback(\n",
"                {\n",
"                    \"loss\": \"ptl/val_loss\",\n",
"                    \"mean_accuracy\": \"ptl/val_accuracy\"\n",
"                },\n",
"                on=\"validation_end\")\n",
"        ])\n",
"    trainer.fit(model)"
]
},
{
"cell_type": "markdown",
"id": "cf0f6d6e",
"metadata": {},
"source": [
"### Configuring the search space\n",
"\n",
"Now we configure the parameter search space. We would like to choose between three\n",
"different sizes for each layer and three different batch sizes. The learning rate should be\n",
"sampled uniformly on a log scale between `0.0001` and `0.1`. The `tune.loguniform()` function\n",
"is syntactic sugar that makes sampling across these different orders of magnitude easier,\n",
"so small values are sampled just as readily as large ones."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a50645e9",
"metadata": {},
"outputs": [],
"source": [
"config = {\n",
"    \"layer_1_size\": tune.choice([32, 64, 128]),\n",
"    \"layer_2_size\": tune.choice([64, 128, 256]),\n",
"    \"lr\": tune.loguniform(1e-4, 1e-1),\n",
"    \"batch_size\": tune.choice([32, 64, 128]),\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "b1fb9ecd",
"metadata": {},
"source": [
"### Selecting a scheduler\n",
"\n",
"In this example, we use an [Asynchronous Hyperband](https://blog.ml.cmu.edu/2018/12/12/massively-parallel-hyperparameter-optimization/)\n",
"scheduler. This scheduler decides at each iteration which trials are likely to perform\n",
"badly, and stops these trials. This way we don't waste any resources on bad hyperparameter\n",
"configurations."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a2596b01",
"metadata": {},
"outputs": [],
"source": [
"num_epochs = 10\n",
"\n",
"scheduler = ASHAScheduler(\n",
"    max_t=num_epochs,\n",
"    grace_period=1,\n",
"    reduction_factor=2)"
]
},
{
"cell_type": "markdown",
"id": "9a49ae58",
"metadata": {},
"source": [
"### Changing the CLI output\n",
"\n",
"We instantiate a `CLIReporter` to specify which metrics we would like to see in our\n",
"output tables on the command line. This is optional, but it ensures that the output\n",
"tables only include the information we actually care about."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd605a16",
"metadata": {},
"outputs": [],
"source": [
"reporter = CLIReporter(\n",
"    parameter_columns=[\"layer_1_size\", \"layer_2_size\", \"lr\", \"batch_size\"],\n",
"    metric_columns=[\"loss\", \"mean_accuracy\", \"training_iteration\"])"
]
},
{
"cell_type": "markdown",
"id": "5ec9a305",
"metadata": {},
"source": [
"### Passing constants to the train function\n",
"\n",
"The `data_dir`, `num_epochs` and `num_gpus` we pass to the training function\n",
"are constants. To avoid including them as non-configurable parameters in the `config`\n",
"specification, we can use `tune.with_parameters` to wrap the training function."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "332668dc",
"metadata": {},
"outputs": [],
"source": [
"gpus_per_trial = 0\n",
"data_dir = \"~/data\"\n",
"\n",
"train_fn_with_parameters = tune.with_parameters(train_mnist_tune,\n",
"                                                num_epochs=num_epochs,\n",
"                                                num_gpus=gpus_per_trial,\n",
"                                                data_dir=data_dir)"
]
},
{
"cell_type": "markdown",
"id": "feef8c39",
"metadata": {},
"source": [
"### Training with GPUs\n",
"\n",
"We can specify how many resources Tune should request for each trial.\n",
"This also includes GPUs.\n",
"\n",
"PyTorch Lightning takes care of moving the training to the GPUs. We\n",
"already made sure that our code is compatible with that, so there's\n",
"nothing more to do here other than to specify the number of GPUs\n",
"we would like to use:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dc402716",
"metadata": {},
"outputs": [],
"source": [
"resources_per_trial = {\"cpu\": 1, \"gpu\": gpus_per_trial}"
]
},
{
"cell_type": "markdown",
"id": "ca050dfa",
"metadata": {},
"source": [
"You can also specify {doc}`fractional GPUs for Tune <../../ray-core/tasks/using-ray-with-gpus>`,\n",
"allowing multiple trials to share GPUs and thus increase concurrency under resource constraints.\n",
"While the `gpus_per_trial` passed into\n",
"Tune is a decimal value, the `gpus` passed into the `pl.Trainer` should still be an integer.\n",
"Please note that if you use fractional GPUs, it is your responsibility to\n",
"make sure that multiple trials can share a GPU and that there is enough GPU memory to do so.\n",
"Ray does not automatically handle this for you.\n",
"\n",
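"As a rough sketch - assuming, purely for illustration, that two trials fit comfortably on one\n",
"GPU - the relevant pieces could look like this (`gpus_for_trainer` is just an illustrative name):\n",
"\n",
"```python\n",
"import math\n",
"\n",
"gpus_per_trial = 0.5  # two concurrent trials share a single GPU\n",
"\n",
"# What Tune requests for each trial (fractional values are fine here):\n",
"resources_per_trial = {\"cpu\": 1, \"gpu\": gpus_per_trial}\n",
"\n",
"# What the training function passes to pl.Trainer (must be an integer, so round up):\n",
"gpus_for_trainer = math.ceil(gpus_per_trial)  # -> 1\n",
"```\n",
"\n",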
"If you want to use multiple GPUs per trial, you should check out the\n",
|
|
"[Ray Lightning Library](https://github.com/ray-project/ray_lightning).\n",
|
|
"This library makes it easy to run multiple concurrent trials with Ray Tune, with each trial also running\n",
|
|
"in a distributed fashion using Ray.\n",
|
|
"\n",
|
|
"### Putting it together\n",
|
|
"\n",
|
|
"Lastly, we need to start Tune with `tune.run()`.\n",
|
|
"\n",
|
|
"The full code looks like this:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "ea182330",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"def tune_mnist_asha(num_samples=10, num_epochs=10, gpus_per_trial=0, data_dir=\"~/data\"):\n",
|
|
" config = {\n",
|
|
" \"layer_1_size\": tune.choice([32, 64, 128]),\n",
|
|
" \"layer_2_size\": tune.choice([64, 128, 256]),\n",
|
|
" \"lr\": tune.loguniform(1e-4, 1e-1),\n",
|
|
" \"batch_size\": tune.choice([32, 64, 128]),\n",
|
|
" }\n",
|
|
"\n",
|
|
" scheduler = ASHAScheduler(\n",
|
|
" max_t=num_epochs,\n",
|
|
" grace_period=1,\n",
|
|
" reduction_factor=2)\n",
|
|
"\n",
|
|
" reporter = CLIReporter(\n",
|
|
" parameter_columns=[\"layer_1_size\", \"layer_2_size\", \"lr\", \"batch_size\"],\n",
|
|
" metric_columns=[\"loss\", \"mean_accuracy\", \"training_iteration\"])\n",
|
|
"\n",
|
|
" train_fn_with_parameters = tune.with_parameters(train_mnist_tune,\n",
|
|
" num_epochs=num_epochs,\n",
|
|
" num_gpus=gpus_per_trial,\n",
|
|
" data_dir=data_dir)\n",
|
|
" resources_per_trial = {\"cpu\": 1, \"gpu\": gpus_per_trial}\n",
|
|
"\n",
|
|
" analysis = tune.run(train_fn_with_parameters,\n",
|
|
" resources_per_trial=resources_per_trial,\n",
|
|
" metric=\"loss\",\n",
|
|
" mode=\"min\",\n",
|
|
" config=config,\n",
|
|
" num_samples=num_samples,\n",
|
|
" scheduler=scheduler,\n",
|
|
" progress_reporter=reporter,\n",
|
|
" name=\"tune_mnist_asha\")\n",
|
|
"\n",
|
|
" print(\"Best hyperparameters found were: \", analysis.best_config)"
|
|
]
|
|
},
|
|
{
"cell_type": "markdown",
"id": "1fb96b6c",
"metadata": {},
"source": [
"In the example above, Tune runs 10 trials with different hyperparameter configurations.\n",
"An example output could look like this:\n",
"\n",
"```{code-block} bash\n",
":emphasize-lines: 12\n",
"\n",
" +------------------------------+------------+-----+--------------+--------------+-------------+------------+----------+---------------+--------------------+\n",
" | Trial name                   | status     | loc | layer_1_size | layer_2_size |          lr | batch_size |     loss | mean_accuracy | training_iteration |\n",
" |------------------------------+------------+-----+--------------+--------------+-------------+------------+----------+---------------+--------------------|\n",
" | train_mnist_tune_63ecc_00000 | TERMINATED |     |          128 |           64 |  0.00121197 |        128 | 0.120173 |      0.972461 |                 10 |\n",
" | train_mnist_tune_63ecc_00001 | TERMINATED |     |           64 |          128 |   0.0301395 |        128 | 0.454836 |      0.868164 |                  4 |\n",
" | train_mnist_tune_63ecc_00002 | TERMINATED |     |           64 |          128 |   0.0432097 |        128 | 0.718396 |      0.718359 |                  1 |\n",
" | train_mnist_tune_63ecc_00003 | TERMINATED |     |           32 |          128 | 0.000294669 |         32 | 0.111475 |      0.965764 |                 10 |\n",
" | train_mnist_tune_63ecc_00004 | TERMINATED |     |           32 |          256 | 0.000386664 |         64 | 0.133538 |      0.960839 |                  8 |\n",
" | train_mnist_tune_63ecc_00005 | TERMINATED |     |          128 |          128 |   0.0837395 |         32 |  2.32628 |     0.0991242 |                  1 |\n",
" | train_mnist_tune_63ecc_00006 | TERMINATED |     |           64 |          128 | 0.000158761 |        128 | 0.134595 |      0.959766 |                 10 |\n",
" | train_mnist_tune_63ecc_00007 | TERMINATED |     |           64 |           64 | 0.000672126 |         64 | 0.118182 |      0.972903 |                 10 |\n",
" | train_mnist_tune_63ecc_00008 | TERMINATED |     |          128 |           64 | 0.000502428 |         32 |  0.11082 |      0.975518 |                 10 |\n",
" | train_mnist_tune_63ecc_00009 | TERMINATED |     |           64 |          256 |  0.00112894 |         32 |  0.13472 |      0.971935 |                  8 |\n",
" +------------------------------+------------+-----+--------------+--------------+-------------+------------+----------+---------------+--------------------+\n",
"```\n",
"\n",
"As you can see in the `training_iteration` column, trials with a high loss\n",
"(and low accuracy) have been terminated early. The best performing trial used\n",
"`layer_1_size=128`, `layer_2_size=64`, `lr=0.000502428` and\n",
"`batch_size=32`.\n",
"\n",
"## Using Population Based Training to find the best parameters\n",
"\n",
"The `ASHAScheduler` terminates trials that show bad performance early.\n",
"Sometimes, this stops trials that would get better after more training steps\n",
"and that might eventually even outperform other configurations.\n",
"\n",
"Another popular method for hyperparameter tuning, called\n",
"[Population Based Training](https://deepmind.com/blog/article/population-based-training-neural-networks),\n",
"instead perturbs hyperparameters during the training run. Tune implements PBT, and\n",
"we only need to make some slight adjustments to our code.\n",
"\n",
"### Adding checkpoints to the PyTorch Lightning module\n",
"\n",
"First, we need to introduce\n",
"another callback to save model checkpoints. Since Tune requires a call to\n",
"`tune.report()` after creating a new checkpoint to register it, we will use\n",
"a combined reporting and checkpointing callback:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f86e4d8",
"metadata": {},
"outputs": [],
"source": [
"TuneReportCheckpointCallback(\n",
"    metrics={\n",
"        \"loss\": \"ptl/val_loss\",\n",
"        \"mean_accuracy\": \"ptl/val_accuracy\"\n",
"    },\n",
"    filename=\"checkpoint\",\n",
"    on=\"validation_end\")"
]
},
{
"cell_type": "markdown",
"id": "33a76d5b",
"metadata": {},
"source": [
"The `checkpoint` value is the name of the checkpoint file within the\n",
"checkpoint directory.\n",
"\n",
"We also include checkpoint loading in our training function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "746e962a",
"metadata": {},
"outputs": [],
"source": [
"def train_mnist_tune_checkpoint(config,\n",
"                                checkpoint_dir=None,\n",
"                                num_epochs=10,\n",
"                                num_gpus=0,\n",
"                                data_dir=\"~/data\"):\n",
"    data_dir = os.path.expanduser(data_dir)\n",
"    kwargs = {\n",
"        \"max_epochs\": num_epochs,\n",
"        # If fractional GPUs passed in, convert to int.\n",
"        \"gpus\": math.ceil(num_gpus),\n",
"        \"logger\": TensorBoardLogger(\n",
"            save_dir=tune.get_trial_dir(), name=\"\", version=\".\"),\n",
"        \"enable_progress_bar\": False,\n",
"        \"callbacks\": [\n",
"            TuneReportCheckpointCallback(\n",
"                metrics={\n",
"                    \"loss\": \"ptl/val_loss\",\n",
"                    \"mean_accuracy\": \"ptl/val_accuracy\"\n",
"                },\n",
"                filename=\"checkpoint\",\n",
"                on=\"validation_end\")\n",
"        ]\n",
"    }\n",
"\n",
"    if checkpoint_dir:\n",
"        kwargs[\"resume_from_checkpoint\"] = os.path.join(\n",
"            checkpoint_dir, \"checkpoint\")\n",
"\n",
"    model = LightningMNISTClassifier(config=config, data_dir=data_dir)\n",
"    trainer = pl.Trainer(**kwargs)\n",
"\n",
"    trainer.fit(model)"
]
},
{
"cell_type": "markdown",
"id": "39dc7b46",
"metadata": {},
"source": [
"### Configuring and running Population Based Training\n",
"\n",
"We need to call Tune slightly differently:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e12a1bd5",
"metadata": {},
"outputs": [],
"source": [
"def tune_mnist_pbt(num_samples=10, num_epochs=10, gpus_per_trial=0, data_dir=\"~/data\"):\n",
"    config = {\n",
"        \"layer_1_size\": tune.choice([32, 64, 128]),\n",
"        \"layer_2_size\": tune.choice([64, 128, 256]),\n",
"        \"lr\": 1e-3,\n",
"        \"batch_size\": 64,\n",
"    }\n",
"\n",
"    scheduler = PopulationBasedTraining(\n",
"        perturbation_interval=4,\n",
"        hyperparam_mutations={\n",
"            \"lr\": tune.loguniform(1e-4, 1e-1),\n",
"            \"batch_size\": [32, 64, 128]\n",
"        })\n",
"\n",
"    reporter = CLIReporter(\n",
"        parameter_columns=[\"layer_1_size\", \"layer_2_size\", \"lr\", \"batch_size\"],\n",
"        metric_columns=[\"loss\", \"mean_accuracy\", \"training_iteration\"])\n",
"\n",
"    analysis = tune.run(\n",
"        tune.with_parameters(\n",
"            train_mnist_tune_checkpoint,\n",
"            num_epochs=num_epochs,\n",
"            num_gpus=gpus_per_trial,\n",
"            data_dir=data_dir),\n",
"        resources_per_trial={\n",
"            \"cpu\": 1,\n",
"            \"gpu\": gpus_per_trial\n",
"        },\n",
"        metric=\"loss\",\n",
"        mode=\"min\",\n",
"        config=config,\n",
"        num_samples=num_samples,\n",
"        scheduler=scheduler,\n",
"        progress_reporter=reporter,\n",
"        name=\"tune_mnist_pbt\")\n",
"\n",
"    print(\"Best hyperparameters found were: \", analysis.best_config)\n"
]
},
{
"cell_type": "markdown",
"id": "6087f807",
"metadata": {},
"source": [
"Instead of passing tune parameters to the `config` dict, we start\n",
"with fixed values, though we are also able to sample some of them, like the\n",
"layer sizes. Additionally, we have to tell PBT how to perturb the hyperparameters.\n",
"Note that the layer sizes are not perturbed here. This is because we cannot simply\n",
"change layer sizes during a training run - which is what PBT would do.\n",
"\n",
"To try out both of our main training routines (`tune_mnist_asha` and `tune_mnist_pbt`), all you have to do is specify\n",
"a `data_dir` folder and run them with reasonable parameters:"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"data_dir = \"~/data/\"\n",
"\n",
"tune_mnist_asha(num_samples=1, num_epochs=6, gpus_per_trial=0, data_dir=data_dir)\n",
"tune_mnist_pbt(num_samples=1, num_epochs=6, gpus_per_trial=0, data_dir=data_dir)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"If you have more resources available (e.g. a GPU), you can modify the above parameters accordingly.\n",
"\n",
"An example output of a run could look like this:\n",
"\n",
"```bash\n",
"+-----------------------------------------+------------+-----+--------------+--------------+-----------+------------+-----------+---------------+--------------------+\n",
"| Trial name                              | status     | loc | layer_1_size | layer_2_size |        lr | batch_size |      loss | mean_accuracy | training_iteration |\n",
"|-----------------------------------------+------------+-----+--------------+--------------+-----------+------------+-----------+---------------+--------------------|\n",
"| train_mnist_tune_checkpoint_85489_00000 | TERMINATED |     |          128 |          128 |     0.001 |         64 |  0.108734 |      0.973101 |                 10 |\n",
"| train_mnist_tune_checkpoint_85489_00001 | TERMINATED |     |          128 |          128 |     0.001 |         64 |  0.093577 |      0.978639 |                 10 |\n",
"| train_mnist_tune_checkpoint_85489_00002 | TERMINATED |     |          128 |          256 |    0.0008 |         32 | 0.0922348 |      0.979299 |                 10 |\n",
"| train_mnist_tune_checkpoint_85489_00003 | TERMINATED |     |           64 |          256 |     0.001 |         64 |  0.124648 |      0.973892 |                 10 |\n",
"| train_mnist_tune_checkpoint_85489_00004 | TERMINATED |     |          128 |           64 |     0.001 |         64 |  0.101717 |      0.975079 |                 10 |\n",
"| train_mnist_tune_checkpoint_85489_00005 | TERMINATED |     |           64 |           64 |     0.001 |         64 |  0.121467 |      0.969146 |                 10 |\n",
"| train_mnist_tune_checkpoint_85489_00006 | TERMINATED |     |          128 |          256 |   0.00064 |         32 |  0.053446 |      0.987062 |                 10 |\n",
"| train_mnist_tune_checkpoint_85489_00007 | TERMINATED |     |          128 |          256 |     0.001 |         64 |  0.129804 |      0.973497 |                 10 |\n",
"| train_mnist_tune_checkpoint_85489_00008 | TERMINATED |     |           64 |          256 | 0.0285125 |        128 |  0.363236 |      0.913867 |                 10 |\n",
"| train_mnist_tune_checkpoint_85489_00009 | TERMINATED |     |           32 |          256 |     0.001 |         64 |  0.150946 |      0.964201 |                 10 |\n",
"+-----------------------------------------+------------+-----+--------------+--------------+-----------+------------+-----------+---------------+--------------------+\n",
"```\n",
"\n",
"As you can see, each sample ran the full number of 10 iterations.\n",
"All trials ended with quite good parameter combinations and performed relatively well.\n",
"In some runs, the parameters were perturbed. And the best configuration even reached a\n",
"mean validation accuracy of `0.987062`!\n",
"\n",
"In summary, PyTorch Lightning Modules are easy to extend to use with Tune. It took\n",
"nothing more than importing one or two callbacks and a small wrapper function to find\n",
"well-performing parameter configurations.\n",
"\n",
"## More PyTorch Lightning Examples\n",
"\n",
"- {doc}`/tune/examples/includes/mnist_ptl_mini`:\n",
"  A minimal example of using [Pytorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning)\n",
"  to train an MNIST model. This example utilizes the Ray Tune-provided\n",
"  {ref}`PyTorch Lightning callbacks <tune-integration-pytorch-lightning>`.\n",
"  See also {ref}`this tutorial for a full walkthrough <tune-pytorch-lightning-ref>`.\n",
"- {ref}`A walkthrough tutorial for using Ray Tune with Pytorch-Lightning <tune-pytorch-lightning-ref>`.\n",
"- {doc}`/tune/examples/includes/mlflow_ptl_example`: Example for using [MLflow](https://github.com/mlflow/mlflow/)\n",
"  and [Pytorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning) with Ray Tune."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"orphan": true
},
"nbformat": 4,
"nbformat_minor": 5
}