{
"cells": [
{
"cell_type": "markdown",
"id": "aa6af4d3",
"metadata": {},
"source": [
"# Using PyTorch Lightning with Tune\n",
"\n",
"(tune-pytorch-lightning-ref)=\n",
"\n",
"PyTorch Lightning is a framework that brings structure to training PyTorch models. It\n",
"aims to avoid boilerplate code, so you don't have to write the same training\n",
"loops all over again when building a new model.\n",
"\n",
"```{image} /images/pytorch_lightning_full.png\n",
":align: center\n",
"```\n",
"\n",
"The main abstraction of PyTorch Lightning is the `LightningModule` class, which\n",
"should be extended by your application. There is [a great post on how to transfer your models from vanilla PyTorch to Lightning](https://towardsdatascience.com/from-pytorch-to-pytorch-lightning-a-gentle-introduction-b371b7caaf09).\n",
"\n",
"The class structure of PyTorch Lightning makes it very easy to define and tune model\n",
"parameters. This tutorial will show you how to use Tune to find the best set of\n",
"parameters for your application, using the training of an MNIST classifier as an example. Notably,\n",
"the `LightningModule` does not have to be altered at all for this - so you can\n",
"use it plug-and-play with your existing models, assuming their parameters are configurable!\n",
"\n",
":::{note}\n",
"To run this example, you will need to install the following:\n",
"\n",
"```bash\n",
"$ pip install \"ray[tune]\" torch torchvision pytorch-lightning\n",
"```\n",
":::\n",
"\n",
":::{tip}\n",
"If you want distributed PyTorch Lightning training on Ray in addition to hyperparameter tuning with Tune,\n",
"check out the [Ray Lightning Library](https://github.com/ray-project/ray_lightning).\n",
":::\n",
"\n",
"```{contents}\n",
":backlinks: none\n",
":local: true\n",
"```\n",
"\n",
"## PyTorch Lightning classifier for MNIST\n",
"\n",
"Let's first start with the basic PyTorch Lightning implementation of an MNIST classifier.\n",
"This classifier does not include any tuning code at this point.\n",
"\n",
"Our example builds on the MNIST example from the [blog post we talked about\n",
"earlier](https://towardsdatascience.com/from-pytorch-to-pytorch-lightning-a-gentle-introduction-b371b7caaf09).\n",
"\n",
"First, we run some imports:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "e6e77570",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"import os\n",
"\n",
"import torch\n",
"import pytorch_lightning as pl\n",
"from filelock import FileLock\n",
"from torch.utils.data import DataLoader, random_split\n",
"from torch.nn import functional as F\n",
"from torchvision.datasets import MNIST\n",
"from torchvision import transforms"
]
},
{
"cell_type": "markdown",
"id": "3c442e73",
"metadata": {},
"source": [
"And then there is the Lightning model adapted from the blog post.\n",
"Note that we left out the test set validation and made the model parameters\n",
"configurable through a `config` dict that is passed on initialization.\n",
"Also, we specify a `data_dir` where the MNIST data will be stored. We\n",
"use a `FileLock` for downloading data so that the dataset is only downloaded\n",
"once per node.\n",
"Lastly, we added a new metric, the validation accuracy, to the logs."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "48b20f48",
"metadata": {},
"outputs": [],
"source": [
"class LightningMNISTClassifier(pl.LightningModule):\n",
"    \"\"\"\n",
"    This has been adapted from\n",
"    https://towardsdatascience.com/from-pytorch-to-pytorch-lightning-a-gentle-introduction-b371b7caaf09\n",
"    \"\"\"\n",
"\n",
"    def __init__(self, config, data_dir=None):\n",
"        super(LightningMNISTClassifier, self).__init__()\n",
"\n",
"        self.data_dir = data_dir or os.getcwd()\n",
"\n",
"        self.layer_1_size = config[\"layer_1_size\"]\n",
"        self.layer_2_size = config[\"layer_2_size\"]\n",
"        self.lr = config[\"lr\"]\n",
"        self.batch_size = config[\"batch_size\"]\n",
"\n",
"        # mnist images are (1, 28, 28) (channels, width, height)\n",
"        self.layer_1 = torch.nn.Linear(28 * 28, self.layer_1_size)\n",
"        self.layer_2 = torch.nn.Linear(self.layer_1_size, self.layer_2_size)\n",
"        self.layer_3 = torch.nn.Linear(self.layer_2_size, 10)\n",
"\n",
"    def forward(self, x):\n",
"        batch_size, channels, width, height = x.size()\n",
"        x = x.view(batch_size, -1)\n",
"\n",
"        x = self.layer_1(x)\n",
"        x = torch.relu(x)\n",
"\n",
"        x = self.layer_2(x)\n",
"        x = torch.relu(x)\n",
"\n",
"        x = self.layer_3(x)\n",
"        x = torch.log_softmax(x, dim=1)\n",
"\n",
"        return x\n",
"\n",
"    def cross_entropy_loss(self, logits, labels):\n",
"        return F.nll_loss(logits, labels)\n",
"\n",
"    def accuracy(self, logits, labels):\n",
"        _, predicted = torch.max(logits.data, 1)\n",
"        correct = (predicted == labels).sum().item()\n",
"        accuracy = correct / len(labels)\n",
"        return torch.tensor(accuracy)\n",
"\n",
"    def training_step(self, train_batch, batch_idx):\n",
"        x, y = train_batch\n",
"        logits = self.forward(x)\n",
"        loss = self.cross_entropy_loss(logits, y)\n",
"        accuracy = self.accuracy(logits, y)\n",
"\n",
"        self.log(\"ptl/train_loss\", loss)\n",
"        self.log(\"ptl/train_accuracy\", accuracy)\n",
"        return loss\n",
"\n",
"    def validation_step(self, val_batch, batch_idx):\n",
"        x, y = val_batch\n",
"        logits = self.forward(x)\n",
"        loss = self.cross_entropy_loss(logits, y)\n",
"        accuracy = self.accuracy(logits, y)\n",
"        return {\"val_loss\": loss, \"val_accuracy\": accuracy}\n",
"\n",
"    def validation_epoch_end(self, outputs):\n",
"        avg_loss = torch.stack([x[\"val_loss\"] for x in outputs]).mean()\n",
"        avg_acc = torch.stack([x[\"val_accuracy\"] for x in outputs]).mean()\n",
"        self.log(\"ptl/val_loss\", avg_loss)\n",
"        self.log(\"ptl/val_accuracy\", avg_acc)\n",
"\n",
"    @staticmethod\n",
"    def download_data(data_dir):\n",
"        transform = transforms.Compose([\n",
"            transforms.ToTensor(),\n",
"            transforms.Normalize((0.1307, ), (0.3081, ))\n",
"        ])\n",
"        with FileLock(os.path.expanduser(\"~/.data.lock\")):\n",
"            return MNIST(data_dir, train=True, download=True, transform=transform)\n",
"\n",
"    def prepare_data(self):\n",
"        mnist_train = self.download_data(self.data_dir)\n",
"\n",
"        self.mnist_train, self.mnist_val = random_split(\n",
"            mnist_train, [55000, 5000])\n",
"\n",
"    def train_dataloader(self):\n",
"        return DataLoader(self.mnist_train, batch_size=int(self.batch_size))\n",
"\n",
"    def val_dataloader(self):\n",
"        return DataLoader(self.mnist_val, batch_size=int(self.batch_size))\n",
"\n",
"    def configure_optimizers(self):\n",
"        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)\n",
"        return optimizer\n",
"\n",
"\n",
"def train_mnist(config):\n",
"    model = LightningMNISTClassifier(config)\n",
"    trainer = pl.Trainer(max_epochs=10, enable_progress_bar=False)\n",
"\n",
"    trainer.fit(model)"
]
},
{
"cell_type": "markdown",
"id": "da1c3632",
"metadata": {},
"source": [
"And that's it! You can now run `train_mnist(config)` to train the classifier, e.g.\n",
"like so:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "86df3d39",
"metadata": {},
"outputs": [],
"source": [
"def train_mnist_no_tune():\n",
"    config = {\n",
"        \"layer_1_size\": 128,\n",
"        \"layer_2_size\": 256,\n",
"        \"lr\": 1e-3,\n",
"        \"batch_size\": 64\n",
"    }\n",
"    train_mnist(config)"
]
},
{
"cell_type": "markdown",
"id": "edcc0991",
"metadata": {},
"source": [
"## Tuning the model parameters\n",
"\n",
"The parameters above should already give you a good accuracy of over 90%. However,\n",
"we might improve on this simply by changing some of the hyperparameters. For instance,\n",
"maybe we get an even higher accuracy if we use a larger batch size.\n",
"\n",
"Instead of guessing the parameter values, let's use Tune to systematically try out\n",
"parameter combinations and find the best performing set.\n",
"\n",
"First, we need some additional imports:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "34faeb3b",
"metadata": {},
"outputs": [],
"source": [
"from pytorch_lightning.loggers import TensorBoardLogger\n",
"from ray import air, tune\n",
"from ray.air import session\n",
"from ray.tune import CLIReporter\n",
"from ray.tune.schedulers import ASHAScheduler, PopulationBasedTraining\n",
"from ray.tune.integration.pytorch_lightning import TuneReportCallback, \\\n",
"    TuneReportCheckpointCallback"
]
},
{
"cell_type": "markdown",
"id": "f65b9c5f",
"metadata": {},
"source": [
"### Talking to Tune with a PyTorch Lightning callback\n",
"\n",
"PyTorch Lightning introduced [Callbacks](https://pytorch-lightning.readthedocs.io/en/latest/extensions/callbacks.html)\n",
"that can be used to plug custom functions into the training loop. This way the original\n",
"`LightningModule` does not have to be altered at all. Also, we could use the same\n",
"callback for multiple modules.\n",
"\n",
"Ray Tune comes with ready-to-use PyTorch Lightning callbacks. To report metrics\n",
"back to Tune after each validation epoch, we will use the `TuneReportCallback`:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4bab80bc",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<ray.tune.integration.pytorch_lightning.TuneReportCallback at 0x17b305710>"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"TuneReportCallback(\n",
"    {\n",
"        \"loss\": \"ptl/val_loss\",\n",
"        \"mean_accuracy\": \"ptl/val_accuracy\"\n",
"    },\n",
"    on=\"validation_end\")"
]
},
{
"cell_type": "markdown",
"id": "286a1070",
"metadata": {},
"source": [
"This callback will take the `val_loss` and `val_accuracy` values\n",
"from the PyTorch Lightning trainer and report them to Tune as the `loss`\n",
"and `mean_accuracy`, respectively.\n",
"\n",
"### Adding the Tune training function\n",
"\n",
"Then we specify our training function. Note that we added `data_dir` as a\n",
"parameter here so that each training run does not have to download the full\n",
"MNIST dataset. Instead, we want to access a shared data location.\n",
"\n",
"We are also able to specify the number of epochs to train each model, and the number\n",
"of GPUs we want to use for training. We also create a TensorBoard logger that writes\n",
"logfiles directly into Tune's root trial directory - if we didn't do that, PyTorch\n",
"Lightning would create subdirectories, and each trial would thus be shown twice in\n",
"TensorBoard, once for Tune's logs and once for PyTorch Lightning's logs."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "74e7d1c2",
"metadata": {},
"outputs": [],
"source": [
"def train_mnist_tune(config, num_epochs=10, num_gpus=0, data_dir=\"~/data\"):\n",
"    data_dir = os.path.expanduser(data_dir)\n",
"    model = LightningMNISTClassifier(config, data_dir)\n",
"    trainer = pl.Trainer(\n",
"        max_epochs=num_epochs,\n",
"        # If fractional GPUs passed in, convert to int.\n",
"        gpus=math.ceil(num_gpus),\n",
"        logger=TensorBoardLogger(\n",
"            save_dir=os.getcwd(), name=\"\", version=\".\"),\n",
"        enable_progress_bar=False,\n",
"        callbacks=[\n",
"            TuneReportCallback(\n",
"                {\n",
"                    \"loss\": \"ptl/val_loss\",\n",
"                    \"mean_accuracy\": \"ptl/val_accuracy\"\n",
"                },\n",
"                on=\"validation_end\")\n",
"        ])\n",
"    trainer.fit(model)"
]
},
{
"cell_type": "markdown",
"id": "cf0f6d6e",
"metadata": {},
"source": [
"### Configuring the search space\n",
"\n",
"Now we configure the parameter search space. We would like to choose from three\n",
"different layer sizes and batch sizes. The learning rate should be sampled between\n",
"`0.0001` and `0.1`. The `tune.loguniform()` function is syntactic sugar that makes\n",
"sampling across these different orders of magnitude easier; specifically,\n",
"it ensures that small values are sampled as often as large ones."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "a50645e9",
"metadata": {},
"outputs": [],
"source": [
"config = {\n",
"    \"layer_1_size\": tune.choice([32, 64, 128]),\n",
"    \"layer_2_size\": tune.choice([64, 128, 256]),\n",
"    \"lr\": tune.loguniform(1e-4, 1e-1),\n",
"    \"batch_size\": tune.choice([32, 64, 128]),\n",
"}"
]
},
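{
"cell_type": "markdown",
"id": "loguniform-demo-note",
"metadata": {},
"source": [
"To get a feel for what `tune.loguniform()` does, you can draw a few values from the\n",
"search space directly. This is a minimal sketch for illustration only; `.sample()` is\n",
"used here just to inspect values - during a run, Tune does the sampling for you:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "loguniform-demo",
"metadata": {},
"outputs": [],
"source": [
"# Draw a few illustrative samples from the log-uniform distribution.\n",
"# Each order of magnitude between 1e-4 and 1e-1 is equally likely,\n",
"# so small learning rates show up as often as large ones.\n",
"[tune.loguniform(1e-4, 1e-1).sample() for _ in range(5)]"
]
},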
{
"cell_type": "markdown",
"id": "b1fb9ecd",
"metadata": {},
"source": [
"### Selecting a scheduler\n",
"\n",
"In this example, we use an [Asynchronous Hyperband](https://blog.ml.cmu.edu/2018/12/12/massively-parallel-hyperparameter-optimization/)\n",
"scheduler. This scheduler decides at each iteration which trials are likely to perform\n",
"badly, and stops these trials. This way we don't waste any resources on bad hyperparameter\n",
"configurations."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a2596b01",
"metadata": {},
"outputs": [],
"source": [
"num_epochs = 10\n",
"\n",
"scheduler = ASHAScheduler(\n",
"    max_t=num_epochs,\n",
"    grace_period=1,\n",
"    reduction_factor=2)"
]
},
{
"cell_type": "markdown",
"id": "9a49ae58",
"metadata": {},
"source": [
"### Changing the CLI output\n",
"\n",
"We instantiate a `CLIReporter` to specify which metrics we would like to see in our\n",
"output tables on the command line. This is optional, but it makes sure that the\n",
"output tables only include the information we actually care about."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "cd605a16",
"metadata": {},
"outputs": [],
"source": [
"reporter = CLIReporter(\n",
"    parameter_columns=[\"layer_1_size\", \"layer_2_size\", \"lr\", \"batch_size\"],\n",
"    metric_columns=[\"loss\", \"mean_accuracy\", \"training_iteration\"])"
]
},
{
"cell_type": "markdown",
"id": "5ec9a305",
"metadata": {},
"source": [
"### Passing constants to the train function\n",
"\n",
"The `data_dir`, `num_epochs` and `num_gpus` we pass to the training function\n",
"are constants. To avoid including them as non-configurable parameters in the `config`\n",
"specification, we can use `tune.with_parameters` to wrap the training function."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "332668dc",
"metadata": {},
"outputs": [],
"source": [
"gpus_per_trial = 0\n",
"data_dir = \"~/data\"\n",
"\n",
"train_fn_with_parameters = tune.with_parameters(train_mnist_tune,\n",
"                                                num_epochs=num_epochs,\n",
"                                                num_gpus=gpus_per_trial,\n",
"                                                data_dir=data_dir)"
]
},
{
"cell_type": "markdown",
"id": "feef8c39",
"metadata": {},
"source": [
"### Training with GPUs\n",
"\n",
"We can specify how many resources Tune should request for each trial.\n",
"This also includes GPUs.\n",
"\n",
"PyTorch Lightning takes care of moving the training to the GPUs. We\n",
"already made sure that our code is compatible with that, so there's\n",
"nothing more to do here other than to specify the number of GPUs\n",
"we would like to use:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "dc402716",
"metadata": {},
"outputs": [],
"source": [
"resources_per_trial = {\"cpu\": 1, \"gpu\": gpus_per_trial}"
]
},
{
"cell_type": "markdown",
"id": "ca050dfa",
"metadata": {},
"source": [
"You can also specify {doc}`fractional GPUs for Tune <../../ray-core/tasks/using-ray-with-gpus>`,\n",
"allowing multiple trials to share GPUs and thus increasing concurrency under resource\n",
"constraints; see the sketch below.\n",
"While the `gpus_per_trial` passed into\n",
"Tune is a decimal value, the `gpus` passed into the `pl.Trainer` should still be an integer.\n",
"Please note that if you use fractional GPUs, it is your responsibility to\n",
"make sure that multiple trials can share a GPU and that there is enough memory to do so.\n",
"Ray does not automatically handle this for you.\n",
"\n",
"If you want to use multiple GPUs per trial, you should check out the\n",
"[Ray Lightning Library](https://github.com/ray-project/ray_lightning).\n",
"This library makes it easy to run multiple concurrent trials with Ray Tune, with each trial also running\n",
"in a distributed fashion using Ray."
]
},
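{
"cell_type": "markdown",
"id": "fractional-gpu-note",
"metadata": {},
"source": [
"For illustration, here is a minimal sketch of such a fractional GPU setup. The value\n",
"`0.5` is a hypothetical choice, assuming a machine with at least one GPU and trials\n",
"that fit into half of its memory:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fractional-gpu-sketch",
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch: with 0.5 GPUs per trial, two trials share one GPU.\n",
"# Tune schedules trials based on this fractional resource request.\n",
"fractional_gpus_per_trial = 0.5  # hypothetical value for illustration\n",
"resources_per_trial = {\"cpu\": 1, \"gpu\": fractional_gpus_per_trial}\n",
"\n",
"# PyTorch Lightning still needs an integer GPU count, which is why\n",
"# `train_mnist_tune` rounds up before passing it to `pl.Trainer`:\n",
"math.ceil(fractional_gpus_per_trial)  # == 1"
]
},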
{
"cell_type": "markdown",
"id": "putting-it-together",
"metadata": {},
"source": [
"### Putting it together\n",
"\n",
"Lastly, we need to create a `Tuner()` object and start Ray Tune with `tuner.fit()`.\n",
"\n",
"The full code looks like this:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "ea182330",
"metadata": {},
"outputs": [],
"source": [
"def tune_mnist_asha(num_samples=10, num_epochs=10, gpus_per_trial=0, data_dir=\"~/data\"):\n",
"    config = {\n",
"        \"layer_1_size\": tune.choice([32, 64, 128]),\n",
"        \"layer_2_size\": tune.choice([64, 128, 256]),\n",
"        \"lr\": tune.loguniform(1e-4, 1e-1),\n",
"        \"batch_size\": tune.choice([32, 64, 128]),\n",
"    }\n",
"\n",
"    scheduler = ASHAScheduler(\n",
"        max_t=num_epochs,\n",
"        grace_period=1,\n",
"        reduction_factor=2)\n",
"\n",
"    reporter = CLIReporter(\n",
"        parameter_columns=[\"layer_1_size\", \"layer_2_size\", \"lr\", \"batch_size\"],\n",
"        metric_columns=[\"loss\", \"mean_accuracy\", \"training_iteration\"])\n",
"\n",
"    train_fn_with_parameters = tune.with_parameters(train_mnist_tune,\n",
"                                                    num_epochs=num_epochs,\n",
"                                                    num_gpus=gpus_per_trial,\n",
"                                                    data_dir=data_dir)\n",
"    resources_per_trial = {\"cpu\": 1, \"gpu\": gpus_per_trial}\n",
"\n",
"    tuner = tune.Tuner(\n",
"        tune.with_resources(\n",
"            train_fn_with_parameters,\n",
"            resources=resources_per_trial\n",
"        ),\n",
"        tune_config=tune.TuneConfig(\n",
"            metric=\"loss\",\n",
"            mode=\"min\",\n",
"            scheduler=scheduler,\n",
"            num_samples=num_samples,\n",
"        ),\n",
"        run_config=air.RunConfig(\n",
"            name=\"tune_mnist_asha\",\n",
"            progress_reporter=reporter,\n",
"        ),\n",
"        param_space=config,\n",
"    )\n",
"    results = tuner.fit()\n",
"\n",
"    print(\"Best hyperparameters found were: \", results.get_best_result().config)"
]
},
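{
"cell_type": "markdown",
"id": "asha-invocation-note",
"metadata": {},
"source": [
"You could now kick off a run like the following - a minimal sketch using the default\n",
"arguments from above; adjust `data_dir` to wherever the MNIST data should be stored:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "asha-invocation-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative call: runs 10 trials for up to 10 epochs each on CPUs,\n",
"# downloading MNIST into `data_dir` on first use.\n",
"tune_mnist_asha(num_samples=10, num_epochs=10, gpus_per_trial=0, data_dir=\"~/data\")"
]
},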
{
"cell_type": "markdown",
"id": "1fb96b6c",
"metadata": {},
"source": [
"In the example above, Tune runs 10 trials with different hyperparameter configurations.\n",
"An example output could look like this:\n",
"\n",
"```{code-block} bash\n",
":emphasize-lines: 12\n",
"\n",
" +------------------------------+------------+-------+----------------+----------------+-------------+--------------+----------+-----------------+----------------------+\n",
" | Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size | loss | mean_accuracy | training_iteration |\n",
" |------------------------------+------------+-------+----------------+----------------+-------------+--------------+----------+-----------------+----------------------|\n",
" | train_mnist_tune_63ecc_00000 | TERMINATED | | 128 | 64 | 0.00121197 | 128 | 0.120173 | 0.972461 | 10 |\n",
" | train_mnist_tune_63ecc_00001 | TERMINATED | | 64 | 128 | 0.0301395 | 128 | 0.454836 | 0.868164 | 4 |\n",
" | train_mnist_tune_63ecc_00002 | TERMINATED | | 64 | 128 | 0.0432097 | 128 | 0.718396 | 0.718359 | 1 |\n",
" | train_mnist_tune_63ecc_00003 | TERMINATED | | 32 | 128 | 0.000294669 | 32 | 0.111475 | 0.965764 | 10 |\n",
" | train_mnist_tune_63ecc_00004 | TERMINATED | | 32 | 256 | 0.000386664 | 64 | 0.133538 | 0.960839 | 8 |\n",
" | train_mnist_tune_63ecc_00005 | TERMINATED | | 128 | 128 | 0.0837395 | 32 | 2.32628 | 0.0991242 | 1 |\n",
" | train_mnist_tune_63ecc_00006 | TERMINATED | | 64 | 128 | 0.000158761 | 128 | 0.134595 | 0.959766 | 10 |\n",
" | train_mnist_tune_63ecc_00007 | TERMINATED | | 64 | 64 | 0.000672126 | 64 | 0.118182 | 0.972903 | 10 |\n",
" | train_mnist_tune_63ecc_00008 | TERMINATED | | 128 | 64 | 0.000502428 | 32 | 0.11082 | 0.975518 | 10 |\n",
" | train_mnist_tune_63ecc_00009 | TERMINATED | | 64 | 256 | 0.00112894 | 32 | 0.13472 | 0.971935 | 8 |\n",
" +------------------------------+------------+-------+----------------+----------------+-------------+--------------+----------+-----------------+----------------------+\n",
"```\n",
"\n",
"As you can see in the `training_iteration` column, trials with a high loss\n",
"(and low accuracy) have been terminated early. The best performing trial used\n",
"`layer_1_size=128`, `layer_2_size=64`, `lr=0.000502428` and\n",
"`batch_size=32`.\n",
"\n",
"## Using Population Based Training to find the best parameters\n",
"\n",
"The `ASHAScheduler` terminates badly performing trials early.\n",
"Sometimes, this stops trials that would have improved with more training steps,\n",
"and which might eventually even have outperformed other configurations.\n",
"\n",
"Another popular method for hyperparameter tuning, called\n",
"[Population Based Training](https://deepmind.com/blog/article/population-based-training-neural-networks),\n",
"instead perturbs hyperparameters during the training run. Tune implements PBT, and\n",
"we only need to make some slight adjustments to our code.\n",
"\n",
"### Adding checkpoints to the PyTorch Lightning module\n",
"\n",
"First, we need to introduce\n",
"another callback to save model checkpoints. Since Tune requires a call to\n",
"`session.report()` after creating a new checkpoint to register it, we will use\n",
"a combined reporting and checkpointing callback:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "7f86e4d8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback at 0x17a626090>"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"TuneReportCheckpointCallback(\n",
"    metrics={\n",
"        \"loss\": \"ptl/val_loss\",\n",
"        \"mean_accuracy\": \"ptl/val_accuracy\"\n",
"    },\n",
"    filename=\"checkpoint\",\n",
"    on=\"validation_end\")"
]
},
{
"cell_type": "markdown",
"id": "33a76d5b",
"metadata": {},
"source": [
"The `checkpoint` value is the name of the checkpoint file within the\n",
"checkpoint directory.\n",
"\n",
"We also include checkpoint loading in our training function:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "746e962a",
"metadata": {},
"outputs": [],
"source": [
"def train_mnist_tune_checkpoint(config,\n",
"                                checkpoint_dir=None,\n",
"                                num_epochs=10,\n",
"                                num_gpus=0,\n",
"                                data_dir=\"~/data\"):\n",
"    data_dir = os.path.expanduser(data_dir)\n",
"    kwargs = {\n",
"        \"max_epochs\": num_epochs,\n",
"        # If fractional GPUs passed in, convert to int.\n",
"        \"gpus\": math.ceil(num_gpus),\n",
"        \"logger\": TensorBoardLogger(\n",
"            save_dir=os.getcwd(), name=\"\", version=\".\"),\n",
"        \"enable_progress_bar\": False,\n",
"        \"callbacks\": [\n",
"            TuneReportCheckpointCallback(\n",
"                metrics={\n",
"                    \"loss\": \"ptl/val_loss\",\n",
"                    \"mean_accuracy\": \"ptl/val_accuracy\"\n",
"                },\n",
"                filename=\"checkpoint\",\n",
"                on=\"validation_end\")\n",
"        ]\n",
"    }\n",
"\n",
"    if checkpoint_dir:\n",
"        kwargs[\"resume_from_checkpoint\"] = os.path.join(\n",
"            checkpoint_dir, \"checkpoint\")\n",
"\n",
"    model = LightningMNISTClassifier(config=config, data_dir=data_dir)\n",
"    trainer = pl.Trainer(**kwargs)\n",
"\n",
"    trainer.fit(model)"
]
},
{
"cell_type": "markdown",
"id": "39dc7b46",
"metadata": {},
"source": [
"### Configuring and running Population Based Training\n",
"\n",
"We need to call Tune slightly differently:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "e12a1bd5",
"metadata": {},
"outputs": [],
"source": [
"def tune_mnist_pbt(num_samples=10, num_epochs=10, gpus_per_trial=0, data_dir=\"~/data\"):\n",
"    config = {\n",
"        \"layer_1_size\": tune.choice([32, 64, 128]),\n",
"        \"layer_2_size\": tune.choice([64, 128, 256]),\n",
"        \"lr\": 1e-3,\n",
"        \"batch_size\": 64,\n",
"    }\n",
"\n",
"    scheduler = PopulationBasedTraining(\n",
"        perturbation_interval=4,\n",
"        hyperparam_mutations={\n",
"            \"lr\": tune.loguniform(1e-4, 1e-1),\n",
"            \"batch_size\": [32, 64, 128]\n",
"        })\n",
"\n",
"    reporter = CLIReporter(\n",
"        parameter_columns=[\"layer_1_size\", \"layer_2_size\", \"lr\", \"batch_size\"],\n",
"        metric_columns=[\"loss\", \"mean_accuracy\", \"training_iteration\"])\n",
"\n",
"    tuner = tune.Tuner(\n",
"        tune.with_resources(\n",
"            tune.with_parameters(\n",
"                train_mnist_tune_checkpoint,\n",
"                num_epochs=num_epochs,\n",
"                num_gpus=gpus_per_trial,\n",
"                data_dir=data_dir),\n",
"            resources={\n",
"                \"cpu\": 1,\n",
"                \"gpu\": gpus_per_trial\n",
"            }\n",
"        ),\n",
"        tune_config=tune.TuneConfig(\n",
"            metric=\"loss\",\n",
"            mode=\"min\",\n",
"            scheduler=scheduler,\n",
"            num_samples=num_samples,\n",
"        ),\n",
"        run_config=air.RunConfig(\n",
"            name=\"tune_mnist_pbt\",\n",
"            progress_reporter=reporter,\n",
"        ),\n",
"        param_space=config,\n",
"    )\n",
"    results = tuner.fit()\n",
"\n",
"    print(\"Best hyperparameters found were: \", results.get_best_result().config)"
]
},
{
"cell_type": "markdown",
"id": "6087f807",
"metadata": {},
"source": [
"Instead of passing Tune search distributions for every entry in the `config` dict, we start\n",
"with fixed values, though we are also able to sample some of them, like the\n",
"layer sizes. Additionally, we have to tell PBT how to perturb the hyperparameters.\n",
"Note that the layer sizes are not tuned here: we cannot simply change layer sizes\n",
"during a training run, which is what would happen in PBT.\n",
"\n",
"To test running both of our main scripts (`tune_mnist_asha` and `tune_mnist_pbt`), all you have to do is specify\n",
"a `data_dir` folder and run the scripts with reasonable parameters:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "eb9faf3e",
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"== Status ==\n",
"Current time: 2022-07-22 16:24:58 (running for 00:00:00.16)\n",
"Memory usage on this node: 11.2/16.0 GiB\n",
"Using AsyncHyperBand: num_stopped=0\n",
"Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None\n",
"Resources requested: 1.0/16 CPUs, 0/0 GPUs, 0.0/4.72 GiB heap, 0.0/2.0 GiB objects\n",
"Result logdir: /Users/kai/ray_results/tune_mnist_asha\n",
"Number of trials: 1/1 (1 RUNNING)\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+\n",
"| Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size |\n",
"|------------------------------+----------+-----------------+----------------+----------------+------------+--------------|\n",
"| train_mnist_tune_727f7_00000 | RUNNING | 127.0.0.1:52355 | 128 | 128 | 0.00165008 | 64 |\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+\n",
"\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m GPU available: False, used: False\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m TPU available: False, using: 0 TPU cores\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m IPU available: False, using: 0 IPUs\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m HPU available: False, using: 0 HPUs\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m /Users/kai/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pytorch_lightning/trainer/configuration_validator.py:336: LightningDeprecationWarning: The `on_keyboard_interrupt` callback hook was deprecated in v1.5 and will be removed in v1.7. Please use the `on_exception` callback hook instead.\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m \"The `on_keyboard_interrupt` callback hook was deprecated in v1.5 and will be removed in v1.7.\"\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m /Users/kai/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pytorch_lightning/trainer/configuration_validator.py:348: LightningDeprecationWarning: The `on_init_start` callback hook was deprecated in v1.6 and will be removed in v1.8.\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m \"The `on_init_start` callback hook was deprecated in v1.6 and will be removed in v1.8.\"\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m /Users/kai/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pytorch_lightning/trainer/configuration_validator.py:351: LightningDeprecationWarning: The `on_init_end` callback hook was deprecated in v1.6 and will be removed in v1.8.\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m rank_zero_deprecation(\"The `on_init_end` callback hook was deprecated in v1.6 and will be removed in v1.8.\")\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m /Users/kai/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pytorch_lightning/trainer/configuration_validator.py:377: LightningDeprecationWarning: The `Callback.on_batch_start` hook was deprecated in v1.6 and will be removed in v1.8. Please use `Callback.on_train_batch_start` instead.\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m f\"The `Callback.{hook}` hook was deprecated in v1.6 and\"\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m /Users/kai/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pytorch_lightning/trainer/configuration_validator.py:377: LightningDeprecationWarning: The `Callback.on_batch_end` hook was deprecated in v1.6 and will be removed in v1.8. Please use `Callback.on_train_batch_end` instead.\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m f\"The `Callback.{hook}` hook was deprecated in v1.6 and\"\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m /Users/kai/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pytorch_lightning/trainer/configuration_validator.py:386: LightningDeprecationWarning: The `Callback.on_epoch_start` hook was deprecated in v1.6 and will be removed in v1.8. Please use `Callback.on_<train/validation/test>_epoch_start` instead.\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m f\"The `Callback.{hook}` hook was deprecated in v1.6 and\"\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m /Users/kai/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pytorch_lightning/trainer/configuration_validator.py:386: LightningDeprecationWarning: The `Callback.on_epoch_end` hook was deprecated in v1.6 and will be removed in v1.8. Please use `Callback.on_<train/validation/test>_epoch_end` instead.\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m f\"The `Callback.{hook}` hook was deprecated in v1.6 and\"\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m \n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m | Name | Type | Params\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m -----------------------------------\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m 0 | layer_1 | Linear | 100 K \n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m 1 | layer_2 | Linear | 16.5 K\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m 2 | layer_3 | Linear | 1.3 K \n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m -----------------------------------\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m 118 K Trainable params\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m 0 Non-trainable params\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m 118 K Total params\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m 0.473 Total estimated model params size (MB)\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m /Users/kai/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:245: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m category=PossibleUserWarning,\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m /Users/kai/.pyenv/versions/3.7.7/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:245: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.\n",
"\u001b[2m\u001b[36m(train_mnist_tune pid=52355)\u001b[0m category=PossibleUserWarning,\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"== Status ==\n",
"Current time: 2022-07-22 16:25:13 (running for 00:00:15.04)\n",
"Memory usage on this node: 11.7/16.0 GiB\n",
"Using AsyncHyperBand: num_stopped=0\n",
"Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None\n",
"Resources requested: 1.0/16 CPUs, 0/0 GPUs, 0.0/4.72 GiB heap, 0.0/2.0 GiB objects\n",
"Result logdir: /Users/kai/ray_results/tune_mnist_asha\n",
"Number of trials: 1/1 (1 RUNNING)\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+\n",
"| Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size |\n",
"|------------------------------+----------+-----------------+----------------+----------------+------------+--------------|\n",
"| train_mnist_tune_727f7_00000 | RUNNING | 127.0.0.1:52355 | 128 | 128 | 0.00165008 | 64 |\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+\n",
"\n",
"\n",
"== Status ==\n",
"Current time: 2022-07-22 16:25:18 (running for 00:00:20.06)\n",
"Memory usage on this node: 11.7/16.0 GiB\n",
"Using AsyncHyperBand: num_stopped=0\n",
"Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None\n",
"Resources requested: 1.0/16 CPUs, 0/0 GPUs, 0.0/4.72 GiB heap, 0.0/2.0 GiB objects\n",
"Result logdir: /Users/kai/ray_results/tune_mnist_asha\n",
"Number of trials: 1/1 (1 RUNNING)\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+\n",
"| Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size |\n",
"|------------------------------+----------+-----------------+----------------+----------------+------------+--------------|\n",
"| train_mnist_tune_727f7_00000 | RUNNING | 127.0.0.1:52355 | 128 | 128 | 0.00165008 | 64 |\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+\n",
"\n",
"\n",
"Result for train_mnist_tune_727f7_00000:\n",
" date: 2022-07-22_16-25-19\n",
" done: false\n",
" experiment_id: d137534eb136478c9e9c4514a538f9da\n",
" hostname: Kais-MacBook-Pro.local\n",
" iterations_since_restore: 1\n",
" loss: 0.12323953211307526\n",
" mean_accuracy: 0.9600474834442139\n",
" node_ip: 127.0.0.1\n",
" pid: 52355\n",
" time_since_restore: 11.140714168548584\n",
" time_this_iter_s: 11.140714168548584\n",
" time_total_s: 11.140714168548584\n",
" timestamp: 1658503519\n",
" timesteps_since_restore: 0\n",
" training_iteration: 1\n",
" trial_id: 727f7_00000\n",
" warmup_time: 0.0038437843322753906\n",
" \n",
"== Status ==\n",
"Current time: 2022-07-22 16:25:24 (running for 00:00:26.19)\n",
"Memory usage on this node: 12.0/16.0 GiB\n",
"Using AsyncHyperBand: num_stopped=0\n",
"Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: -0.12323953211307526\n",
"Resources requested: 1.0/16 CPUs, 0/0 GPUs, 0.0/4.72 GiB heap, 0.0/2.0 GiB objects\n",
"Current best trial: 727f7_00000 with loss=0.12323953211307526 and parameters={'layer_1_size': 128, 'layer_2_size': 128, 'lr': 0.001650077499050015, 'batch_size': 64}\n",
"Result logdir: /Users/kai/ray_results/tune_mnist_asha\n",
"Number of trials: 1/1 (1 RUNNING)\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+---------+-----------------+----------------------+\n",
"| Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size | loss | mean_accuracy | training_iteration |\n",
"|------------------------------+----------+-----------------+----------------+----------------+------------+--------------+---------+-----------------+----------------------|\n",
"| train_mnist_tune_727f7_00000 | RUNNING | 127.0.0.1:52355 | 128 | 128 | 0.00165008 | 64 | 0.12324 | 0.960047 | 1 |\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+---------+-----------------+----------------------+\n",
"\n",
"\n",
"== Status ==\n",
"Current time: 2022-07-22 16:25:29 (running for 00:00:31.21)\n",
"Memory usage on this node: 11.9/16.0 GiB\n",
"Using AsyncHyperBand: num_stopped=0\n",
"Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: -0.12323953211307526\n",
"Resources requested: 1.0/16 CPUs, 0/0 GPUs, 0.0/4.72 GiB heap, 0.0/2.0 GiB objects\n",
"Current best trial: 727f7_00000 with loss=0.12323953211307526 and parameters={'layer_1_size': 128, 'layer_2_size': 128, 'lr': 0.001650077499050015, 'batch_size': 64}\n",
"Result logdir: /Users/kai/ray_results/tune_mnist_asha\n",
"Number of trials: 1/1 (1 RUNNING)\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+---------+-----------------+----------------------+\n",
"| Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size | loss | mean_accuracy | training_iteration |\n",
"|------------------------------+----------+-----------------+----------------+----------------+------------+--------------+---------+-----------------+----------------------|\n",
"| train_mnist_tune_727f7_00000 | RUNNING | 127.0.0.1:52355 | 128 | 128 | 0.00165008 | 64 | 0.12324 | 0.960047 | 1 |\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+---------+-----------------+----------------------+\n",
"\n",
"\n",
"Result for train_mnist_tune_727f7_00000:\n",
" date: 2022-07-22_16-25-31\n",
" done: false\n",
" experiment_id: d137534eb136478c9e9c4514a538f9da\n",
" hostname: Kais-MacBook-Pro.local\n",
" iterations_since_restore: 2\n",
" loss: 0.09032993763685226\n",
" mean_accuracy: 0.9731012582778931\n",
" node_ip: 127.0.0.1\n",
" pid: 52355\n",
" time_since_restore: 22.593159914016724\n",
" time_this_iter_s: 11.45244574546814\n",
" time_total_s: 22.593159914016724\n",
" timestamp: 1658503531\n",
" timesteps_since_restore: 0\n",
" training_iteration: 2\n",
" trial_id: 727f7_00000\n",
" warmup_time: 0.0038437843322753906\n",
" \n",
"== Status ==\n",
"Current time: 2022-07-22 16:25:36 (running for 00:00:37.64)\n",
"Memory usage on this node: 12.1/16.0 GiB\n",
"Using AsyncHyperBand: num_stopped=0\n",
"Bracket: Iter 4.000: None | Iter 2.000: -0.09032993763685226 | Iter 1.000: -0.12323953211307526\n",
"Resources requested: 1.0/16 CPUs, 0/0 GPUs, 0.0/4.72 GiB heap, 0.0/2.0 GiB objects\n",
"Current best trial: 727f7_00000 with loss=0.09032993763685226 and parameters={'layer_1_size': 128, 'layer_2_size': 128, 'lr': 0.001650077499050015, 'batch_size': 64}\n",
"Result logdir: /Users/kai/ray_results/tune_mnist_asha\n",
"Number of trials: 1/1 (1 RUNNING)\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------+\n",
"| Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size | loss | mean_accuracy | training_iteration |\n",
"|------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------|\n",
"| train_mnist_tune_727f7_00000 | RUNNING | 127.0.0.1:52355 | 128 | 128 | 0.00165008 | 64 | 0.0903299 | 0.973101 | 2 |\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------+\n",
"\n",
"\n",
"== Status ==\n",
"Current time: 2022-07-22 16:25:41 (running for 00:00:42.66)\n",
"Memory usage on this node: 12.1/16.0 GiB\n",
"Using AsyncHyperBand: num_stopped=0\n",
"Bracket: Iter 4.000: None | Iter 2.000: -0.09032993763685226 | Iter 1.000: -0.12323953211307526\n",
"Resources requested: 1.0/16 CPUs, 0/0 GPUs, 0.0/4.72 GiB heap, 0.0/2.0 GiB objects\n",
"Current best trial: 727f7_00000 with loss=0.09032993763685226 and parameters={'layer_1_size': 128, 'layer_2_size': 128, 'lr': 0.001650077499050015, 'batch_size': 64}\n",
"Result logdir: /Users/kai/ray_results/tune_mnist_asha\n",
"Number of trials: 1/1 (1 RUNNING)\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------+\n",
"| Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size | loss | mean_accuracy | training_iteration |\n",
"|------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------|\n",
"| train_mnist_tune_727f7_00000 | RUNNING | 127.0.0.1:52355 | 128 | 128 | 0.00165008 | 64 | 0.0903299 | 0.973101 | 2 |\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------+\n",
"\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Result for train_mnist_tune_727f7_00000:\n",
" date: 2022-07-22_16-25-42\n",
" done: false\n",
" experiment_id: d137534eb136478c9e9c4514a538f9da\n",
" hostname: Kais-MacBook-Pro.local\n",
" iterations_since_restore: 3\n",
" loss: 0.09614239633083344\n",
" mean_accuracy: 0.9754746556282043\n",
" node_ip: 127.0.0.1\n",
" pid: 52355\n",
" time_since_restore: 33.65132713317871\n",
" time_this_iter_s: 11.058167219161987\n",
" time_total_s: 33.65132713317871\n",
" timestamp: 1658503542\n",
" timesteps_since_restore: 0\n",
" training_iteration: 3\n",
" trial_id: 727f7_00000\n",
" warmup_time: 0.0038437843322753906\n",
" \n",
"== Status ==\n",
"Current time: 2022-07-22 16:25:47 (running for 00:00:48.70)\n",
"Memory usage on this node: 12.1/16.0 GiB\n",
"Using AsyncHyperBand: num_stopped=0\n",
"Bracket: Iter 4.000: None | Iter 2.000: -0.09032993763685226 | Iter 1.000: -0.12323953211307526\n",
"Resources requested: 1.0/16 CPUs, 0/0 GPUs, 0.0/4.72 GiB heap, 0.0/2.0 GiB objects\n",
"Current best trial: 727f7_00000 with loss=0.09614239633083344 and parameters={'layer_1_size': 128, 'layer_2_size': 128, 'lr': 0.001650077499050015, 'batch_size': 64}\n",
"Result logdir: /Users/kai/ray_results/tune_mnist_asha\n",
"Number of trials: 1/1 (1 RUNNING)\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------+\n",
"| Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size | loss | mean_accuracy | training_iteration |\n",
"|------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------|\n",
"| train_mnist_tune_727f7_00000 | RUNNING | 127.0.0.1:52355 | 128 | 128 | 0.00165008 | 64 | 0.0961424 | 0.975475 | 3 |\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------+\n",
"\n",
"\n",
"== Status ==\n",
"Current time: 2022-07-22 16:25:52 (running for 00:00:53.72)\n",
"Memory usage on this node: 12.1/16.0 GiB\n",
"Using AsyncHyperBand: num_stopped=0\n",
"Bracket: Iter 4.000: None | Iter 2.000: -0.09032993763685226 | Iter 1.000: -0.12323953211307526\n",
"Resources requested: 1.0/16 CPUs, 0/0 GPUs, 0.0/4.72 GiB heap, 0.0/2.0 GiB objects\n",
"Current best trial: 727f7_00000 with loss=0.09614239633083344 and parameters={'layer_1_size': 128, 'layer_2_size': 128, 'lr': 0.001650077499050015, 'batch_size': 64}\n",
"Result logdir: /Users/kai/ray_results/tune_mnist_asha\n",
"Number of trials: 1/1 (1 RUNNING)\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------+\n",
"| Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size | loss | mean_accuracy | training_iteration |\n",
"|------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------|\n",
"| train_mnist_tune_727f7_00000 | RUNNING | 127.0.0.1:52355 | 128 | 128 | 0.00165008 | 64 | 0.0961424 | 0.975475 | 3 |\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------+\n",
"\n",
"\n",
"Result for train_mnist_tune_727f7_00000:\n",
" date: 2022-07-22_16-25-53\n",
" done: false\n",
" experiment_id: d137534eb136478c9e9c4514a538f9da\n",
" hostname: Kais-MacBook-Pro.local\n",
" iterations_since_restore: 4\n",
" loss: 0.09530177712440491\n",
" mean_accuracy: 0.9760680198669434\n",
" node_ip: 127.0.0.1\n",
" pid: 52355\n",
" time_since_restore: 44.58630990982056\n",
" time_this_iter_s: 10.934982776641846\n",
" time_total_s: 44.58630990982056\n",
" timestamp: 1658503553\n",
" timesteps_since_restore: 0\n",
" training_iteration: 4\n",
" trial_id: 727f7_00000\n",
" warmup_time: 0.0038437843322753906\n",
" \n",
"== Status ==\n",
"Current time: 2022-07-22 16:25:58 (running for 00:00:59.63)\n",
"Memory usage on this node: 11.9/16.0 GiB\n",
"Using AsyncHyperBand: num_stopped=0\n",
"Bracket: Iter 4.000: -0.09530177712440491 | Iter 2.000: -0.09032993763685226 | Iter 1.000: -0.12323953211307526\n",
"Resources requested: 1.0/16 CPUs, 0/0 GPUs, 0.0/4.72 GiB heap, 0.0/2.0 GiB objects\n",
"Current best trial: 727f7_00000 with loss=0.09530177712440491 and parameters={'layer_1_size': 128, 'layer_2_size': 128, 'lr': 0.001650077499050015, 'batch_size': 64}\n",
"Result logdir: /Users/kai/ray_results/tune_mnist_asha\n",
"Number of trials: 1/1 (1 RUNNING)\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------+\n",
"| Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size | loss | mean_accuracy | training_iteration |\n",
"|------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------|\n",
"| train_mnist_tune_727f7_00000 | RUNNING | 127.0.0.1:52355 | 128 | 128 | 0.00165008 | 64 | 0.0953018 | 0.976068 | 4 |\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------+\n",
"\n",
"\n",
"== Status ==\n",
"Current time: 2022-07-22 16:26:03 (running for 00:01:04.65)\n",
"Memory usage on this node: 11.9/16.0 GiB\n",
"Using AsyncHyperBand: num_stopped=0\n",
"Bracket: Iter 4.000: -0.09530177712440491 | Iter 2.000: -0.09032993763685226 | Iter 1.000: -0.12323953211307526\n",
"Resources requested: 1.0/16 CPUs, 0/0 GPUs, 0.0/4.72 GiB heap, 0.0/2.0 GiB objects\n",
"Current best trial: 727f7_00000 with loss=0.09530177712440491 and parameters={'layer_1_size': 128, 'layer_2_size': 128, 'lr': 0.001650077499050015, 'batch_size': 64}\n",
"Result logdir: /Users/kai/ray_results/tune_mnist_asha\n",
"Number of trials: 1/1 (1 RUNNING)\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------+\n",
"| Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size | loss | mean_accuracy | training_iteration |\n",
"|------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------|\n",
"| train_mnist_tune_727f7_00000 | RUNNING | 127.0.0.1:52355 | 128 | 128 | 0.00165008 | 64 | 0.0953018 | 0.976068 | 4 |\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+-----------+-----------------+----------------------+\n",
"\n",
"\n",
"Result for train_mnist_tune_727f7_00000:\n",
" date: 2022-07-22_16-26-04\n",
" done: false\n",
" experiment_id: d137534eb136478c9e9c4514a538f9da\n",
" hostname: Kais-MacBook-Pro.local\n",
" iterations_since_restore: 5\n",
" loss: 0.10016436874866486\n",
" mean_accuracy: 0.9750791192054749\n",
" node_ip: 127.0.0.1\n",
" pid: 52355\n",
" time_since_restore: 55.61101007461548\n",
" time_this_iter_s: 11.024700164794922\n",
" time_total_s: 55.61101007461548\n",
" timestamp: 1658503564\n",
" timesteps_since_restore: 0\n",
" training_iteration: 5\n",
" trial_id: 727f7_00000\n",
" warmup_time: 0.0038437843322753906\n",
" \n",
"== Status ==\n",
"Current time: 2022-07-22 16:26:09 (running for 00:01:10.66)\n",
"Memory usage on this node: 12.0/16.0 GiB\n",
"Using AsyncHyperBand: num_stopped=0\n",
"Bracket: Iter 4.000: -0.09530177712440491 | Iter 2.000: -0.09032993763685226 | Iter 1.000: -0.12323953211307526\n",
"Resources requested: 1.0/16 CPUs, 0/0 GPUs, 0.0/4.72 GiB heap, 0.0/2.0 GiB objects\n",
"Current best trial: 727f7_00000 with loss=0.10016436874866486 and parameters={'layer_1_size': 128, 'layer_2_size': 128, 'lr': 0.001650077499050015, 'batch_size': 64}\n",
"Result logdir: /Users/kai/ray_results/tune_mnist_asha\n",
"Number of trials: 1/1 (1 RUNNING)\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+----------+-----------------+----------------------+\n",
"| Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size | loss | mean_accuracy | training_iteration |\n",
"|------------------------------+----------+-----------------+----------------+----------------+------------+--------------+----------+-----------------+----------------------|\n",
"| train_mnist_tune_727f7_00000 | RUNNING | 127.0.0.1:52355 | 128 | 128 | 0.00165008 | 64 | 0.100164 | 0.975079 | 5 |\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+----------+-----------------+----------------------+\n",
"\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"== Status ==\n",
"Current time: 2022-07-22 16:26:14 (running for 00:01:15.67)\n",
"Memory usage on this node: 12.0/16.0 GiB\n",
"Using AsyncHyperBand: num_stopped=0\n",
"Bracket: Iter 4.000: -0.09530177712440491 | Iter 2.000: -0.09032993763685226 | Iter 1.000: -0.12323953211307526\n",
"Resources requested: 1.0/16 CPUs, 0/0 GPUs, 0.0/4.72 GiB heap, 0.0/2.0 GiB objects\n",
"Current best trial: 727f7_00000 with loss=0.10016436874866486 and parameters={'layer_1_size': 128, 'layer_2_size': 128, 'lr': 0.001650077499050015, 'batch_size': 64}\n",
"Result logdir: /Users/kai/ray_results/tune_mnist_asha\n",
"Number of trials: 1/1 (1 RUNNING)\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+----------+-----------------+----------------------+\n",
"| Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size | loss | mean_accuracy | training_iteration |\n",
"|------------------------------+----------+-----------------+----------------+----------------+------------+--------------+----------+-----------------+----------------------|\n",
"| train_mnist_tune_727f7_00000 | RUNNING | 127.0.0.1:52355 | 128 | 128 | 0.00165008 | 64 | 0.100164 | 0.975079 | 5 |\n",
"+------------------------------+----------+-----------------+----------------+----------------+------------+--------------+----------+-----------------+----------------------+\n",
"\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-07-22 16:26:15,433\tINFO tune.py:738 -- Total run time: 76.74 seconds (76.61 seconds for the tuning loop).\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Result for train_mnist_tune_727f7_00000:\n",
" date: 2022-07-22_16-26-15\n",
" done: true\n",
" experiment_id: d137534eb136478c9e9c4514a538f9da\n",
" hostname: Kais-MacBook-Pro.local\n",
" iterations_since_restore: 6\n",
" loss: 0.10947871953248978\n",
" mean_accuracy: 0.9756724834442139\n",
" node_ip: 127.0.0.1\n",
" pid: 52355\n",
" time_since_restore: 66.58598804473877\n",
" time_this_iter_s: 10.974977970123291\n",
" time_total_s: 66.58598804473877\n",
" timestamp: 1658503575\n",
" timesteps_since_restore: 0\n",
" training_iteration: 6\n",
" trial_id: 727f7_00000\n",
" warmup_time: 0.0038437843322753906\n",
" \n",
"== Status ==\n",
"Current time: 2022-07-22 16:26:15 (running for 00:01:16.62)\n",
"Memory usage on this node: 12.1/16.0 GiB\n",
"Using AsyncHyperBand: num_stopped=1\n",
"Bracket: Iter 4.000: -0.09530177712440491 | Iter 2.000: -0.09032993763685226 | Iter 1.000: -0.12323953211307526\n",
"Resources requested: 0/16 CPUs, 0/0 GPUs, 0.0/4.72 GiB heap, 0.0/2.0 GiB objects\n",
"Current best trial: 727f7_00000 with loss=0.10947871953248978 and parameters={'layer_1_size': 128, 'layer_2_size': 128, 'lr': 0.001650077499050015, 'batch_size': 64}\n",
"Result logdir: /Users/kai/ray_results/tune_mnist_asha\n",
"Number of trials: 1/1 (1 TERMINATED)\n",
"+------------------------------+------------+-----------------+----------------+----------------+------------+--------------+----------+-----------------+----------------------+\n",
"| Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size | loss | mean_accuracy | training_iteration |\n",
"|------------------------------+------------+-----------------+----------------+----------------+------------+--------------+----------+-----------------+----------------------|\n",
"| train_mnist_tune_727f7_00000 | TERMINATED | 127.0.0.1:52355 | 128 | 128 | 0.00165008 | 64 | 0.109479 | 0.975672 | 6 |\n",
"+------------------------------+------------+-----------------+----------------+----------------+------------+--------------+----------+-----------------+----------------------+\n",
"\n",
"\n",
"Best hyperparameters found were: {'layer_1_size': 128, 'layer_2_size': 128, 'lr': 0.001650077499050015, 'batch_size': 64}\n"
]
},
|
|
{
"ename": "TypeError",
"evalue": "__init__() got an unexpected keyword argument 'tune_mnist_pbt'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/var/folders/b2/0_91bd757rz02lrmr920v0gw0000gn/T/ipykernel_52122/1146224506.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0mtune_mnist_asha\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnum_samples\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnum_epochs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m6\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mgpus_per_trial\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdata_dir\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdata_dir\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0mtune_mnist_pbt\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnum_samples\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnum_epochs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m6\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mgpus_per_trial\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdata_dir\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdata_dir\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/var/folders/b2/0_91bd757rz02lrmr920v0gw0000gn/T/ipykernel_52122/328169407.py\u001b[0m in \u001b[0;36mtune_mnist_pbt\u001b[0;34m(num_samples, num_epochs, gpus_per_trial, data_dir)\u001b[0m\n\u001b[1;32m 38\u001b[0m run_config=air.RunConfig(\n\u001b[1;32m 39\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"tune_mnist_asha\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 40\u001b[0;31m \u001b[0mtune_mnist_pbt\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mreporter\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 41\u001b[0m ),\n\u001b[1;32m 42\u001b[0m \u001b[0mparam_space\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mconfig\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mTypeError\u001b[0m: __init__() got an unexpected keyword argument 'tune_mnist_pbt'"
]
}
],
"source": [
"data_dir = \"~/data/\"\n",
"\n",
"tune_mnist_asha(num_samples=1, num_epochs=6, gpus_per_trial=0, data_dir=data_dir)\n",
"tune_mnist_pbt(num_samples=1, num_epochs=6, gpus_per_trial=0, data_dir=data_dir)"
]
},
{
"cell_type": "markdown",
"id": "3ae0eea6",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"If you have more resources available (e.g. a GPU), you can modify the above parameters accordingly.\n",
|
|
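"\n",
"For example, on a machine with GPUs you might run more samples for more epochs and give each trial a GPU. A hypothetical invocation (the values here are illustrative, not tuned recommendations) could look like this:\n",
"\n",
"```python\n",
"# Illustrative scale-up: more samples, more epochs, one GPU per trial.\n",
"tune_mnist_asha(num_samples=10, num_epochs=10, gpus_per_trial=1, data_dir=data_dir)\n",
"tune_mnist_pbt(num_samples=10, num_epochs=10, gpus_per_trial=1, data_dir=data_dir)\n",
"```\n",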
"\n",
|
|
"An example output of a run could look like this:\n",
|
|
"\n",
|
|
"```bash\n",
|
|
"+-----------------------------------------+------------+-------+----------------+----------------+-----------+--------------+-----------+-----------------+----------------------+\n",
|
|
"| Trial name | status | loc | layer_1_size | layer_2_size | lr | batch_size | loss | mean_accuracy | training_iteration |\n",
|
|
"|-----------------------------------------+------------+-------+----------------+----------------+-----------+--------------+-----------+-----------------+----------------------|\n",
|
|
"| train_mnist_tune_checkpoint_85489_00000 | TERMINATED | | 128 | 128 | 0.001 | 64 | 0.108734 | 0.973101 | 10 |\n",
|
|
"| train_mnist_tune_checkpoint_85489_00001 | TERMINATED | | 128 | 128 | 0.001 | 64 | 0.093577 | 0.978639 | 10 |\n",
|
|
"| train_mnist_tune_checkpoint_85489_00002 | TERMINATED | | 128 | 256 | 0.0008 | 32 | 0.0922348 | 0.979299 | 10 |\n",
|
|
"| train_mnist_tune_checkpoint_85489_00003 | TERMINATED | | 64 | 256 | 0.001 | 64 | 0.124648 | 0.973892 | 10 |\n",
|
|
"| train_mnist_tune_checkpoint_85489_00004 | TERMINATED | | 128 | 64 | 0.001 | 64 | 0.101717 | 0.975079 | 10 |\n",
|
|
"| train_mnist_tune_checkpoint_85489_00005 | TERMINATED | | 64 | 64 | 0.001 | 64 | 0.121467 | 0.969146 | 10 |\n",
|
|
"| train_mnist_tune_checkpoint_85489_00006 | TERMINATED | | 128 | 256 | 0.00064 | 32 | 0.053446 | 0.987062 | 10 |\n",
|
|
"| train_mnist_tune_checkpoint_85489_00007 | TERMINATED | | 128 | 256 | 0.001 | 64 | 0.129804 | 0.973497 | 10 |\n",
|
|
"| train_mnist_tune_checkpoint_85489_00008 | TERMINATED | | 64 | 256 | 0.0285125 | 128 | 0.363236 | 0.913867 | 10 |\n",
|
|
"| train_mnist_tune_checkpoint_85489_00009 | TERMINATED | | 32 | 256 | 0.001 | 64 | 0.150946 | 0.964201 | 10 |\n",
|
|
"+-----------------------------------------+------------+-------+----------------+----------------+-----------+--------------+-----------+-----------------+----------------------+\n",
|
|
"```\n",
|
|
"\n",
|
|
"As you can see, each sample ran the full number of 10 iterations.\n",
|
|
"All trials ended with quite good parameter combinations and showed relatively good performances.\n",
|
|
"In some runs, the parameters have been perturbed. And the best configuration even reached a\n",
|
|
"mean validation accuracy of `0.987062`!\n",
|
|
"\n",
|
|
"In summary, PyTorch Lightning Modules are easy to extend to use with Tune. It just took\n",
|
|
"us importing one or two callbacks and a small wrapper function to get great performing\n",
|
|
"parameter configurations.\n",
|
|
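"\n",
"To recap the pattern in one place, here is a condensed sketch. It assumes the `LightningMNISTClassifier` and the `ptl/val_loss`/`ptl/val_accuracy` metrics defined earlier in this tutorial; the search space, sample count, and experiment name below are placeholders:\n",
"\n",
"```python\n",
"import pytorch_lightning as pl\n",
"from ray import air, tune\n",
"from ray.tune.integration.pytorch_lightning import TuneReportCallback\n",
"\n",
"\n",
"def train_fn(config):\n",
"    # Plain LightningModule training, plus one callback that reports\n",
"    # validation metrics back to Tune after each validation epoch.\n",
"    model = LightningMNISTClassifier(config)\n",
"    trainer = pl.Trainer(\n",
"        max_epochs=10,\n",
"        enable_progress_bar=False,\n",
"        callbacks=[\n",
"            TuneReportCallback(\n",
"                {\"loss\": \"ptl/val_loss\", \"mean_accuracy\": \"ptl/val_accuracy\"},\n",
"                on=\"validation_end\",\n",
"            )\n",
"        ],\n",
"    )\n",
"    trainer.fit(model)\n",
"\n",
"\n",
"tuner = tune.Tuner(\n",
"    train_fn,\n",
"    tune_config=tune.TuneConfig(metric=\"loss\", mode=\"min\", num_samples=10),\n",
"    run_config=air.RunConfig(name=\"tune_mnist_sketch\"),\n",
"    param_space={\n",
"        \"layer_1_size\": tune.choice([32, 64, 128]),\n",
"        \"layer_2_size\": tune.choice([64, 128, 256]),\n",
"        \"lr\": tune.loguniform(1e-4, 1e-1),\n",
"        \"batch_size\": tune.choice([32, 64, 128]),\n",
"    },\n",
")\n",
"results = tuner.fit()\n",
"print(\"Best config:\", results.get_best_result().config)\n",
"```\n",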
"\n",
|
|
"## More PyTorch Lightning Examples\n",
|
|
"\n",
|
|
"- {doc}`/tune/examples/includes/mnist_ptl_mini`:\n",
|
|
" A minimal example of using [Pytorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning)\n",
|
|
" to train a MNIST model. This example utilizes the Ray Tune-provided\n",
|
|
" {ref}`PyTorch Lightning callbacks <tune-integration-pytorch-lightning>`.\n",
|
|
" See also {ref}`this tutorial for a full walkthrough <tune-pytorch-lightning-ref>`.\n",
|
|
"- {ref}`A walkthrough tutorial for using Ray Tune with Pytorch-Lightning <tune-pytorch-lightning-ref>`.\n",
|
|
"- {doc}`/tune/examples/includes/mlflow_ptl_example`: Example for using [MLflow](https://github.com/mlflow/mlflow/)\n",
|
|
" and [Pytorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning) with Ray Tune."
|
|
]
|
|
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
},
"orphan": true
},
"nbformat": 4,
"nbformat_minor": 5
}