# Batch L-BFGS

This document provides a walkthrough of the L-BFGS example. To run the application, first install these dependencies.

- SciPy
- [TensorFlow](https://www.tensorflow.org/)

Then from the directory `ray/examples/lbfgs/` run the following.

```
source ../../setup-env.sh
python driver.py
```

Optimization is at the heart of many machine learning algorithms. Much of machine learning involves specifying a loss function and finding the parameters that minimize the loss. If we can compute the gradient of the loss function, then we can apply a variety of gradient-based optimization algorithms. L-BFGS is one such algorithm. It is a quasi-Newton method that uses gradient information to approximate the inverse Hessian of the loss function in a computationally efficient manner.

## The serial version

First we load the data in batches. Here, each element in `batches` is a tuple whose first component is a batch of `100` images and whose second component is a batch of the `100` corresponding labels. For simplicity, we use TensorFlow's built-in methods for loading the data.

```python
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

batch_size = 100
num_batches = mnist.train.num_examples // batch_size
batches = [mnist.train.next_batch(batch_size) for _ in range(num_batches)]
```

Now, suppose we have defined a function which takes a set of model parameters `theta` and a batch of data (both images and labels) and computes the loss for that choice of model parameters on that batch of data. Similarly, suppose we've also defined a function that takes the same arguments and computes the gradient of the loss for that choice of model parameters.

```python
def loss(theta, xs, ys):
    # compute the loss on a batch of data
    return loss

def grad(theta, xs, ys):
    # compute the gradient on a batch of data
    return grad

def full_loss(theta):
    # compute the loss on the full data set
    return sum([loss(theta, xs, ys) for (xs, ys) in batches])

def full_grad(theta):
    # compute the gradient on the full data set
    return sum([grad(theta, xs, ys) for (xs, ys) in batches])
```

Since we are working with a small dataset, we don't actually need to separate these methods into the part that operates on a batch and the part that operates on the full dataset, but doing so will make the distributed version clearer.

Now, if we wish to optimize the loss function using L-BFGS, we simply plug these functions, along with an initial choice of model parameters, into `scipy.optimize.fmin_l_bfgs_b`.

```python
import numpy as np
import scipy.optimize

theta_init = 1e-2 * np.random.normal(size=dim)
result = scipy.optimize.fmin_l_bfgs_b(full_loss, theta_init, fprime=full_grad)
```

## The distributed version

In this example, the computation of the gradient itself can be done in parallel on a number of workers or machines.

First, let's turn the data into a collection of remote objects.

```python
batch_ids = [(ray.put(xs), ray.put(ys)) for (xs, ys) in batches]
```

We can load the data on the driver and distribute it this way because MNIST easily fits on a single machine. However, for larger data sets, we will need to use remote functions to distribute the loading of the data.

Now, let's turn `loss` and `grad` into remote functions.

```python
@ray.remote
def loss(theta, xs, ys):
    # compute the loss
    return loss

@ray.remote
def grad(theta, xs, ys):
    # compute the gradient
    return grad
```

The only difference is that we added the `@ray.remote` decorator.
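
Calling a remote function works a little differently from calling a regular function: we invoke it with `.remote(...)`, which immediately returns an object ID, and we retrieve the actual result with `ray.get`. The following is a minimal sketch of this pattern (it is not part of the example's `driver.py`, and the variable names are only for illustration).

```python
# Invoking a remote function with .remote returns an object ID right away;
# the task itself is executed asynchronously by a worker.
loss_id = loss.remote(theta, xs, ys)

# ray.get blocks until the task has finished and then returns its result.
batch_loss = ray.get(loss_id)
```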
Now, it is easy to speed up the computation of the full loss and the full gradient.

```python
def full_loss(theta):
    theta_id = ray.put(theta)
    loss_ids = [loss.remote(theta_id, xs_id, ys_id) for (xs_id, ys_id) in batch_ids]
    return sum(ray.get(loss_ids))

def full_grad(theta):
    theta_id = ray.put(theta)
    grad_ids = [grad.remote(theta_id, xs_id, ys_id) for (xs_id, ys_id) in batch_ids]
    # This conversion to float64 is necessary for use with fmin_l_bfgs_b.
    return sum(ray.get(grad_ids)).astype("float64")
```

Note that we turn `theta` into a remote object with the line `theta_id = ray.put(theta)` before passing it into the remote functions. If we had written

```python
[loss.remote(theta, xs_id, ys_id) for (xs_id, ys_id) in batch_ids]
```

instead of

```python
theta_id = ray.put(theta)
[loss.remote(theta_id, xs_id, ys_id) for (xs_id, ys_id) in batch_ids]
```

then each task that got sent to the scheduler (one for every element of `batch_ids`) would have had a copy of `theta` serialized inside of it. Since `theta` here consists of the parameters of a potentially large model, this is inefficient. *Large objects should be passed by object ID to remote functions and not by value*.

We use remote functions and remote objects internally in the implementation of `full_loss` and `full_grad`, but the user-facing behavior of these methods is identical to the behavior in the serial version.

We can now optimize the objective with the same function call as before.

```python
theta_init = 1e-2 * np.random.normal(size=dim)
result = scipy.optimize.fmin_l_bfgs_b(full_loss, theta_init, fprime=full_grad)
```
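
As a final sanity check (not part of the original walkthrough), the return value of `scipy.optimize.fmin_l_bfgs_b` can be unpacked to inspect the optimized parameters and the final loss.

```python
# fmin_l_bfgs_b returns a tuple: the optimized parameters, the loss at the
# optimum, and a dictionary with convergence information.
theta_opt, final_loss, info = result
print("final loss:", final_loss)
print("number of loss evaluations:", info["funcalls"])
```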