Batch L-BFGS
============

This document provides a walkthrough of the L-BFGS example. To run the
application, first install these dependencies.

.. code-block:: bash

  pip install tensorflow
  pip install scipy

You can view the `code for this example`_.

.. _`code for this example`: https://github.com/ray-project/ray/tree/master/doc/examples/lbfgs

Then you can run the example as follows.

.. code-block:: bash

  python ray/doc/examples/lbfgs/driver.py

Optimization is at the heart of many machine learning algorithms. Much of
machine learning involves specifying a loss function and finding the parameters
that minimize the loss. If we can compute the gradient of the loss function,
then we can apply a variety of gradient-based optimization algorithms. L-BFGS is
one such algorithm. It is a quasi-Newton method that uses gradient information
to approximate the inverse Hessian of the loss function in a computationally
efficient manner.
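
In outline, each L-BFGS step moves the parameters along a direction built from the
current gradient and an approximation to the inverse Hessian that the algorithm
assembles from a small window of recent gradient differences:

.. math::

    \theta_{k+1} = \theta_k - \alpha_k H_k \nabla f(\theta_k)

Here :math:`f` is the loss, :math:`H_k` is the inverse-Hessian approximation, and
:math:`\alpha_k` is a step size chosen by a line search. We will not implement any
of this ourselves; ``scipy.optimize.fmin_l_bfgs_b`` takes care of it, and our job
is only to supply the loss and its gradient.
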
The serial version
------------------

First we load the data in batches. Here, each element in ``batches`` is a tuple
whose first component is a batch of ``100`` images and whose second component is a
batch of the ``100`` corresponding labels. For simplicity, we use TensorFlow's
built-in methods for loading the data.

.. code-block:: python

  from tensorflow.examples.tutorials.mnist import input_data

  mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
  batch_size = 100
  num_batches = mnist.train.num_examples // batch_size
  batches = [mnist.train.next_batch(batch_size) for _ in range(num_batches)]

Now, suppose we have defined a function which takes a set of model parameters
``theta`` and a batch of data (both images and labels) and computes the loss for
that choice of model parameters on that batch of data. Similarly, suppose we've
also defined a function that takes the same arguments and computes the gradient
of the loss for that choice of model parameters.

.. code-block:: python

  def loss(theta, xs, ys):
      # Compute the loss on a batch of data.
      return loss

  def grad(theta, xs, ys):
      # Compute the gradient on a batch of data.
      return grad

  def full_loss(theta):
      # Compute the loss on the full data set.
      return sum([loss(theta, xs, ys) for (xs, ys) in batches])

  def full_grad(theta):
      # Compute the gradient on the full data set.
      return sum([grad(theta, xs, ys) for (xs, ys) in batches])

Since we are working with a small dataset, we don't actually need to separate
these methods into the part that operates on a batch and the part that operates
on the full dataset, but doing so will make the distributed version clearer.
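
For concreteness, here is one minimal way the batch-level functions could be filled
in, using a plain softmax regression model written directly in NumPy. This is only
an illustrative sketch of the interface that the optimizer needs (a flat parameter
vector in, a scalar loss and a flat gradient out); it is not the network used in
the actual example, and the reshaping of ``theta`` into a single weight matrix is
an assumption made here for simplicity. It also fixes one possible value for the
``dim`` that appears below.

.. code-block:: python

  import numpy as np

  num_features = 28 * 28  # Flattened MNIST images.
  num_classes = 10
  dim = num_features * num_classes  # Number of model parameters.

  def softmax_probs(theta, xs):
      # Interpret theta as a weight matrix and compute class probabilities.
      W = theta.reshape(num_features, num_classes)
      logits = xs.dot(W)
      logits -= logits.max(axis=1, keepdims=True)  # For numerical stability.
      exp_logits = np.exp(logits)
      return exp_logits / exp_logits.sum(axis=1, keepdims=True)

  def loss(theta, xs, ys):
      # Average cross-entropy loss on one batch.
      probs = softmax_probs(theta, xs)
      return -np.sum(ys * np.log(probs + 1e-12)) / xs.shape[0]

  def grad(theta, xs, ys):
      # Gradient of the average cross-entropy loss on one batch.
      probs = softmax_probs(theta, xs)
      return (xs.T.dot(probs - ys) / xs.shape[0]).ravel()
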
Now, if we wish to optimize the loss function using L-BFGS, we simply plug these
functions, along with an initial choice of model parameters, into
``scipy.optimize.fmin_l_bfgs_b``.

.. code-block:: python

  import numpy as np
  import scipy.optimize

  theta_init = 1e-2 * np.random.normal(size=dim)
  result = scipy.optimize.fmin_l_bfgs_b(full_loss, theta_init, fprime=full_grad)

The distributed version
-----------------------

In this example, the computation of the gradient itself can be done in parallel
on a number of workers or machines.
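
One detail the snippets below take for granted: Ray has to be started before we can
create remote objects or actors. A minimal sketch, assuming everything runs on a
single machine:

.. code-block:: python

  import ray

  # Start Ray on this machine. On a cluster, ray.init would instead be
  # given the address of the running cluster.
  ray.init()
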
First, let's turn the data into a collection of remote objects.

.. code-block:: python

  batch_ids = [(ray.put(xs), ray.put(ys)) for (xs, ys) in batches]
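
Each entry of ``batch_ids`` is now a pair of object IDs rather than a pair of
arrays; the arrays themselves live in Ray's object store, and ``ray.get`` retrieves
a value from its ID. For example:

.. code-block:: python

  # Fetch the images of the first batch back from the object store.
  xs0 = ray.get(batch_ids[0][0])
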
We can load the data on the driver and distribute it this way because MNIST
easily fits on a single machine. However, for larger data sets, we will need to
use remote functions to distribute the loading of the data.
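
That approach could look roughly like the sketch below, where ``load_batch`` and
``read_batch_from_storage`` are hypothetical helpers, and each object ID refers to
a whole ``(xs, ys)`` pair rather than to the images and labels separately.

.. code-block:: python

  @ray.remote
  def load_batch(batch_index):
      # Hypothetical helper that reads one batch of images and labels from
      # shared storage on whichever machine this task is scheduled on.
      xs, ys = read_batch_from_storage(batch_index)
      return xs, ys

  # Each call returns an object ID immediately; the data itself never
  # passes through the driver process.
  batch_ids = [load_batch.remote(i) for i in range(num_batches)]
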
Now, let's turn ``loss`` and ``grad`` into methods of an actor that will contain
our network and hold one batch of data.

.. code-block:: python

  @ray.remote
  class Network(object):
      def __init__(self, xs, ys):
          # Initialize the network and store one batch of data.
          self.xs = xs
          self.ys = ys

      def loss(self, theta):
          # Compute the loss on the stored batch.
          return loss

      def grad(self, theta):
          # Compute the gradient on the stored batch.
          return grad
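
The definitions of ``full_loss`` and ``full_grad`` below iterate over a list called
``actors``. A minimal sketch of how that list might be created, assuming one actor
per batch and reusing the object IDs from ``batch_ids``:

.. code-block:: python

  # Create one actor per batch. Ray resolves the object IDs, so each actor
  # receives its batch directly from the object store.
  actors = [Network.remote(xs_id, ys_id) for (xs_id, ys_id) in batch_ids]
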
Now, it is easy to speed up the computation of the full loss and the full
gradient.

.. code-block:: python

  def full_loss(theta):
      theta_id = ray.put(theta)
      loss_ids = [actor.loss.remote(theta_id) for actor in actors]
      return sum(ray.get(loss_ids))

  def full_grad(theta):
      theta_id = ray.put(theta)
      grad_ids = [actor.grad.remote(theta_id) for actor in actors]
      # The conversion to float64 is necessary for use with fmin_l_bfgs_b.
      return sum(ray.get(grad_ids)).astype("float64")
Note that we turn ``theta`` into a remote object with the line ``theta_id =
ray.put(theta)`` before passing it into the remote calls. If we had written

.. code-block:: python

  [actor.loss.remote(theta) for actor in actors]

instead of

.. code-block:: python

  theta_id = ray.put(theta)
  [actor.loss.remote(theta_id) for actor in actors]

then each task that got sent to the scheduler (one for each actor in ``actors``)
would have had a copy of ``theta`` serialized inside of it. Since ``theta`` here
consists of the parameters of a potentially large model, this is inefficient.
*Large objects should be passed by object ID to remote functions and not by
value*.
We use remote actors and remote objects internally in the implementation of
``full_loss`` and ``full_grad``, but the user-facing behavior of these methods is
identical to the behavior in the serial version.

We can now optimize the objective with the same function call as before.

.. code-block:: python

  theta_init = 1e-2 * np.random.normal(size=dim)
  result = scipy.optimize.fmin_l_bfgs_b(full_loss, theta_init, fprime=full_grad)
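
``fmin_l_bfgs_b`` returns a tuple containing the optimized parameters, the final
loss value, and a dictionary of convergence information, so the result can be
unpacked as follows.

.. code-block:: python

  theta_opt, loss_opt, info = result
  print("Final loss:", loss_opt)
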