.. _air-checkpoints-doc:

Using Checkpoints
=================

AIR trainers, tuners, and custom pretrained models generate Checkpoints. An AIR Checkpoint is a common format for models that
are used across different components of the Ray AI Runtime. This common format allows for easy interoperability among AIR components
and seamless integration with supported external machine learning frameworks.

.. image:: images/checkpoints.jpg

What is a checkpoint?
---------------------

A Checkpoint object is a serializable reference to a model. A model can be represented in one of three ways:

- as a directory on local (on-disk) storage
- as a directory on external storage (e.g., cloud storage)
- as an in-memory dictionary

Because of these different storage representations, Checkpoints provide useful flexibility in
distributed environments, where you want to recreate an instance of the same model on multiple nodes or
across different Ray clusters.

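As a minimal sketch of the in-memory representation (assuming the ``ray.air.checkpoint.Checkpoint`` import path and a toy dictionary payload; real checkpoints hold actual model state):

.. code-block:: python

    from ray.air.checkpoint import Checkpoint

    # Create a checkpoint from an in-memory dictionary.
    # The payload here is a stand-in for real model state.
    checkpoint = Checkpoint.from_dict({"model_weights": [1, 2, 3]})

    # Recover the dictionary representation on any process that holds
    # a reference to the checkpoint.
    restored = checkpoint.to_dict()
    assert restored["model_weights"] == [1, 2, 3]
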
How to create a checkpoint?
---------------------------

There are two ways to generate a checkpoint.

The first way is to generate it from a pretrained model. Each machine learning (ML) framework supported
by AIR provides a ``Checkpoint`` method that can be used to generate an AIR checkpoint:

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __checkpoint_quick_start__
    :end-before: __checkpoint_quick_end__

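For concreteness, here is a minimal sketch of this pattern, assuming the PyTorch integration's ``TorchCheckpoint.from_model`` constructor (other supported frameworks follow the same shape):

.. code-block:: python

    import torch.nn as nn

    from ray.train.torch import TorchCheckpoint

    # A pretrained (here: freshly constructed) model to wrap.
    model = nn.Linear(in_features=4, out_features=1)

    # Wrap the model in an AIR checkpoint.
    checkpoint = TorchCheckpoint.from_model(model)
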
The second way is to retrieve it from the Result object returned by a Trainer or Tuner:

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __use_trainer_checkpoint_start__
    :end-before: __use_trainer_checkpoint_end__

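A minimal sketch of this retrieval, assuming an already-configured ``trainer`` (any AIR trainer works the same way):

.. code-block:: python

    # `trainer` is assumed to be a configured AIR trainer,
    # e.g. an XGBoostTrainer.
    result = trainer.fit()

    # The checkpoint produced by training is attached to the Result.
    checkpoint = result.checkpoint
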
How to use a checkpoint?
------------------------

Checkpoints can be used to instantiate a :class:`Predictor`, :class:`BatchPredictor`, or :class:`PredictorDeployment` class.
The resulting in-memory instance can then be used for inference.

For instance, the code example below shows how a checkpoint is used with a :class:`BatchPredictor` for scalable batch inference:

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __batch_pred_start__
    :end-before: __batch_pred_end__

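As a sketch of this pattern, assuming an XGBoost checkpoint named ``checkpoint`` and a small in-memory dataset (the predictor class must match the framework that produced the checkpoint):

.. code-block:: python

    import pandas as pd

    import ray
    from ray.train.batch_predictor import BatchPredictor
    from ray.train.xgboost import XGBoostPredictor

    # Pair the checkpoint with the matching predictor class.
    batch_predictor = BatchPredictor.from_checkpoint(checkpoint, XGBoostPredictor)

    # Run distributed inference over a Ray Dataset.
    dataset = ray.data.from_pandas(pd.DataFrame({"x": [1.0, 2.0, 3.0]}))
    predictions = batch_predictor.predict(dataset)
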
The example below demonstrates how to use a checkpoint for online inference via :class:`PredictorDeployment`:

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __online_inference_start__
    :end-before: __online_inference_end__

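To see the moving parts, here is a hand-rolled sketch of the same idea, not the :class:`PredictorDeployment` API itself: a plain Ray Serve deployment that rehydrates a predictor from an assumed XGBoost ``checkpoint``:

.. code-block:: python

    import pandas as pd

    from ray import serve
    from ray.train.xgboost import XGBoostPredictor

    @serve.deployment
    class ModelServer:
        def __init__(self, checkpoint):
            # Rehydrate a predictor from the checkpoint once, at startup.
            self.predictor = XGBoostPredictor.from_checkpoint(checkpoint)

        async def __call__(self, request):
            payload = await request.json()
            # Convert the JSON payload to the tabular format
            # the predictor expects.
            return self.predictor.predict(pd.DataFrame(payload)).to_dict()

    # `checkpoint` is assumed to come from one of the creation paths above.
    serve.run(ModelServer.bind(checkpoint))
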
Furthermore, a Checkpoint object has methods to translate between different checkpoint storage locations.
With this flexibility, Checkpoint objects can be serialized and used in different contexts
(e.g., on a different process or a different machine):

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __basic_checkpoint_start__
    :end-before: __basic_checkpoint_end__

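A minimal sketch of translating a checkpoint between storage locations (the local path is illustrative, and the cloud URI is a hypothetical bucket you would substitute):

.. code-block:: python

    from ray.air.checkpoint import Checkpoint

    checkpoint = Checkpoint.from_dict({"model_weights": [1, 2, 3]})

    # Materialize the checkpoint as a local directory...
    path = checkpoint.to_directory("/tmp/my_checkpoint")

    # ...and recreate it elsewhere (e.g., on another node)
    # from that directory.
    restored = Checkpoint.from_directory(path)

    # Cloud storage works the same way via URIs
    # (requires storage credentials):
    # uri = checkpoint.to_uri("s3://my-bucket/my_checkpoint")
    # restored = Checkpoint.from_uri(uri)
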
Example: Using Checkpoints with MLflow
--------------------------------------

`MLflow <https://mlflow.org/>`__ has its own `checkpoint format <https://www.mlflow.org/docs/latest/models.html>`__ called
the "MLflow Model." It is a standard format for packaging machine learning models that can be used in a variety of downstream tools.

Below is an example of using an MLflow model as a Ray AIR Checkpoint:

.. literalinclude:: doc_code/checkpoint_mlflow.py
    :language: python
    :start-after: __mlflow_checkpoint_start__
    :end-before: __mlflow_checkpoint_end__

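One possible shape of this interop, sketched under the assumption of a scikit-learn model and MLflow's ``save_model``/``load_model`` helpers (the local paths are illustrative):

.. code-block:: python

    import mlflow.sklearn
    from sklearn.linear_model import LinearRegression

    from ray.air.checkpoint import Checkpoint

    # Save a model in MLflow Model format to a local directory.
    model = LinearRegression().fit([[0.0], [1.0]], [0.0, 1.0])
    mlflow.sklearn.save_model(model, "/tmp/mlflow_model")

    # Wrap the MLflow Model directory as an AIR checkpoint...
    checkpoint = Checkpoint.from_directory("/tmp/mlflow_model")

    # ...and recover the MLflow model wherever the checkpoint travels.
    local_dir = checkpoint.to_directory()
    restored = mlflow.sklearn.load_model(local_dir)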