[air] Remove checkpoint user guide and update key concepts and docstring (#27455)

Richard Liaw 2022-08-04 08:55:26 -07:00 committed by GitHub
parent 8d5c07b781
commit b2cd34cc5c
8 changed files with 125 additions and 308 deletions


@@ -15,7 +15,6 @@ parts:
- file: ray-air/user-guides
  sections:
    - file: ray-air/preprocessors
    - file: ray-air/checkpoints
    - file: ray-air/check-ingest
    - file: ray-air/trainer
    - file: ray-air/tuner


@@ -1,89 +0,0 @@
.. _air-checkpoints-doc:
Using Checkpoints
=================
The AIR trainers, tuners, and custom pretrained models generate Checkpoints. An AIR Checkpoint is a common format for models that
are used across different components of the Ray AI Runtime. This common format allows easy interoperability among AIR components
and seamless integration with supported external machine learning frameworks.
.. image:: images/checkpoints.jpg
What is a checkpoint?
---------------------
A Checkpoint object is a serializable reference to a model. A model can be represented in one of three ways:
- as a directory on local (on-disk) storage
- as a directory on an external storage (e.g., cloud storage)
- as an in-memory dictionary
Because of these different model storage representations, Checkpoints provide useful flexibility in
distributed environments, where you want to recreate an instance of the same model on multiple nodes or
across different Ray clusters.
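For illustration, here is a minimal sketch that moves a single checkpoint through all three representations (the S3 URI is a hypothetical bucket and requires configured credentials):

.. code-block:: python

    from ray.air.checkpoint import Checkpoint

    # In-memory dictionary representation.
    checkpoint = Checkpoint.from_dict({"model_weights": [1, 2, 3]})

    # Local on-disk representation.
    path = checkpoint.to_directory()
    checkpoint = Checkpoint.from_directory(path)

    # External storage representation (hypothetical bucket):
    # uri = checkpoint.to_uri("s3://my-bucket/my-checkpoint")
    # checkpoint = Checkpoint.from_uri(uri)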
How to create a checkpoint?
---------------------------
There are two ways to generate a checkpoint.
The first way is to generate it from a pretrained model. Each AIR-supported machine learning (ML) framework has
a ``Checkpoint`` class that can be used to generate an AIR checkpoint:

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __checkpoint_quick_start__
    :end-before: __checkpoint_quick_end__
Another way is to retrieve it from the result object returned by a Trainer or Tuner.

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __use_trainer_checkpoint_start__
    :end-before: __use_trainer_checkpoint_end__
How to use a checkpoint?
------------------------
Checkpoints can be used to instantiate a :class:`Predictor`, :class:`BatchPredictor`, or :class:`PredictorDeployment` class.
The resulting instance holds the model in memory and can be used for inference.
For instance, the code example below shows how a checkpoint is used with :class:`BatchPredictor` for scalable batch inference:

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __batch_pred_start__
    :end-before: __batch_pred_end__
Another example below demonstrates how to use a checkpoint for online inference via :class:`PredictorDeployment`:

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __online_inference_start__
    :end-before: __online_inference_end__
Furthermore, a Checkpoint object has methods to translate between different checkpoint storage locations.
With this flexibility, Checkpoint objects can be serialized and used in different contexts
(e.g., on a different process or a different machine):

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __basic_checkpoint_start__
    :end-before: __basic_checkpoint_end__
Example: Using Checkpoints with MLflow
--------------------------------------
`MLflow <https://mlflow.org/>`__ has its own `checkpoint format <https://www.mlflow.org/docs/latest/models.html>`__ called
the "MLflow Model," a standard format for packaging machine learning models that can be used in a variety of downstream tools.
Below is an example of using an MLflow model as a Ray AIR Checkpoint.

.. literalinclude:: doc_code/checkpoint_mlflow.py
    :language: python
    :start-after: __mlflow_checkpoint_start__
    :end-before: __mlflow_checkpoint_end__


@@ -65,6 +65,37 @@ best_result = result_grid.get_best_result()
print(best_result)
# __air_tuner_end__
# __air_checkpoints_start__
checkpoint = result.checkpoint
print(checkpoint)
# Checkpoint(local_path=..../checkpoint_000005)
tuned_checkpoint = result_grid.get_best_result().checkpoint
print(tuned_checkpoint)
# Checkpoint(local_path=..../checkpoint_000005)
# __air_checkpoints_end__
# __checkpoint_adhoc_start__
from ray.train.tensorflow import TensorflowCheckpoint
import tensorflow as tf
# This can be a trained model.
def build_model() -> tf.keras.Model:
    model = tf.keras.Sequential(
        [
            tf.keras.layers.InputLayer(input_shape=(1,)),
            tf.keras.layers.Dense(1),
        ]
    )
    return model
model = build_model()
checkpoint = TensorflowCheckpoint.from_model(model)
# __checkpoint_adhoc_end__
# __air_batch_predictor_start__
from ray.train.batch_predictor import BatchPredictor
from ray.train.xgboost import XGBoostPredictor


@@ -1,122 +0,0 @@
# flake8: noqa
# isort: skip_file
# __checkpoint_quick_start__
from ray.train.tensorflow import TensorflowCheckpoint
import tensorflow as tf
# This can be a trained model.
def build_model() -> tf.keras.Model:
    model = tf.keras.Sequential(
        [
            tf.keras.layers.InputLayer(input_shape=(1,)),
            tf.keras.layers.Dense(1),
        ]
    )
    return model
model = build_model()
checkpoint = TensorflowCheckpoint.from_model(model)
# __checkpoint_quick_end__
# __use_trainer_checkpoint_start__
import ray
from ray.train.xgboost import XGBoostTrainer
from ray.air.config import ScalingConfig
dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
# Split data into train and validation.
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)
trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=2),
    label_column="target",
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset},
    num_boost_round=5,
)
result = trainer.fit()
checkpoint = result.checkpoint
# __use_trainer_checkpoint_end__
# __batch_pred_start__
from ray.train.batch_predictor import BatchPredictor
from ray.train.xgboost import XGBoostPredictor
# Create a test dataset by dropping the target column.
test_dataset = valid_dataset.drop_columns(["target"])
batch_predictor = BatchPredictor.from_checkpoint(checkpoint, XGBoostPredictor)
# Bulk batch prediction.
batch_predictor.predict(test_dataset)
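# The call above returns a Ray Dataset of predictions, which can be
# consumed directly or written out, e.g. with Dataset.write_csv().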
# __batch_pred_end__
# __online_inference_start__
import requests
from fastapi import Request
import pandas as pd
from ray import serve
from ray.serve import PredictorDeployment
from ray.serve.http_adapters import json_request
async def adapter(request: Request):
    content = await request.json()
    print(content)
    return pd.DataFrame.from_dict(content)
serve.start(detached=True)
deployment = PredictorDeployment.options(name="XGBoostService")
deployment.deploy(
    XGBoostPredictor, checkpoint, batching_params=False, http_adapter=adapter
)
print(deployment.url)
sample_input = test_dataset.take(1)
sample_input = dict(sample_input[0])
output = requests.post(deployment.url, json=[sample_input]).json()
print(output)
# __online_inference_end__
# __basic_checkpoint_start__
from ray.air.checkpoint import Checkpoint
# Create checkpoint data dict
checkpoint_data = {"data": 123}
# Create checkpoint object from data
checkpoint = Checkpoint.from_dict(checkpoint_data)
# Save checkpoint to a directory on the file system.
path = checkpoint.to_directory()
# This path can then be passed around,
# e.g. to a different function or a different script.
# You can also use `checkpoint.to_uri/from_uri` to
# read from/write to cloud storage
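# For example (hypothetical bucket; requires configured cloud credentials):
# uri = checkpoint.to_uri("s3://my-bucket/checkpoint-dir")
# checkpoint = Checkpoint.from_uri(uri)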
# In another function or script, recover Checkpoint object from path
checkpoint = Checkpoint.from_directory(path)
# Convert into dictionary again
recovered_data = checkpoint.to_dict()
# It is guaranteed that the original data has been recovered
assert recovered_data == checkpoint_data
# __basic_checkpoint_end__

Binary file images/checkpoints.jpg (22 KiB) not shown.


@@ -43,9 +43,8 @@ See the documentation on :ref:`Trainers <air-trainers>`.
    :start-after: __air_trainer_start__
    :end-before: __air_trainer_end__
Trainer objects will produce a :ref:`Result <air-results-ref>` object after calling ``.fit()``. These objects will contain training metrics as long as checkpoints to retrieve the best model.
Trainer objects produce a :ref:`Result <air-results-ref>` object after calling ``.fit()``.
These objects contain training metrics as well as checkpoints to retrieve the best model.
.. literalinclude:: doc_code/air_key_concepts.py
    :language: python
@@ -65,11 +64,40 @@ Tuners can work seamlessly with any Trainer but also can support arbitrary train
    :start-after: __air_tuner_start__
    :end-before: __air_tuner_end__
.. _air-checkpoints-doc:
Checkpoints
-----------
The AIR trainers, tuners, and custom pretrained models generate a framework-specific :class:`Checkpoint <ray.air.Checkpoint>` object.
Checkpoints are a common interface for models that are used across different AIR components and libraries.
There are two main ways to generate a checkpoint.
First, a Checkpoint object can be retrieved from the ``Result`` object returned by a Trainer or Tuner ``.fit()`` call.

.. literalinclude:: doc_code/air_key_concepts.py
    :language: python
    :start-after: __air_checkpoints_start__
    :end-before: __air_checkpoints_end__
You can also generate a checkpoint from a pretrained model. Each AIR-supported machine learning (ML) framework has
a ``Checkpoint`` object that can be used to generate an AIR checkpoint:

.. literalinclude:: doc_code/air_key_concepts.py
    :language: python
    :start-after: __checkpoint_adhoc_start__
    :end-before: __checkpoint_adhoc_end__
Checkpoints can be used to instantiate the :class:`Predictor`, :class:`BatchPredictor`, or :class:`PredictorDeployment` classes,
as seen below.
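For instance, here is a minimal ``Predictor`` sketch, assuming the ``TensorflowCheckpoint`` and ``build_model`` from the snippet above (the input batch is illustrative):

.. code-block:: python

    import numpy as np

    from ray.train.tensorflow import TensorflowPredictor

    # Recreate the Keras model from the checkpoint and run small,
    # in-process inference on a NumPy batch.
    predictor = TensorflowPredictor.from_checkpoint(
        checkpoint, model_definition=build_model
    )
    predictions = predictor.predict(np.array([[1.0], [2.0]]))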
Batch Predictor
---------------
You can take a trained model and do batch inference using the BatchPredictor object.
You can take a checkpoint and do batch inference using the BatchPredictor object.
.. literalinclude:: doc_code/air_key_concepts.py
    :language: python


@@ -23,17 +23,6 @@ AIR User Guides
    :text: Using Preprocessors
    :classes: btn-link btn-block stretched-link
---
:img-top: /ray-overview/images/ray_svg_logo.svg
+++
.. link-button:: /ray-air/checkpoints
    :type: ref
    :text: Using Checkpoints
    :classes: btn-link btn-block stretched-link
---
:img-top: /ray-overview/images/ray_svg_logo.svg


@@ -42,34 +42,20 @@ logger = logging.getLogger(__name__)
class Checkpoint:
"""Ray AIR Checkpoint.
This implementation provides methods to translate between
different checkpoint storage locations: Local storage, external storage
(e.g. cloud storage), and data dict representations.
An AIR Checkpoint is a common interface for accessing models across
different AIR components and libraries. A Checkpoint can have its data
represented in one of three ways:
The constructor is a private API; use the ``from_`` methods to create
checkpoint objects instead (e.g. ``Checkpoint.from_directory()``).
- as a directory on local (on-disk) storage
- as a directory on an external storage (e.g., cloud storage)
- as an in-memory dictionary
When converting between different checkpoint formats, it is guaranteed
that a full round trip of conversions (e.g. directory --> dict -->
obj ref --> directory) will recover the original checkpoint data.
There are no guarantees made about compatibility of intermediate
representations.
The Checkpoint object also has methods to translate between different checkpoint
storage locations. These storage representations provide flexibility in
distributed environments, where you may want to recreate an instance of
the same model on multiple nodes or across different Ray clusters.
New data can be added to a Checkpoint during conversion. Consider the
following conversion: directory --> dict (adding dict["foo"] = "bar")
--> directory --> dict (expect to see dict["foo"] = "bar"). Note that
the second directory will contain pickle files with the serialized additional
field data in them.
Similarly with a dict as a source: dict --> directory (add file "foo.txt")
--> dict --> directory (will have "foo.txt" in it again). Note that the second
dict representation will contain an extra field with the serialized additional
files in it.
Examples:
Example for an arbitrary data checkpoint:
Example:
.. code-block:: python
@@ -81,12 +67,15 @@ class Checkpoint:
# Create checkpoint object from data
checkpoint = Checkpoint.from_dict(checkpoint_data)
# Save checkpoint to temporary location
# Save checkpoint to a directory on the file system.
path = checkpoint.to_directory()
# This path can then be passed around, e.g. to a different function
# This path can then be passed around,
# e.g. to a different function or a different script.
# You can also use `checkpoint.to_uri/from_uri` to
# read from/write to cloud storage
# At some other location, recover Checkpoint object from path
# In another function or script, recover Checkpoint object from path
checkpoint = Checkpoint.from_directory(path)
# Convert into dictionary again
@@ -95,39 +84,31 @@ class Checkpoint:
# It is guaranteed that the original data has been recovered
assert recovered_data == checkpoint_data
Example using MLflow for saving and loading a classifier:
Checkpoints can be used to instantiate a :class:`Predictor`,
:class:`BatchPredictor`, or :class:`PredictorDeployment` class.
.. code-block:: python
The constructor is a private API; use the ``from_`` methods to create
checkpoint objects instead (e.g. ``Checkpoint.from_directory()``).
from ray.air.checkpoint import Checkpoint
from sklearn.ensemble import RandomForestClassifier
import mlflow.sklearn
*Other implementation notes:*
When converting between different checkpoint formats, it is guaranteed
that a full round trip of conversions (e.g. directory --> dict -->
obj ref --> directory) will recover the original checkpoint data.
There are no guarantees made about compatibility of intermediate
representations.
# Create an sklearn classifier
clf = RandomForestClassifier(max_depth=7, random_state=0)
# ... e.g. train model with clf.fit()
# Save model using MLflow
mlflow.sklearn.save_model(clf, "model_directory")
New data can be added to a Checkpoint
during conversion. Consider the following conversion:
directory --> dict (adding dict["foo"] = "bar")
--> directory --> dict (expect to see dict["foo"] = "bar"). Note that
the second directory will contain pickle files with the serialized additional
field data in them.
# Create checkpoint object from path
checkpoint = Checkpoint.from_directory("model_directory")
# Convert into dictionary
checkpoint_dict = checkpoint.to_dict()
# This dict can then be passed around, e.g. to a different function
# At some other location, recover checkpoint object from dict
checkpoint = Checkpoint.from_dict(checkpoint_dict)
# Convert into a directory again
checkpoint.to_directory("other_directory")
# We can now use MLflow to re-load the model
clf = mlflow.sklearn.load_model("other_directory")
# It is guaranteed that the original data was recovered
assert isinstance(clf, RandomForestClassifier)
Similarly with a dict as a source: dict --> directory (add file "foo.txt")
--> dict --> directory (will have "foo.txt" in it again). Note that the second
dict representation will contain an extra field with the serialized additional
files in it.
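For example, a minimal sketch of the conversions described above (the
key and value are illustrative):

.. code-block:: python

    from ray.air.checkpoint import Checkpoint

    # directory --> dict, adding a new field along the way.
    path = Checkpoint.from_dict({"data": 123}).to_directory()
    data = Checkpoint.from_directory(path).to_dict()
    data["foo"] = "bar"

    # dict --> directory --> dict: the added field survives.
    path2 = Checkpoint.from_dict(data).to_directory()
    assert Checkpoint.from_directory(path2).to_dict()["foo"] == "bar"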
Checkpoints can be pickled and sent to remote processes.
Please note that checkpoints pointing to local directories will be