[air] Remove checkpoint user guide and update key concepts and docstring (#27455)
This commit is contained in:
parent 8d5c07b781
commit b2cd34cc5c
8 changed files with 125 additions and 308 deletions
@@ -15,7 +15,6 @@ parts:
   - file: ray-air/user-guides
     sections:
       - file: ray-air/preprocessors
-      - file: ray-air/checkpoints
       - file: ray-air/check-ingest
       - file: ray-air/trainer
       - file: ray-air/tuner
@@ -1,89 +0,0 @@
.. _air-checkpoints-doc:

Using Checkpoints
=================

AIR trainers, tuners, and custom pretrained models generate Checkpoints. An AIR Checkpoint is a common format for models that are used across different components of the Ray AI Runtime. This common format allows easy interoperability among AIR components and seamless integration with supported external machine learning frameworks.

.. image:: images/checkpoints.jpg

What is a checkpoint?
---------------------

A Checkpoint object is a serializable reference to a model. A model can be represented in one of three ways:

- as a directory on local (on-disk) storage
- as a directory on external storage (e.g., cloud storage)
- as an in-memory dictionary

Because of these different storage representations, Checkpoints provide useful flexibility in distributed environments, where you want to recreate an instance of the same model on multiple nodes or across different Ray clusters.
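A minimal sketch of the three representations and the conversions between them (the ``s3://my-bucket/ckpt`` URI is a hypothetical placeholder, not part of this commit):

.. code-block:: python

    from ray.air.checkpoint import Checkpoint

    # In-memory dictionary representation.
    checkpoint = Checkpoint.from_dict({"data": 123})

    # Local (on-disk) directory representation.
    path = checkpoint.to_directory()
    checkpoint = Checkpoint.from_directory(path)

    # External storage representation (bucket name is hypothetical).
    uri = checkpoint.to_uri("s3://my-bucket/ckpt")
    checkpoint = Checkpoint.from_uri(uri)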
How to create a checkpoint?
---------------------------

There are two ways to generate a checkpoint.

The first way is to generate it from a pretrained model. Each AIR-supported machine learning (ML) framework has a ``Checkpoint`` class that can be used to generate an AIR checkpoint:

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __checkpoint_quick_start__
    :end-before: __checkpoint_quick_end__

The second way is to retrieve it from the result object returned by a Trainer or Tuner:

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __use_trainer_checkpoint_start__
    :end-before: __use_trainer_checkpoint_end__

How to use a checkpoint?
------------------------

Checkpoints can be used to instantiate a :class:`Predictor`, :class:`BatchPredictor`, or :class:`PredictorDeployment` class. An instance of the instantiated class can then be used for inference.

For instance, the code example below shows how a checkpoint is used with a :class:`BatchPredictor` for scalable batch inference:

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __batch_pred_start__
    :end-before: __batch_pred_end__

The next example demonstrates how to use a checkpoint for online inference via :class:`PredictorDeployment`:

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __online_inference_start__
    :end-before: __online_inference_end__

Furthermore, a Checkpoint object has methods to translate between different checkpoint storage locations. With this flexibility, Checkpoint objects can be serialized and used in different contexts (e.g., on a different process or a different machine):

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __basic_checkpoint_start__
    :end-before: __basic_checkpoint_end__

Example: Using Checkpoints with MLflow
--------------------------------------

`MLflow <https://mlflow.org/>`__ has its own `checkpoint format <https://www.mlflow.org/docs/latest/models.html>`__ called the "MLflow Model," a standard format for packaging machine learning models that can be used in a variety of downstream tools.

Below is an example of using an MLflow model as a Ray AIR Checkpoint:

.. literalinclude:: doc_code/checkpoint_mlflow.py
    :language: python
    :start-after: __mlflow_checkpoint_start__
    :end-before: __mlflow_checkpoint_end__
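The included ``doc_code/checkpoint_mlflow.py`` snippet is not part of this diff; judging from the MLflow example removed from the ``Checkpoint`` docstring further below, it looks roughly like this:

.. code-block:: python

    from ray.air.checkpoint import Checkpoint
    from sklearn.ensemble import RandomForestClassifier
    import mlflow.sklearn

    # Create an sklearn classifier (in practice, train it with clf.fit()).
    clf = RandomForestClassifier(max_depth=7, random_state=0)

    # Save the model in MLflow Model format.
    mlflow.sklearn.save_model(clf, "model_directory")

    # Wrap the MLflow model directory as an AIR Checkpoint.
    checkpoint = Checkpoint.from_directory("model_directory")

    # Materialize the checkpoint elsewhere and re-load the model with MLflow.
    checkpoint.to_directory("other_directory")
    clf = mlflow.sklearn.load_model("other_directory")
    assert isinstance(clf, RandomForestClassifier)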
@@ -65,6 +65,37 @@ best_result = result_grid.get_best_result()
 print(best_result)
 # __air_tuner_end__

+# __air_checkpoints_start__
+checkpoint = result.checkpoint
+print(checkpoint)
+# Checkpoint(local_path=..../checkpoint_000005)
+
+tuned_checkpoint = result_grid.get_best_result().checkpoint
+print(tuned_checkpoint)
+# Checkpoint(local_path=..../checkpoint_000005)
+# __air_checkpoints_end__
+
+
+# __checkpoint_adhoc_start__
+from ray.train.tensorflow import TensorflowCheckpoint
+import tensorflow as tf
+
+
+# This can be a trained model.
+def build_model() -> tf.keras.Model:
+    model = tf.keras.Sequential(
+        [
+            tf.keras.layers.InputLayer(input_shape=(1,)),
+            tf.keras.layers.Dense(1),
+        ]
+    )
+    return model
+
+
+model = build_model()
+
+checkpoint = TensorflowCheckpoint.from_model(model)
+# __checkpoint_adhoc_end__
+
+
 # __air_batch_predictor_start__
 from ray.train.batch_predictor import BatchPredictor
 from ray.train.xgboost import XGBoostPredictor
@@ -1,122 +0,0 @@
# flake8: noqa
# isort: skip_file

# __checkpoint_quick_start__
from ray.train.tensorflow import TensorflowCheckpoint
import tensorflow as tf


# This can be a trained model.
def build_model() -> tf.keras.Model:
    model = tf.keras.Sequential(
        [
            tf.keras.layers.InputLayer(input_shape=(1,)),
            tf.keras.layers.Dense(1),
        ]
    )
    return model


model = build_model()

checkpoint = TensorflowCheckpoint.from_model(model)
# __checkpoint_quick_end__


# __use_trainer_checkpoint_start__
import ray
from ray.train.xgboost import XGBoostTrainer
from ray.air.config import ScalingConfig


dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

# Split data into train and validation.
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=2),
    label_column="target",
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset},
    num_boost_round=5,
)

result = trainer.fit()
checkpoint = result.checkpoint
# __use_trainer_checkpoint_end__

# __batch_pred_start__
from ray.train.batch_predictor import BatchPredictor
from ray.train.xgboost import XGBoostPredictor

# Create a test dataset by dropping the target column.
test_dataset = valid_dataset.drop_columns(["target"])

batch_predictor = BatchPredictor.from_checkpoint(checkpoint, XGBoostPredictor)

# Bulk batch prediction.
batch_predictor.predict(test_dataset)
# __batch_pred_end__


# __online_inference_start__
import requests
from fastapi import Request
import pandas as pd

from ray import serve
from ray.serve import PredictorDeployment
from ray.serve.http_adapters import json_request


async def adapter(request: Request):
    content = await request.json()
    print(content)
    return pd.DataFrame.from_dict(content)


serve.start(detached=True)
deployment = PredictorDeployment.options(name="XGBoostService")

deployment.deploy(
    XGBoostPredictor, checkpoint, batching_params=False, http_adapter=adapter
)

print(deployment.url)

sample_input = test_dataset.take(1)
sample_input = dict(sample_input[0])

output = requests.post(deployment.url, json=[sample_input]).json()
print(output)
# __online_inference_end__

# __basic_checkpoint_start__
from ray.air.checkpoint import Checkpoint

# Create checkpoint data dict
checkpoint_data = {"data": 123}

# Create checkpoint object from data
checkpoint = Checkpoint.from_dict(checkpoint_data)

# Save checkpoint to a directory on the file system.
path = checkpoint.to_directory()

# This path can then be passed around,
# e.g. to a different function or a different script.
# You can also use `checkpoint.to_uri/from_uri` to
# read from/write to cloud storage.

# In another function or script, recover Checkpoint object from path
checkpoint = Checkpoint.from_directory(path)

# Convert into dictionary again
recovered_data = checkpoint.to_dict()

# It is guaranteed that the original data has been recovered
assert recovered_data == checkpoint_data
# __basic_checkpoint_end__
Binary file not shown (removed image, previously 22 KiB).
@@ -43,9 +43,8 @@ See the documentation on :ref:`Trainers <air-trainers>`.
     :start-after: __air_trainer_start__
     :end-before: __air_trainer_end__

-Trainer objects produce a :ref:`Result <air-results-ref>` object after calling ``.fit()``.
-These objects contain training metrics as well as checkpoints to retrieve the best model.
+Trainer objects produce a :ref:`Result <air-results-ref>` object after calling ``.fit()``. These objects contain training metrics as well as checkpoints to retrieve the best model.

 .. literalinclude:: doc_code/air_key_concepts.py
     :language: python
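The ``__air_trainer_start__`` snippet referenced above is not part of this diff; a minimal sketch of what the sentence describes, assuming the ``trainer`` from the XGBoost example elsewhere in this commit:

.. code-block:: python

    result = trainer.fit()

    # Training metrics reported during the run...
    print(result.metrics)

    # ...and the checkpoint for retrieving the trained model.
    print(result.checkpoint)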
@@ -65,11 +64,40 @@ Tuners can work seamlessly with any Trainer but also can support arbitrary train
     :start-after: __air_tuner_start__
     :end-before: __air_tuner_end__

+.. _air-checkpoints-doc:
+
+Checkpoints
+-----------
+
+AIR trainers, tuners, and custom pretrained models generate a framework-specific :class:`Checkpoint <ray.air.Checkpoint>` object.
+Checkpoints are a common interface for models that are used across different AIR components and libraries.
+
+There are two main ways to generate a checkpoint.
+
+Checkpoint objects can be retrieved from the Result object returned by a Trainer or Tuner ``.fit()`` call:
+
+.. literalinclude:: doc_code/air_key_concepts.py
+    :language: python
+    :start-after: __air_checkpoints_start__
+    :end-before: __air_checkpoints_end__
+
+You can also generate a checkpoint from a pretrained model. Each AIR-supported machine learning (ML) framework has
+a ``Checkpoint`` class that can be used to generate an AIR checkpoint:
+
+.. literalinclude:: doc_code/air_key_concepts.py
+    :language: python
+    :start-after: __checkpoint_adhoc_start__
+    :end-before: __checkpoint_adhoc_end__
+
+Checkpoints can be used to instantiate a :class:`Predictor`, :class:`BatchPredictor`, or :class:`PredictorDeployment` class,
+as seen below.
+
+
 Batch Predictor
 ---------------

-You can take a trained model and do batch inference using the BatchPredictor object.
+You can take a checkpoint and do batch inference using the BatchPredictor object.

 .. literalinclude:: doc_code/air_key_concepts.py
     :language: python
@@ -23,17 +23,6 @@ AIR User Guides
     :text: Using Preprocessors
     :classes: btn-link btn-block stretched-link

-
-    ---
-    :img-top: /ray-overview/images/ray_svg_logo.svg
-
-    +++
-    .. link-button:: /ray-air/checkpoints
-        :type: ref
-        :text: Using Checkpoints
-        :classes: btn-link btn-block stretched-link
-
-
     ---
     :img-top: /ray-overview/images/ray_svg_logo.svg
@@ -42,34 +42,20 @@ logger = logging.getLogger(__name__)
 class Checkpoint:
     """Ray AIR Checkpoint.

-    This implementation provides methods to translate between
-    different checkpoint storage locations: Local storage, external storage
-    (e.g. cloud storage), and data dict representations.
-
-    The constructor is a private API, instead the ``from_`` methods should
-    be used to create checkpoint objects
-    (e.g. ``Checkpoint.from_directory()``).
-
-    When converting between different checkpoint formats, it is guaranteed
-    that a full round trip of conversions (e.g. directory --> dict -->
-    obj ref --> directory) will recover the original checkpoint data.
-    There are no guarantees made about compatibility of intermediate
-    representations.
-
-    New data can be added to a Checkpoint during conversion. Consider the
-    following conversion: directory --> dict (adding dict["foo"] = "bar")
-    --> directory --> dict (expect to see dict["foo"] = "bar"). Note that
-    the second directory will contain pickle files with the serialized additional
-    field data in them.
-
-    Similarly with a dict as a source: dict --> directory (add file "foo.txt")
-    --> dict --> directory (will have "foo.txt" in it again). Note that the second
-    dict representation will contain an extra field with the serialized additional
-    files in it.
-
-    Examples:
-
-    Example for an arbitrary data checkpoint:
+    An AIR Checkpoint is a common interface for accessing models across
+    different AIR components and libraries. A Checkpoint can have its data
+    represented in one of three ways:
+
+    - as a directory on local (on-disk) storage
+    - as a directory on external storage (e.g., cloud storage)
+    - as an in-memory dictionary
+
+    The Checkpoint object also has methods to translate between different
+    checkpoint storage locations. These storage representations provide
+    flexibility in distributed environments, where you may want to recreate
+    an instance of the same model on multiple nodes or across different
+    Ray clusters.
+
+    Example:

     .. code-block:: python
@@ -81,12 +67,15 @@ class Checkpoint:
         # Create checkpoint object from data
         checkpoint = Checkpoint.from_dict(checkpoint_data)

-        # Save checkpoint to temporary location
+        # Save checkpoint to a directory on the file system.
         path = checkpoint.to_directory()

-        # This path can then be passed around, e.g. to a different function
+        # This path can then be passed around,
+        # e.g. to a different function or a different script.
+        # You can also use `checkpoint.to_uri/from_uri` to
+        # read from/write to cloud storage.

-        # At some other location, recover Checkpoint object from path
+        # In another function or script, recover Checkpoint object from path
         checkpoint = Checkpoint.from_directory(path)

         # Convert into dictionary again
@@ -95,39 +84,31 @@ class Checkpoint:
         # It is guaranteed that the original data has been recovered
         assert recovered_data == checkpoint_data

-    Example using MLflow for saving and loading a classifier:
-
-    .. code-block:: python
-
-        from ray.air.checkpoint import Checkpoint
-        from sklearn.ensemble import RandomForestClassifier
-        import mlflow.sklearn
-
-        # Create an sklearn classifier
-        clf = RandomForestClassifier(max_depth=7, random_state=0)
-        # ... e.g. train model with clf.fit()
-        # Save model using MLflow
-        mlflow.sklearn.save_model(clf, "model_directory")
-
-        # Create checkpoint object from path
-        checkpoint = Checkpoint.from_directory("model_directory")
-
-        # Convert into dictionary
-        checkpoint_dict = checkpoint.to_dict()
-
-        # This dict can then be passed around, e.g. to a different function
-
-        # At some other location, recover checkpoint object from dict
-        checkpoint = Checkpoint.from_dict(checkpoint_dict)
-
-        # Convert into a directory again
-        checkpoint.to_directory("other_directory")
-
-        # We can now use MLflow to re-load the model
-        clf = mlflow.sklearn.load_model("other_directory")
-
-        # It is guaranteed that the original data was recovered
-        assert isinstance(clf, RandomForestClassifier)
+    Checkpoints can be used to instantiate a :class:`Predictor`,
+    :class:`BatchPredictor`, or :class:`PredictorDeployment` class.
+
+    The constructor is a private API; instead, the ``from_`` methods should
+    be used to create checkpoint objects
+    (e.g. ``Checkpoint.from_directory()``).
+
+    *Other implementation notes:*
+
+    When converting between different checkpoint formats, it is guaranteed
+    that a full round trip of conversions (e.g. directory --> dict -->
+    obj ref --> directory) will recover the original checkpoint data.
+    There are no guarantees made about compatibility of intermediate
+    representations.
+
+    New data can be added to a Checkpoint
+    during conversion. Consider the following conversion:
+    directory --> dict (adding dict["foo"] = "bar")
+    --> directory --> dict (expect to see dict["foo"] = "bar"). Note that
+    the second directory will contain pickle files with the serialized
+    additional field data in them.
+
+    Similarly with a dict as a source: dict --> directory (add file "foo.txt")
+    --> dict --> directory (will have "foo.txt" in it again). Note that the
+    second dict representation will contain an extra field with the serialized
+    additional files in it.

     Checkpoints can be pickled and sent to remote processes.
     Please note that checkpoints pointing to local directories will be
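A minimal sketch of the round-trip and added-data guarantees described in the updated docstring (using only conversion methods that appear in this diff; pickling per the note above):

.. code-block:: python

    import pickle

    from ray.air.checkpoint import Checkpoint

    # Directory --> dict, adding a field along the way.
    path = Checkpoint.from_dict({"data": 123}).to_directory()
    ckpt_dict = Checkpoint.from_directory(path).to_dict()
    ckpt_dict["foo"] = "bar"

    # dict --> directory --> dict: the added field survives the round trip.
    path2 = Checkpoint.from_dict(ckpt_dict).to_directory()
    restored = Checkpoint.from_directory(path2).to_dict()
    assert restored["foo"] == "bar"
    assert restored["data"] == 123

    # Checkpoints can also be pickled and sent to remote processes.
    checkpoint = pickle.loads(pickle.dumps(Checkpoint.from_dict({"data": 123})))
    assert checkpoint.to_dict()["data"] == 123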