[air] Remove checkpoint user guide and update key concepts and docstring (#27455)
This commit is contained in:
parent: 8d5c07b781
commit: b2cd34cc5c

8 changed files with 125 additions and 308 deletions
doc/source/_toc.yml
@@ -15,7 +15,6 @@ parts:
   - file: ray-air/user-guides
     sections:
     - file: ray-air/preprocessors
-    - file: ray-air/checkpoints
     - file: ray-air/check-ingest
     - file: ray-air/trainer
     - file: ray-air/tuner
doc/source/ray-air/checkpoints.rst (deleted)
@@ -1,89 +0,0 @@
-.. _air-checkpoints-doc:
-
-Using Checkpoints
-=================
-
-The AIR trainers, tuners, and custom pretrained models generate Checkpoints. An AIR Checkpoint is a common format for models that
-are used across different components of the Ray AI Runtime. This common format allows easy interoperability among AIR components
-and seamless integration with supported external machine learning frameworks.
-
-.. image:: images/checkpoints.jpg
-
-What is a checkpoint?
----------------------
-
-A Checkpoint object is a serializable reference to a model. A model can be represented in one of three ways:
-
-- as a directory on local (on-disk) storage
-- as a directory on external storage (e.g., cloud storage)
-- as an in-memory dictionary
-
-Because of these different storage representations, Checkpoint objects provide useful flexibility in
-distributed environments, where you want to recreate an instance of the same model on multiple nodes or
-across different Ray clusters.
-
-How to create a checkpoint?
----------------------------
-
-There are two ways to generate a checkpoint.
-
-The first way is to generate it from a pretrained model. Each AIR-supported machine learning (ML) framework has
-a ``Checkpoint`` method that can be used to generate an AIR checkpoint:
-
-.. literalinclude:: doc_code/checkpoint_usage.py
-    :language: python
-    :start-after: __checkpoint_quick_start__
-    :end-before: __checkpoint_quick_end__
-
-Another way is to retrieve it from the Result object returned by a Trainer or Tuner.
-
-.. literalinclude:: doc_code/checkpoint_usage.py
-    :language: python
-    :start-after: __use_trainer_checkpoint_start__
-    :end-before: __use_trainer_checkpoint_end__
-
-How to use a checkpoint?
-------------------------
-
-Checkpoints can be used to instantiate a :class:`Predictor`, :class:`BatchPredictor`, or :class:`PredictorDeployment` class.
-An instance of this instantiated class (in memory) can be used for inference.
-
-For instance, the code example below shows how a checkpoint in the :class:`BatchPredictor` is used for scalable batch inference:
-
-.. literalinclude:: doc_code/checkpoint_usage.py
-    :language: python
-    :start-after: __batch_pred_start__
-    :end-before: __batch_pred_end__
-
-The next example demonstrates how to use a checkpoint for online inference via :class:`PredictorDeployment`:
-
-.. literalinclude:: doc_code/checkpoint_usage.py
-    :language: python
-    :start-after: __online_inference_start__
-    :end-before: __online_inference_end__
-
-Furthermore, a Checkpoint object has methods to translate between different checkpoint storage locations.
-With this flexibility, Checkpoint objects can be serialized and used in different contexts
-(e.g., on a different process or a different machine):
-
-.. literalinclude:: doc_code/checkpoint_usage.py
-    :language: python
-    :start-after: __basic_checkpoint_start__
-    :end-before: __basic_checkpoint_end__
-
-Example: Using Checkpoints with MLflow
---------------------------------------
-
-`MLflow <https://mlflow.org/>`__ has its own `checkpoint format <https://www.mlflow.org/docs/latest/models.html>`__ called
-the "MLflow Model." It is a standard format for packaging machine learning models that can be used in a variety of downstream tools.
-
-Below is an example of using an MLflow model as a Ray AIR Checkpoint.
-
-.. literalinclude:: doc_code/checkpoint_mlflow.py
-    :language: python
-    :start-after: __mlflow_checkpoint_start__
-    :end-before: __mlflow_checkpoint_end__
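The referenced ``doc_code/checkpoint_mlflow.py`` file is not shown in this diff. A minimal sketch of the MLflow interop it points to, mirroring the MLflow example removed from the ``Checkpoint`` docstring later in this commit (the ``model_directory`` paths and classifier settings are illustrative):

    from ray.air.checkpoint import Checkpoint
    from sklearn.ensemble import RandomForestClassifier
    import mlflow.sklearn

    # Train and save a classifier in the MLflow Model format.
    clf = RandomForestClassifier(max_depth=7, random_state=0)
    # ... e.g. train the model with clf.fit(X, y) ...
    mlflow.sklearn.save_model(clf, "model_directory")

    # Wrap the MLflow model directory as an AIR Checkpoint.
    checkpoint = Checkpoint.from_directory("model_directory")

    # The checkpoint can now be passed around like any other AIR Checkpoint,
    # and the MLflow model can be re-loaded from any directory representation.
    checkpoint.to_directory("other_directory")
    clf = mlflow.sklearn.load_model("other_directory")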
doc/source/ray-air/doc_code/air_key_concepts.py
@@ -65,6 +65,37 @@ best_result = result_grid.get_best_result()
 print(best_result)
 # __air_tuner_end__

+# __air_checkpoints_start__
+checkpoint = result.checkpoint
+print(checkpoint)
+# Checkpoint(local_path=..../checkpoint_000005)
+
+tuned_checkpoint = result_grid.get_best_result().checkpoint
+print(tuned_checkpoint)
+# Checkpoint(local_path=..../checkpoint_000005)
+# __air_checkpoints_end__
+
+# __checkpoint_adhoc_start__
+from ray.train.tensorflow import TensorflowCheckpoint
+import tensorflow as tf
+
+
+# This can be a trained model.
+def build_model() -> tf.keras.Model:
+    model = tf.keras.Sequential(
+        [
+            tf.keras.layers.InputLayer(input_shape=(1,)),
+            tf.keras.layers.Dense(1),
+        ]
+    )
+    return model
+
+
+model = build_model()
+
+checkpoint = TensorflowCheckpoint.from_model(model)
+# __checkpoint_adhoc_end__
+
+
 # __air_batch_predictor_start__
 from ray.train.batch_predictor import BatchPredictor
 from ray.train.xgboost import XGBoostPredictor
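The ``__checkpoint_adhoc__`` block above creates a framework-specific checkpoint but does not use it. A minimal sketch of feeding such a checkpoint to a predictor, assuming the Ray 2.x AIR API (``model_definition`` tells ``TensorflowPredictor`` how to rebuild the Keras model; the input batch is illustrative):

    import numpy as np

    from ray.train.tensorflow import TensorflowPredictor

    # Rebuild a predictor from the framework-specific checkpoint above.
    predictor = TensorflowPredictor.from_checkpoint(
        checkpoint, model_definition=build_model
    )

    # Run inference on a small batch of data.
    predictions = predictor.predict(np.array([[1.0], [2.0]]))
    print(predictions)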
doc/source/ray-air/doc_code/checkpoint_usage.py (deleted)
@@ -1,122 +0,0 @@
-# flake8: noqa
-# isort: skip_file
-
-# __checkpoint_quick_start__
-from ray.train.tensorflow import TensorflowCheckpoint
-import tensorflow as tf
-
-
-# This can be a trained model.
-def build_model() -> tf.keras.Model:
-    model = tf.keras.Sequential(
-        [
-            tf.keras.layers.InputLayer(input_shape=(1,)),
-            tf.keras.layers.Dense(1),
-        ]
-    )
-    return model
-
-
-model = build_model()
-
-checkpoint = TensorflowCheckpoint.from_model(model)
-# __checkpoint_quick_end__
-
-
-# __use_trainer_checkpoint_start__
-import ray
-from ray.train.xgboost import XGBoostTrainer
-from ray.air.config import ScalingConfig
-
-
-dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
-
-# Split data into train and validation.
-train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)
-
-trainer = XGBoostTrainer(
-    scaling_config=ScalingConfig(num_workers=2),
-    label_column="target",
-    params={
-        "objective": "binary:logistic",
-        "eval_metric": ["logloss", "error"],
-    },
-    datasets={"train": train_dataset},
-    num_boost_round=5,
-)
-
-result = trainer.fit()
-checkpoint = result.checkpoint
-# __use_trainer_checkpoint_end__
-
-# __batch_pred_start__
-from ray.train.batch_predictor import BatchPredictor
-from ray.train.xgboost import XGBoostPredictor
-
-# Create a test dataset by dropping the target column.
-test_dataset = valid_dataset.drop_columns(["target"])
-
-batch_predictor = BatchPredictor.from_checkpoint(checkpoint, XGBoostPredictor)
-
-# Bulk batch prediction.
-batch_predictor.predict(test_dataset)
-# __batch_pred_end__
-
-
-# __online_inference_start__
-import requests
-from fastapi import Request
-import pandas as pd
-
-from ray import serve
-from ray.serve import PredictorDeployment
-from ray.serve.http_adapters import json_request
-
-
-async def adapter(request: Request):
-    content = await request.json()
-    print(content)
-    return pd.DataFrame.from_dict(content)
-
-
-serve.start(detached=True)
-deployment = PredictorDeployment.options(name="XGBoostService")
-
-deployment.deploy(
-    XGBoostPredictor, checkpoint, batching_params=False, http_adapter=adapter
-)
-
-print(deployment.url)
-
-sample_input = test_dataset.take(1)
-sample_input = dict(sample_input[0])
-
-output = requests.post(deployment.url, json=[sample_input]).json()
-print(output)
-# __online_inference_end__
-
-# __basic_checkpoint_start__
-from ray.air.checkpoint import Checkpoint
-
-# Create checkpoint data dict
-checkpoint_data = {"data": 123}
-
-# Create checkpoint object from data
-checkpoint = Checkpoint.from_dict(checkpoint_data)
-
-# Save checkpoint to a directory on the file system.
-path = checkpoint.to_directory()
-
-# This path can then be passed around,
-# e.g. to a different function or a different script.
-# You can also use `checkpoint.to_uri/from_uri` to
-# read from/write to cloud storage.
-
-# In another function or script, recover Checkpoint object from path
-checkpoint = Checkpoint.from_directory(path)
-
-# Convert into dictionary again
-recovered_data = checkpoint.to_dict()
-
-# It is guaranteed that the original data has been recovered
-assert recovered_data == checkpoint_data
-# __basic_checkpoint_end__
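The ``__basic_checkpoint__`` snippet mentions ``to_uri``/``from_uri`` in a comment but only exercises the directory path. A minimal sketch of the URI round trip, assuming a reachable object store (the bucket name is hypothetical):

    from ray.air.checkpoint import Checkpoint

    checkpoint = Checkpoint.from_dict({"data": 123})

    # Upload the checkpoint; the URI can be shared across processes or clusters.
    uri = checkpoint.to_uri("s3://my-bucket/checkpoints/demo")

    # Elsewhere, restore the checkpoint from the URI.
    restored = Checkpoint.from_uri(uri)
    assert restored.to_dict()["data"] == 123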
doc/source/ray-air/images/checkpoints.jpg (binary)
Binary file not shown. (Before: 22 KiB)
doc/source/ray-air/key-concepts.rst
@@ -43,9 +43,8 @@ See the documentation on :ref:`Trainers <air-trainers>`.
     :start-after: __air_trainer_start__
     :end-before: __air_trainer_end__

-Trainer objects will produce a :ref:`Result <air-results-ref>` object after calling ``.fit()``. These objects will contain training metrics as long as checkpoints to retrieve the best model.
+Trainer objects produce a :ref:`Result <air-results-ref>` object after calling ``.fit()``.
+These objects contain training metrics as well as checkpoints to retrieve the best model.

 .. literalinclude:: doc_code/air_key_concepts.py
     :language: python
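For orientation, a minimal sketch of what a ``Result`` carries (the metric names and values are illustrative and depend on the trainer):

    result = trainer.fit()

    # Training metrics reported during the run ...
    print(result.metrics)     # e.g. {"train-logloss": 0.48, ...}

    # ... and the checkpoint holding the trained model.
    print(result.checkpoint)  # Checkpoint(local_path=...)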
@@ -65,11 +64,40 @@ Tuners can work seamlessly with any Trainer but also can support arbitrary trai
     :start-after: __air_tuner_start__
     :end-before: __air_tuner_end__

+.. _air-checkpoints-doc:
+
+Checkpoints
+-----------
+
+AIR trainers, tuners, and custom pretrained models generate a framework-specific :class:`Checkpoint <ray.air.Checkpoint>` object.
+Checkpoints are a common interface for models that are used across different AIR components and libraries.
+
+There are two main ways to generate a checkpoint.
+
+Checkpoint objects can be retrieved from the Result object returned by a Trainer or Tuner ``.fit()`` call.
+
+.. literalinclude:: doc_code/air_key_concepts.py
+    :language: python
+    :start-after: __air_checkpoints_start__
+    :end-before: __air_checkpoints_end__
+
+You can also generate a checkpoint from a pretrained model. Each AIR-supported machine learning (ML) framework has
+a ``Checkpoint`` object that can be used to generate an AIR checkpoint:
+
+.. literalinclude:: doc_code/air_key_concepts.py
+    :language: python
+    :start-after: __checkpoint_adhoc_start__
+    :end-before: __checkpoint_adhoc_end__
+
+Checkpoints can be used to instantiate a :class:`Predictor`, :class:`BatchPredictor`, or :class:`PredictorDeployment` class,
+as seen below.
+
 Batch Predictor
 ---------------

-You can take a trained model and do batch inference using the BatchPredictor object.
+You can take a checkpoint and do batch inference using the BatchPredictor object.

 .. literalinclude:: doc_code/air_key_concepts.py
     :language: python
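The ``__air_checkpoints__`` block above retrieves checkpoints from results; for completeness, a sketch of the Tuner path it relies on, assuming the Ray 2.x ``Tuner`` API (the search space and metric name are illustrative):

    from ray import tune
    from ray.tune import Tuner
    from ray.tune.tune_config import TuneConfig

    # `trainer` is the XGBoostTrainer constructed earlier.
    tuner = Tuner(
        trainer,
        param_space={"params": {"max_depth": tune.randint(2, 8)}},
        tune_config=TuneConfig(num_samples=4, metric="train-logloss", mode="min"),
    )
    result_grid = tuner.fit()

    # The best trial's checkpoint, ready for a BatchPredictor or deployment.
    best_checkpoint = result_grid.get_best_result().checkpoint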
doc/source/ray-air/user-guides.rst
@@ -23,17 +23,6 @@ AIR User Guides
     :text: Using Preprocessors
     :classes: btn-link btn-block stretched-link

-    ---
-    :img-top: /ray-overview/images/ray_svg_logo.svg
-
-    +++
-    .. link-button:: /ray-air/checkpoints
-        :type: ref
-        :text: Using Checkpoints
-        :classes: btn-link btn-block stretched-link
-
     ---
     :img-top: /ray-overview/images/ray_svg_logo.svg
python/ray/air/checkpoint.py
@@ -42,34 +42,20 @@ logger = logging.getLogger(__name__)
 class Checkpoint:
     """Ray AIR Checkpoint.

-    This implementation provides methods to translate between
-    different checkpoint storage locations: Local storage, external storage
-    (e.g. cloud storage), and data dict representations.
+    An AIR Checkpoint is a common interface for accessing models across
+    different AIR components and libraries. A Checkpoint can have its data
+    represented in one of three ways:

-    The constructor is a private API, instead the ``from_`` methods should
-    be used to create checkpoint objects
-    (e.g. ``Checkpoint.from_directory()``).
+    - as a directory on local (on-disk) storage
+    - as a directory on external storage (e.g., cloud storage)
+    - as an in-memory dictionary

-    When converting between different checkpoint formats, it is guaranteed
-    that a full round trip of conversions (e.g. directory --> dict -->
-    obj ref --> directory) will recover the original checkpoint data.
-    There are no guarantees made about compatibility of intermediate
-    representations.
+    The Checkpoint object also has methods to translate between different checkpoint
+    storage locations. These storage representations provide flexibility in
+    distributed environments, where you may want to recreate an instance of
+    the same model on multiple nodes or across different Ray clusters.

-    New data can be added to a Checkpoint during conversion. Consider the
-    following conversion: directory --> dict (adding dict["foo"] = "bar")
-    --> directory --> dict (expect to see dict["foo"] = "bar"). Note that
-    the second directory will contain pickle files with the serialized additional
-    field data in them.
-
-    Similarly with a dict as a source: dict --> directory (add file "foo.txt")
-    --> dict --> directory (will have "foo.txt" in it again). Note that the second
-    dict representation will contain an extra field with the serialized additional
-    files in it.
-
-    Examples:
-
-    Example for an arbitrary data checkpoint:
+    Example:

     .. code-block:: python

@@ -81,12 +67,15 @@ class Checkpoint:
         # Create checkpoint object from data
         checkpoint = Checkpoint.from_dict(checkpoint_data)

-        # Save checkpoint to temporary location
+        # Save checkpoint to a directory on the file system.
         path = checkpoint.to_directory()

-        # This path can then be passed around, e.g. to a different function
+        # This path can then be passed around,
+        # e.g. to a different function or a different script.
+        # You can also use `checkpoint.to_uri/from_uri` to
+        # read from/write to cloud storage

-        # At some other location, recover Checkpoint object from path
+        # In another function or script, recover Checkpoint object from path
         checkpoint = Checkpoint.from_directory(path)

         # Convert into dictionary again

@@ -95,39 +84,31 @@ class Checkpoint:
         # It is guaranteed that the original data has been recovered
         assert recovered_data == checkpoint_data

-    Example using MLflow for saving and loading a classifier:
+    Checkpoints can be used to instantiate a :class:`Predictor`,
+    :class:`BatchPredictor`, or :class:`PredictorDeployment` class.

-    .. code-block:: python
+    The constructor is a private API, instead the ``from_`` methods should
+    be used to create checkpoint objects
+    (e.g. ``Checkpoint.from_directory()``).

-        from ray.air.checkpoint import Checkpoint
-        from sklearn.ensemble import RandomForestClassifier
-        import mlflow.sklearn
+    *Other implementation notes:*

+    When converting between different checkpoint formats, it is guaranteed
+    that a full round trip of conversions (e.g. directory --> dict -->
+    obj ref --> directory) will recover the original checkpoint data.
+    There are no guarantees made about compatibility of intermediate
+    representations.

-        # Create an sklearn classifier
-        clf = RandomForestClassifier(max_depth=7, random_state=0)
-        # ... e.g. train model with clf.fit()
-        # Save model using MLflow
-        mlflow.sklearn.save_model(clf, "model_directory")
+    New data can be added to a Checkpoint
+    during conversion. Consider the following conversion:
+    directory --> dict (adding dict["foo"] = "bar")
+    --> directory --> dict (expect to see dict["foo"] = "bar"). Note that
+    the second directory will contain pickle files with the serialized additional
+    field data in them.

-        # Create checkpoint object from path
-        checkpoint = Checkpoint.from_directory("model_directory")
-
-        # Convert into dictionary
-        checkpoint_dict = checkpoint.to_dict()
-
-        # This dict can then be passed around, e.g. to a different function
-
-        # At some other location, recover checkpoint object from dict
-        checkpoint = Checkpoint.from_dict(checkpoint_dict)
-
-        # Convert into a directory again
-        checkpoint.to_directory("other_directory")
-
-        # We can now use MLflow to re-load the model
-        clf = mlflow.sklearn.load_model("other_directory")
-
-        # It is guaranteed that the original data was recovered
-        assert isinstance(clf, RandomForestClassifier)
+    Similarly with a dict as a source: dict --> directory (add file "foo.txt")
+    --> dict --> directory (will have "foo.txt" in it again). Note that the second
+    dict representation will contain an extra field with the serialized additional
+    files in it.

     Checkpoints can be pickled and sent to remote processes.
     Please note that checkpoints pointing to local directories will be
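A minimal sketch of the round-trip and data-addition guarantees described in the docstring above, using only the ``from_dict``/``to_directory`` conversions shown in this diff (the ``"foo"`` field mirrors the docstring's example):

    from ray.air.checkpoint import Checkpoint

    # directory --> dict, adding a field along the way.
    checkpoint = Checkpoint.from_dict({"data": 123})
    path = checkpoint.to_directory()
    ckpt_dict = Checkpoint.from_directory(path).to_dict()
    ckpt_dict["foo"] = "bar"

    # dict --> directory --> dict: both the original data and the
    # added field survive the full round trip.
    path2 = Checkpoint.from_dict(ckpt_dict).to_directory()
    recovered = Checkpoint.from_directory(path2).to_dict()
    assert recovered["data"] == 123
    assert recovered["foo"] == "bar"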