[air] Remove checkpoint user guide and update key concepts and docstring (#27455)
This commit is contained in:
parent 8d5c07b781
commit b2cd34cc5c
8 changed files with 125 additions and 308 deletions
@@ -15,7 +15,6 @@ parts:
   - file: ray-air/user-guides
     sections:
       - file: ray-air/preprocessors
-      - file: ray-air/checkpoints
       - file: ray-air/check-ingest
       - file: ray-air/trainer
       - file: ray-air/tuner
@@ -1,89 +0,0 @@
.. _air-checkpoints-doc:

Using Checkpoints
=================

AIR trainers, tuners, and custom pretrained models generate Checkpoints. An AIR Checkpoint is a common format for models that are used across different components of the Ray AI Runtime. This common format allows easy interoperability among AIR components and seamless integration with supported external machine learning frameworks.

.. image:: images/checkpoints.jpg

What is a checkpoint?
---------------------

A Checkpoint object is a serializable reference to a model. A model can be represented in one of three ways:

- as a directory on local (on-disk) storage
- as a directory on external storage (e.g., cloud storage)
- as an in-memory dictionary

Because of these different storage representations, Checkpoints provide useful flexibility in distributed environments, where you want to recreate an instance of the same model on multiple nodes or across different Ray clusters.
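A minimal sketch of the three representations and the conversions between them (the ``s3://my-bucket/ckpt`` URI is a hypothetical placeholder, not part of this commit):

.. code-block:: python

    from ray.air.checkpoint import Checkpoint

    # In-memory dictionary representation.
    checkpoint = Checkpoint.from_dict({"data": 123})

    # Local (on-disk) directory representation.
    path = checkpoint.to_directory()
    checkpoint = Checkpoint.from_directory(path)

    # External storage representation (bucket name is hypothetical).
    uri = checkpoint.to_uri("s3://my-bucket/ckpt")
    checkpoint = Checkpoint.from_uri(uri)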
How to create a checkpoint?
---------------------------

There are two ways to generate a checkpoint.

The first way is to generate it from a pretrained model. Each AIR-supported machine learning (ML) framework has a ``Checkpoint`` class that can be used to generate an AIR checkpoint:

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __checkpoint_quick_start__
    :end-before: __checkpoint_quick_end__

The second way is to retrieve it from the result object returned by a Trainer or Tuner:

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __use_trainer_checkpoint_start__
    :end-before: __use_trainer_checkpoint_end__

How to use a checkpoint?
------------------------

Checkpoints can be used to instantiate a :class:`Predictor`, :class:`BatchPredictor`, or :class:`PredictorDeployment` class. An instance of the instantiated class can then be used for inference.

For instance, the code example below shows how a checkpoint is used with a :class:`BatchPredictor` for scalable batch inference:

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __batch_pred_start__
    :end-before: __batch_pred_end__

The next example demonstrates how to use a checkpoint for online inference via :class:`PredictorDeployment`:

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __online_inference_start__
    :end-before: __online_inference_end__

Furthermore, a Checkpoint object has methods to translate between different checkpoint storage locations. With this flexibility, Checkpoint objects can be serialized and used in different contexts (e.g., on a different process or a different machine):

.. literalinclude:: doc_code/checkpoint_usage.py
    :language: python
    :start-after: __basic_checkpoint_start__
    :end-before: __basic_checkpoint_end__

Example: Using Checkpoints with MLflow
--------------------------------------

`MLflow <https://mlflow.org/>`__ has its own `checkpoint format <https://www.mlflow.org/docs/latest/models.html>`__ called the "MLflow Model," a standard format for packaging machine learning models that can be used in a variety of downstream tools.

Below is an example of using an MLflow model as a Ray AIR Checkpoint:

.. literalinclude:: doc_code/checkpoint_mlflow.py
    :language: python
    :start-after: __mlflow_checkpoint_start__
    :end-before: __mlflow_checkpoint_end__
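The included ``doc_code/checkpoint_mlflow.py`` snippet is not part of this diff; judging from the MLflow example removed from the ``Checkpoint`` docstring further below, it looks roughly like this:

.. code-block:: python

    from ray.air.checkpoint import Checkpoint
    from sklearn.ensemble import RandomForestClassifier
    import mlflow.sklearn

    # Create an sklearn classifier (in practice, train it with clf.fit()).
    clf = RandomForestClassifier(max_depth=7, random_state=0)

    # Save the model in MLflow Model format.
    mlflow.sklearn.save_model(clf, "model_directory")

    # Wrap the MLflow model directory as an AIR Checkpoint.
    checkpoint = Checkpoint.from_directory("model_directory")

    # Materialize the checkpoint elsewhere and re-load the model with MLflow.
    checkpoint.to_directory("other_directory")
    clf = mlflow.sklearn.load_model("other_directory")
    assert isinstance(clf, RandomForestClassifier)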
@@ -65,6 +65,37 @@ best_result = result_grid.get_best_result()
 print(best_result)
 # __air_tuner_end__

+# __air_checkpoints_start__
+checkpoint = result.checkpoint
+print(checkpoint)
+# Checkpoint(local_path=..../checkpoint_000005)
+
+tuned_checkpoint = result_grid.get_best_result().checkpoint
+print(tuned_checkpoint)
+# Checkpoint(local_path=..../checkpoint_000005)
+# __air_checkpoints_end__
+
+
+# __checkpoint_adhoc_start__
+from ray.train.tensorflow import TensorflowCheckpoint
+import tensorflow as tf
+
+
+# This can be a trained model.
+def build_model() -> tf.keras.Model:
+    model = tf.keras.Sequential(
+        [
+            tf.keras.layers.InputLayer(input_shape=(1,)),
+            tf.keras.layers.Dense(1),
+        ]
+    )
+    return model
+
+
+model = build_model()
+
+checkpoint = TensorflowCheckpoint.from_model(model)
+# __checkpoint_adhoc_end__
+
+
 # __air_batch_predictor_start__
 from ray.train.batch_predictor import BatchPredictor
 from ray.train.xgboost import XGBoostPredictor
@@ -1,122 +0,0 @@
# flake8: noqa
# isort: skip_file

# __checkpoint_quick_start__
from ray.train.tensorflow import TensorflowCheckpoint
import tensorflow as tf


# This can be a trained model.
def build_model() -> tf.keras.Model:
    model = tf.keras.Sequential(
        [
            tf.keras.layers.InputLayer(input_shape=(1,)),
            tf.keras.layers.Dense(1),
        ]
    )
    return model


model = build_model()

checkpoint = TensorflowCheckpoint.from_model(model)
# __checkpoint_quick_end__


# __use_trainer_checkpoint_start__
import ray
from ray.train.xgboost import XGBoostTrainer
from ray.air.config import ScalingConfig


dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")

# Split data into train and validation.
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=2),
    label_column="target",
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset},
    num_boost_round=5,
)

result = trainer.fit()
checkpoint = result.checkpoint
# __use_trainer_checkpoint_end__

# __batch_pred_start__
from ray.train.batch_predictor import BatchPredictor
from ray.train.xgboost import XGBoostPredictor

# Create a test dataset by dropping the target column.
test_dataset = valid_dataset.drop_columns(["target"])

batch_predictor = BatchPredictor.from_checkpoint(checkpoint, XGBoostPredictor)

# Bulk batch prediction.
batch_predictor.predict(test_dataset)
# __batch_pred_end__


# __online_inference_start__
import requests
from fastapi import Request
import pandas as pd

from ray import serve
from ray.serve import PredictorDeployment
from ray.serve.http_adapters import json_request


async def adapter(request: Request):
    content = await request.json()
    print(content)
    return pd.DataFrame.from_dict(content)


serve.start(detached=True)
deployment = PredictorDeployment.options(name="XGBoostService")

deployment.deploy(
    XGBoostPredictor, checkpoint, batching_params=False, http_adapter=adapter
)

print(deployment.url)

sample_input = test_dataset.take(1)
sample_input = dict(sample_input[0])

output = requests.post(deployment.url, json=[sample_input]).json()
print(output)
# __online_inference_end__

# __basic_checkpoint_start__
from ray.air.checkpoint import Checkpoint

# Create checkpoint data dict
checkpoint_data = {"data": 123}

# Create checkpoint object from data
checkpoint = Checkpoint.from_dict(checkpoint_data)

# Save checkpoint to a directory on the file system.
path = checkpoint.to_directory()

# This path can then be passed around,
# e.g. to a different function or a different script.
# You can also use `checkpoint.to_uri/from_uri` to
# read from/write to cloud storage.

# In another function or script, recover Checkpoint object from path
checkpoint = Checkpoint.from_directory(path)

# Convert into dictionary again
recovered_data = checkpoint.to_dict()

# It is guaranteed that the original data has been recovered
assert recovered_data == checkpoint_data
# __basic_checkpoint_end__
Binary file not shown (removed image, previously 22 KiB).
@@ -43,9 +43,8 @@ See the documentation on :ref:`Trainers <air-trainers>`.
     :start-after: __air_trainer_start__
     :end-before: __air_trainer_end__

-Trainer objects produce a :ref:`Result <air-results-ref>` object after calling ``.fit()``.
-These objects contain training metrics as well as checkpoints to retrieve the best model.
+Trainer objects produce a :ref:`Result <air-results-ref>` object after calling ``.fit()``. These objects contain training metrics as well as checkpoints to retrieve the best model.

 .. literalinclude:: doc_code/air_key_concepts.py
     :language: python
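The ``__air_trainer_start__`` snippet referenced above is not part of this diff; a minimal sketch of what the sentence describes, assuming the ``trainer`` from the XGBoost example elsewhere in this commit:

.. code-block:: python

    result = trainer.fit()

    # Training metrics reported during the run...
    print(result.metrics)

    # ...and the checkpoint for retrieving the trained model.
    print(result.checkpoint)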
@@ -65,11 +64,40 @@ Tuners can work seamlessly with any Trainer but also can support arbitrary train
     :start-after: __air_tuner_start__
     :end-before: __air_tuner_end__

+.. _air-checkpoints-doc:
+
+Checkpoints
+-----------
+
+AIR trainers, tuners, and custom pretrained models generate a framework-specific :class:`Checkpoint <ray.air.Checkpoint>` object.
+Checkpoints are a common interface for models that are used across different AIR components and libraries.
+
+There are two main ways to generate a checkpoint.
+
+Checkpoint objects can be retrieved from the Result object returned by a Trainer or Tuner ``.fit()`` call:
+
+.. literalinclude:: doc_code/air_key_concepts.py
+    :language: python
+    :start-after: __air_checkpoints_start__
+    :end-before: __air_checkpoints_end__
+
+You can also generate a checkpoint from a pretrained model. Each AIR-supported machine learning (ML) framework has
+a ``Checkpoint`` class that can be used to generate an AIR checkpoint:
+
+.. literalinclude:: doc_code/air_key_concepts.py
+    :language: python
+    :start-after: __checkpoint_adhoc_start__
+    :end-before: __checkpoint_adhoc_end__
+
+Checkpoints can be used to instantiate a :class:`Predictor`, :class:`BatchPredictor`, or :class:`PredictorDeployment` class,
+as seen below.
+
+
 Batch Predictor
 ---------------

-You can take a trained model and do batch inference using the BatchPredictor object.
+You can take a checkpoint and do batch inference using the BatchPredictor object.

 .. literalinclude:: doc_code/air_key_concepts.py
     :language: python
@@ -23,17 +23,6 @@ AIR User Guides
     :text: Using Preprocessors
     :classes: btn-link btn-block stretched-link

-
-    ---
-    :img-top: /ray-overview/images/ray_svg_logo.svg
-
-    +++
-    .. link-button:: /ray-air/checkpoints
-        :type: ref
-        :text: Using Checkpoints
-        :classes: btn-link btn-block stretched-link
-
-
     ---
     :img-top: /ray-overview/images/ray_svg_logo.svg
@@ -42,34 +42,20 @@ logger = logging.getLogger(__name__)
 class Checkpoint:
     """Ray AIR Checkpoint.

-    This implementation provides methods to translate between
-    different checkpoint storage locations: Local storage, external storage
-    (e.g. cloud storage), and data dict representations.
-
-    The constructor is a private API, instead the ``from_`` methods should
-    be used to create checkpoint objects
-    (e.g. ``Checkpoint.from_directory()``).
-
-    When converting between different checkpoint formats, it is guaranteed
-    that a full round trip of conversions (e.g. directory --> dict -->
-    obj ref --> directory) will recover the original checkpoint data.
-    There are no guarantees made about compatibility of intermediate
-    representations.
-
-    New data can be added to a Checkpoint during conversion. Consider the
-    following conversion: directory --> dict (adding dict["foo"] = "bar")
-    --> directory --> dict (expect to see dict["foo"] = "bar"). Note that
-    the second directory will contain pickle files with the serialized additional
-    field data in them.
-
-    Similarly with a dict as a source: dict --> directory (add file "foo.txt")
-    --> dict --> directory (will have "foo.txt" in it again). Note that the second
-    dict representation will contain an extra field with the serialized additional
-    files in it.
-
-    Examples:
-
-    Example for an arbitrary data checkpoint:
+    An AIR Checkpoint is a common interface for accessing models across
+    different AIR components and libraries. A Checkpoint can have its data
+    represented in one of three ways:
+
+    - as a directory on local (on-disk) storage
+    - as a directory on external storage (e.g., cloud storage)
+    - as an in-memory dictionary
+
+    The Checkpoint object also has methods to translate between different
+    checkpoint storage locations. These storage representations provide
+    flexibility in distributed environments, where you may want to recreate
+    an instance of the same model on multiple nodes or across different
+    Ray clusters.
+
+    Example:

     .. code-block:: python
@@ -81,12 +67,15 @@ class Checkpoint:
         # Create checkpoint object from data
         checkpoint = Checkpoint.from_dict(checkpoint_data)

-        # Save checkpoint to temporary location
+        # Save checkpoint to a directory on the file system.
         path = checkpoint.to_directory()

-        # This path can then be passed around, e.g. to a different function
+        # This path can then be passed around,
+        # e.g. to a different function or a different script.
+        # You can also use `checkpoint.to_uri/from_uri` to
+        # read from/write to cloud storage.

-        # At some other location, recover Checkpoint object from path
+        # In another function or script, recover Checkpoint object from path
         checkpoint = Checkpoint.from_directory(path)

         # Convert into dictionary again
@@ -95,39 +84,31 @@ class Checkpoint:
         # It is guaranteed that the original data has been recovered
         assert recovered_data == checkpoint_data

-    Example using MLflow for saving and loading a classifier:
-
-    .. code-block:: python
-
-        from ray.air.checkpoint import Checkpoint
-        from sklearn.ensemble import RandomForestClassifier
-        import mlflow.sklearn
-
-        # Create an sklearn classifier
-        clf = RandomForestClassifier(max_depth=7, random_state=0)
-        # ... e.g. train model with clf.fit()
-        # Save model using MLflow
-        mlflow.sklearn.save_model(clf, "model_directory")
-
-        # Create checkpoint object from path
-        checkpoint = Checkpoint.from_directory("model_directory")
-
-        # Convert into dictionary
-        checkpoint_dict = checkpoint.to_dict()
-
-        # This dict can then be passed around, e.g. to a different function
-
-        # At some other location, recover checkpoint object from dict
-        checkpoint = Checkpoint.from_dict(checkpoint_dict)
-
-        # Convert into a directory again
-        checkpoint.to_directory("other_directory")
-
-        # We can now use MLflow to re-load the model
-        clf = mlflow.sklearn.load_model("other_directory")
-
-        # It is guaranteed that the original data was recovered
-        assert isinstance(clf, RandomForestClassifier)
+    Checkpoints can be used to instantiate a :class:`Predictor`,
+    :class:`BatchPredictor`, or :class:`PredictorDeployment` class.
+
+    The constructor is a private API; instead, the ``from_`` methods should
+    be used to create checkpoint objects
+    (e.g. ``Checkpoint.from_directory()``).
+
+    *Other implementation notes:*
+
+    When converting between different checkpoint formats, it is guaranteed
+    that a full round trip of conversions (e.g. directory --> dict -->
+    obj ref --> directory) will recover the original checkpoint data.
+    There are no guarantees made about compatibility of intermediate
+    representations.
+
+    New data can be added to a Checkpoint
+    during conversion. Consider the following conversion:
+    directory --> dict (adding dict["foo"] = "bar")
+    --> directory --> dict (expect to see dict["foo"] = "bar"). Note that
+    the second directory will contain pickle files with the serialized
+    additional field data in them.
+
+    Similarly with a dict as a source: dict --> directory (add file "foo.txt")
+    --> dict --> directory (will have "foo.txt" in it again). Note that the
+    second dict representation will contain an extra field with the serialized
+    additional files in it.

     Checkpoints can be pickled and sent to remote processes.
     Please note that checkpoints pointing to local directories will be
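A minimal sketch of the round-trip and added-data guarantees described in the updated docstring (using only conversion methods that appear in this diff; pickling per the note above):

.. code-block:: python

    import pickle

    from ray.air.checkpoint import Checkpoint

    # Directory --> dict, adding a field along the way.
    path = Checkpoint.from_dict({"data": 123}).to_directory()
    ckpt_dict = Checkpoint.from_directory(path).to_dict()
    ckpt_dict["foo"] = "bar"

    # dict --> directory --> dict: the added field survives the round trip.
    path2 = Checkpoint.from_dict(ckpt_dict).to_directory()
    restored = Checkpoint.from_directory(path2).to_dict()
    assert restored["foo"] == "bar"
    assert restored["data"] == 123

    # Checkpoints can also be pickled and sent to remote processes.
    checkpoint = pickle.loads(pickle.dumps(Checkpoint.from_dict({"data": 123})))
    assert checkpoint.to_dict()["data"] == 123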