.. _air-preprocessors:

Using Preprocessors
===================

Data preprocessing is a common technique for transforming raw data into features for a machine learning model.
In general, you may want to apply the same preprocessing logic to your offline training data and online inference data.
Ray AIR provides several common preprocessors out of the box and interfaces to define your own custom logic.

Overview
--------

Ray AIR exposes a ``Preprocessor`` class with four public methods for data preprocessing:

#. ``fit()``: Compute state information about a :class:`Dataset <ray.data.Dataset>` (e.g., the mean or standard deviation of a column)
   and save it to the ``Preprocessor``. This information is used to perform ``transform()``, and the method is typically called on a
   training dataset.
#. ``transform()``: Apply a transformation to a ``Dataset``.
   If the ``Preprocessor`` is stateful, then ``fit()`` must be called first. This method is typically called on training,
   validation, and test datasets.
#. ``transform_batch()``: Apply a transformation to a single :class:`batch <ray.train.predictor.DataBatchType>` of data.
   This method is typically called on online or offline inference data.
#. ``fit_transform()``: Syntactic sugar for calling both ``fit()`` and ``transform()`` on a ``Dataset``.

To show these methods in action, let's walk through a basic example. First, we'll set up two simple Ray ``Dataset``\s.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __preprocessor_setup_start__
    :end-before: __preprocessor_setup_end__

Next, ``fit`` the ``Preprocessor`` on one ``Dataset``, and then ``transform`` both ``Dataset``\s with this fitted information.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __preprocessor_fit_transform_start__
    :end-before: __preprocessor_fit_transform_end__

Finally, call ``transform_batch`` on a single batch of data.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __preprocessor_transform_batch_start__
    :end-before: __preprocessor_transform_batch_end__
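
To make the semantics of these four methods concrete, here is a minimal plain-Python sketch of the pattern. The ``SimpleMinMaxScaler`` class below is hypothetical: it stands in for a fitted AIR preprocessor and operates on plain lists of dicts rather than Ray ``Dataset``\s.

```python
# A minimal, plain-Python illustration of the fit/transform pattern described
# above. This is NOT Ray AIR's implementation -- just a sketch of the semantics.

class SimpleMinMaxScaler:
    """Scales values in a column to [0, 1] using state computed by fit()."""

    def __init__(self, column):
        self.column = column
        self.stats = None  # populated by fit()

    def fit(self, rows):
        # fit(): compute state information (here, the column's min and max).
        values = [row[self.column] for row in rows]
        self.stats = (min(values), max(values))
        return self

    def transform(self, rows):
        # transform(): apply the transformation using the fitted state.
        assert self.stats is not None, "fit() must be called first"
        lo, hi = self.stats
        return [
            {**row, self.column: (row[self.column] - lo) / (hi - lo)}
            for row in rows
        ]

    def transform_batch(self, rows):
        # transform_batch(): the same logic applied to a single batch,
        # e.g. at inference time.
        return self.transform(rows)

    def fit_transform(self, rows):
        # fit_transform(): syntactic sugar for fit() followed by transform().
        return self.fit(rows).transform(rows)


train = [{"value": 0}, {"value": 5}, {"value": 10}]
scaler = SimpleMinMaxScaler("value")
scaled_train = scaler.fit_transform(train)
scaled_batch = scaler.transform_batch([{"value": 2.5}])  # reuses fitted min/max
```

Note that ``transform_batch`` reuses the state computed during ``fit`` rather than recomputing it from the inference batch; this is what keeps offline training and online inference consistent.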

Life of an AIR Preprocessor
---------------------------

Now that we've covered the basics, let's dive into how ``Preprocessor``\s fit into an end-to-end application built with AIR.
The diagram below depicts an overview of the main steps of a ``Preprocessor``:

#. Passed into a ``Trainer`` to ``fit`` and ``transform`` input ``Dataset``\s
#. Saved as a ``Checkpoint``
#. Reconstructed in a ``Predictor`` to ``transform_batch`` on batches of data

.. figure:: images/air-preprocessor.svg

Throughout this section we'll go through this workflow in more detail, with code examples using XGBoost.
The same logic applies to other machine learning framework integrations as well.

Trainer
~~~~~~~

The journey of the ``Preprocessor`` starts with the :class:`Trainer <ray.train.trainer.BaseTrainer>`.
If the ``Trainer`` is instantiated with a ``Preprocessor``, then the following logic is executed when ``Trainer.fit()`` is called:

#. If a ``"train"`` ``Dataset`` is passed in, then the ``Preprocessor`` calls ``fit()`` on it.
#. The ``Preprocessor`` then calls ``transform()`` on all ``Dataset``\s, including the ``"train"`` ``Dataset``.
#. The ``Trainer`` then performs training on the preprocessed ``Dataset``\s.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __trainer_start__
    :end-before: __trainer_end__
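
The three steps above can be sketched in plain Python. The ``fit_trainer`` function and ``ToyPreprocessor`` class are hypothetical illustrations of the control flow, not Ray AIR's actual ``Trainer`` internals.

```python
# A plain-Python sketch of the Trainer's preprocessing flow (hypothetical
# names; the real logic lives inside Ray AIR's Trainer).

class ToyPreprocessor:
    """Centers values around the mean computed during fit()."""

    def __init__(self):
        self.mean = None

    def fit(self, ds):
        self.mean = sum(ds) / len(ds)
        return self

    def transform(self, ds):
        return [x - self.mean for x in ds]


def fit_trainer(preprocessor, datasets, train_fn):
    # 1. If a "train" dataset is present, fit the preprocessor on it.
    if "train" in datasets:
        preprocessor.fit(datasets["train"])
    # 2. Transform all datasets, including "train".
    transformed = {name: preprocessor.transform(ds) for name, ds in datasets.items()}
    # 3. Train on the preprocessed datasets.
    return train_fn(transformed)


result = fit_trainer(
    ToyPreprocessor(),
    {"train": [1, 2, 3], "valid": [2, 4]},
    train_fn=lambda ds: ds,  # stand-in for the actual training loop
)
```

The key point is that the state (here, the mean) comes only from the ``"train"`` dataset, while the transformation is applied to every dataset.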

.. note::

    If you pass in a ``Preprocessor`` that is already fitted, it is refitted on the ``"train"`` ``Dataset``.
    Adding support for passing in an already fitted ``Preprocessor`` is being tracked
    `here <https://github.com/ray-project/ray/issues/25299>`__.

.. TODO: Remove the note above once the issue is resolved.

Tune
~~~~

If you're using Ray Tune for hyperparameter optimization, be aware that each ``Trial`` instantiates its own copy of
the ``Preprocessor``, and the fitting and transforming logic occurs once per ``Trial``.

Checkpoint
~~~~~~~~~~

``Trainer.fit()`` returns a ``Result`` object which contains a ``Checkpoint``.
If a ``Preprocessor`` is passed into the ``Trainer``, then it is saved in the ``Checkpoint`` along with any fitted state.

As a sanity check, let's confirm the ``Preprocessor`` is available in the ``Checkpoint``. In practice, you don't need to check.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __checkpoint_start__
    :end-before: __checkpoint_end__

Predictor
~~~~~~~~~

A ``Predictor`` can be constructed from a saved ``Checkpoint``. If the ``Checkpoint`` contains a ``Preprocessor``,
then the ``Preprocessor`` calls ``transform_batch`` on input batches prior to performing inference.

In the following example, we show the Batch Predictor flow. The same logic applies to the :ref:`Online Inference flow <air-key-concepts-online-inference>`.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __predictor_start__
    :end-before: __predictor_end__
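
The flow described above can be sketched in plain Python: the preprocessor saved in the checkpoint is reconstructed, and ``transform_batch`` runs on every input batch before the model sees it. The ``FittedScaler`` and ``SimplePredictor`` classes below are hypothetical, not Ray's ``Predictor`` API.

```python
# A sketch of the Predictor flow: the fitted preprocessor travels with the
# checkpoint and is applied automatically before inference.

class FittedScaler:
    """A preprocessor whose state (min/max) was already fitted at train time."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def transform_batch(self, batch):
        return [(x - self.lo) / (self.hi - self.lo) for x in batch]


class SimplePredictor:
    @classmethod
    def from_checkpoint(cls, checkpoint):
        self = cls()
        # Reconstruct the preprocessor (with fitted state) from the checkpoint.
        self.preprocessor = checkpoint["preprocessor"]
        self.model = checkpoint["model"]
        return self

    def predict(self, batch):
        # Preprocessing is applied to each batch before the model runs.
        batch = self.preprocessor.transform_batch(batch)
        return [self.model(x) for x in batch]


checkpoint = {"preprocessor": FittedScaler(0, 10), "model": lambda x: x > 0.5}
predictor = SimplePredictor.from_checkpoint(checkpoint)
preds = predictor.predict([2, 8])
```

Because the preprocessor is bundled with the checkpoint, inference inputs are transformed with exactly the same fitted state used during training.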

Types of Preprocessors
----------------------

Basic Preprocessors
~~~~~~~~~~~~~~~~~~~

Ray AIR provides a handful of ``Preprocessor``\s out of the box, and more will be added over time. We welcome
`contributions <https://docs.ray.io/en/master/getting-involved.html>`__!

.. tabbed:: Common APIs

    #. :class:`Preprocessor <ray.data.preprocessor.Preprocessor>`
    #. :class:`BatchMapper <ray.data.preprocessors.BatchMapper>`
    #. :class:`Chain <ray.data.preprocessors.Chain>`

.. tabbed:: Tabular

    #. :class:`Categorizer <ray.data.preprocessors.Categorizer>`
    #. :class:`Concatenator <ray.data.preprocessors.Concatenator>`
    #. :class:`FeatureHasher <ray.data.preprocessors.FeatureHasher>`
    #. :class:`LabelEncoder <ray.data.preprocessors.LabelEncoder>`
    #. :class:`MaxAbsScaler <ray.data.preprocessors.MaxAbsScaler>`
    #. :class:`MinMaxScaler <ray.data.preprocessors.MinMaxScaler>`
    #. :class:`Normalizer <ray.data.preprocessors.Normalizer>`
    #. :class:`OneHotEncoder <ray.data.preprocessors.OneHotEncoder>`
    #. :class:`OrdinalEncoder <ray.data.preprocessors.OrdinalEncoder>`
    #. :class:`PowerTransformer <ray.data.preprocessors.PowerTransformer>`
    #. :class:`RobustScaler <ray.data.preprocessors.RobustScaler>`
    #. :class:`SimpleImputer <ray.data.preprocessors.SimpleImputer>`
    #. :class:`StandardScaler <ray.data.preprocessors.StandardScaler>`

.. tabbed:: Text

    #. :class:`CountVectorizer <ray.data.preprocessors.CountVectorizer>`
    #. :class:`HashingVectorizer <ray.data.preprocessors.HashingVectorizer>`
    #. :class:`Tokenizer <ray.data.preprocessors.Tokenizer>`

.. tabbed:: Image

    Coming soon!

.. tabbed:: Utilities

    #. :meth:`Dataset.train_test_split <ray.data.Dataset.train_test_split>`

Chaining Preprocessors
~~~~~~~~~~~~~~~~~~~~~~

More often than not, your preprocessing logic contains multiple logical steps or applies different transformations to each column.
A ``Chain`` ``Preprocessor`` can be used to apply individual ``Preprocessor`` operations sequentially.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __chain_start__
    :end-before: __chain_end__

.. tip::

    Keep in mind that the operations are sequential. For example, if you define a ``Preprocessor``
    ``Chain([preprocessorA, preprocessorB])``, then ``preprocessorB.transform()`` is applied
    to the result of ``preprocessorA.transform()``.
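
To illustrate why order matters, here is a minimal plain-Python sketch of sequential chaining. The ``Add``, ``Scale``, and ``SimpleChain`` classes are hypothetical stand-ins, not Ray's ``Chain`` implementation.

```python
# A sketch of sequential chaining: each preprocessor's transform() receives
# the output of the previous one.

class Add:
    def __init__(self, n):
        self.n = n

    def transform(self, rows):
        return [x + self.n for x in rows]


class Scale:
    def __init__(self, k):
        self.k = k

    def transform(self, rows):
        return [x * self.k for x in rows]


class SimpleChain:
    def __init__(self, *preprocessors):
        self.preprocessors = preprocessors

    def transform(self, rows):
        # Order matters: the second preprocessor sees the first one's output.
        for p in self.preprocessors:
            rows = p.transform(rows)
        return rows


forward = SimpleChain(Add(1), Scale(10)).transform([1, 2])   # add, then scale
reverse = SimpleChain(Scale(10), Add(1)).transform([1, 2])   # scale, then add
```

Running ``Add(1)`` before ``Scale(10)`` produces different results than the reverse order, which is why the ordering of preprocessors in a chain is part of your preprocessing logic.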

Custom Preprocessors
~~~~~~~~~~~~~~~~~~~~

**Stateless Preprocessors:** Stateless preprocessors can be implemented with the ``BatchMapper``.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __custom_stateless_start__
    :end-before: __custom_stateless_end__

**Stateful Preprocessors:** Stateful preprocessors can be implemented by extending the
:py:class:`~ray.data.preprocessor.Preprocessor` base class.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __custom_stateful_start__
    :end-before: __custom_stateful_end__
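
The stateful pattern boils down to computing state in a private fitting step and using it in the transform step. The sketch below uses a hypothetical ``BasePreprocessor`` with ``_fit``/``_transform`` hooks to illustrate the idea; the actual hook names and signatures on Ray's ``Preprocessor`` base class may differ, so consult the API reference before subclassing.

```python
# A plain-Python sketch of a custom stateful preprocessor: the subclass
# computes state in _fit() and uses it in _transform(). The base class and
# hook names here are illustrative only.

class BasePreprocessor:
    def fit(self, ds):
        self._fit(ds)
        return self

    def transform(self, ds):
        return self._transform(ds)


class CustomStandardScaler(BasePreprocessor):
    """Standardizes values using the mean and stddev computed during fit()."""

    def _fit(self, values):
        n = len(values)
        self.mean_ = sum(values) / n
        self.std_ = (sum((v - self.mean_) ** 2 for v in values) / n) ** 0.5

    def _transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


scaler = CustomStandardScaler().fit([2.0, 4.0, 6.0])
scaled = scaler.transform([2.0, 4.0, 6.0])  # centered at 0, unit variance
```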