Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
.. _air-preprocessors:
Using preprocessors
Data preprocessing is a common technique for transforming raw data into features for a machine learning model.
In general, you may want to apply the same preprocessing logic to your offline training data and online inference data.
Ray AIR provides several common preprocessors out of the box and interfaces to define your own custom logic.
The most common way of using a preprocessor is by passing it as an argument to the constructor of a :ref:`Trainer <air-trainers>` in conjunction with a :ref:`Ray Dataset <datasets>`.
For example, the following code trains a model with a preprocessor that normalizes the data.
.. literalinclude:: doc_code/preprocessors.py
:language: python
:start-after: __trainer_start__
:end-before: __trainer_end__
The ``Preprocessor`` class has four public methods that can be used separately from a trainer:
#. ``fit()``: Compute state information about a :class:`Dataset <ray.data.Dataset>` (e.g., the mean or standard deviation of a column)
and save it to the ``Preprocessor``. This information is used to perform ``transform()``, and the method is typically called on a
training dataset.
#. ``transform()``: Apply a transformation to a ``Dataset``.
If the ``Preprocessor`` is stateful, then ``fit()`` must be called first. This method is typically called on training,
validation, and test datasets.
#. ``transform_batch()``: Apply a transformation to a single :class:`batch <ray.train.predictor.DataBatchType>` of data. This method is typically called on online or offline inference data.
#. ``fit_transform()``: Syntactic sugar for calling both ``fit()`` and ``transform()`` on a ``Dataset``.
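
To make the contract between these four methods concrete, here is a minimal plain-Python sketch. It deliberately does not use Ray: ``MinMaxSketch`` and the list-of-dicts "dataset" are hypothetical stand-ins, assumed only for illustration, and the real ``Preprocessor`` operates on a ``Dataset`` rather than a Python list.

.. code-block:: python

    from typing import Dict, List, Optional

    Row = Dict[str, float]


    class MinMaxSketch:
        """Hypothetical stateful preprocessor; not Ray's implementation."""

        def __init__(self, column: str):
            self.column = column
            self.stats: Optional[Dict[str, float]] = None  # filled in by fit()

        def fit(self, rows: List[Row]) -> "MinMaxSketch":
            # Compute state from the data and save it on the preprocessor.
            values = [row[self.column] for row in rows]
            self.stats = {"min": min(values), "max": max(values)}
            return self

        def transform_batch(self, batch: List[Row]) -> List[Row]:
            # Stateful preprocessors need fitted state before transforming.
            if self.stats is None:
                raise RuntimeError("fit() must be called before transforming")
            low, high = self.stats["min"], self.stats["max"]
            span = (high - low) or 1.0  # guard against constant columns
            return [
                {**row, self.column: (row[self.column] - low) / span}
                for row in batch
            ]

        def transform(self, rows: List[Row]) -> List[Row]:
            # A real dataset would be processed batch by batch; this sketch
            # treats the whole list as one batch.
            return self.transform_batch(rows)

        def fit_transform(self, rows: List[Row]) -> List[Row]:
            return self.fit(rows).transform(rows)


    train = [{"value": 0.0}, {"value": 5.0}, {"value": 10.0}]
    valid = [{"value": 2.5}, {"value": 20.0}]

    scaler = MinMaxSketch("value")
    train_out = scaler.fit_transform(train)  # values become 0.0, 0.5, 1.0
    valid_out = scaler.transform(valid)      # values become 0.25, 2.0

Because the minimum and maximum come from the training data only, transformed validation values can fall outside the ``[0, 1]`` range, as the last row shows.
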
To show these methods in action, let's walk through a basic example. First, we'll set up two simple Ray ``Dataset``\s.
.. literalinclude:: doc_code/preprocessors.py
:language: python
:start-after: __preprocessor_setup_start__
:end-before: __preprocessor_setup_end__
Next, ``fit`` the ``Preprocessor`` on one ``Dataset``, and then ``transform`` both ``Dataset``\s with this fitted information.
.. literalinclude:: doc_code/preprocessors.py
:language: python
:start-after: __preprocessor_fit_transform_start__
:end-before: __preprocessor_fit_transform_end__
Finally, call ``transform_batch`` on a single batch of data.
.. literalinclude:: doc_code/preprocessors.py
:language: python
:start-after: __preprocessor_transform_batch_start__
:end-before: __preprocessor_transform_batch_end__
Life of an AIR preprocessor
Now that we've gone over the basics, let's dive into how ``Preprocessor``\s fit into an end-to-end application built with AIR.
The diagram below depicts an overview of the main steps of a ``Preprocessor``:
#. Passed into a ``Trainer`` to ``fit`` and ``transform`` input ``Dataset``\s
#. Saved as a ``Checkpoint``
#. Reconstructed in a ``Predictor`` to ``transform_batch`` on batches of data
.. figure:: images/air-preprocessor.svg
Throughout this section we'll go through this workflow in more detail, with code examples using XGBoost.
The same logic is applicable to other machine learning framework integrations as well.
The journey of the ``Preprocessor`` starts with the :class:`Trainer <ray.train.trainer.BaseTrainer>`.
If the ``Trainer`` is instantiated with a ``Preprocessor``, then the following logic is executed when ``Trainer.fit()`` is called:
#. If a ``"train"`` ``Dataset`` is passed in, then the ``Preprocessor`` calls ``fit()`` on it.
#. The ``Preprocessor`` then calls ``transform()`` on all ``Dataset``\s, including the ``"train"`` ``Dataset``.
#. The ``Trainer`` then performs training on the preprocessed ``Dataset``\s.
.. literalinclude:: doc_code/preprocessors.py
:language: python
:start-after: __trainer_start__
:end-before: __trainer_end__
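
The first two steps of that logic can be sketched in plain Python. This is a simplified illustration rather than the ``Trainer`` source: ``preprocess_datasets`` and ``MeanCenterSketch`` are hypothetical names, and lists of floats stand in for ``Dataset``\s.

.. code-block:: python

    from statistics import mean


    class MeanCenterSketch:
        """Hypothetical preprocessor that records the order of its calls."""

        def __init__(self):
            self.mean_ = None
            self.calls = []

        def fit(self, values):
            self.calls.append("fit")
            self.mean_ = mean(values)
            return self

        def transform(self, values):
            self.calls.append("transform")
            return [v - self.mean_ for v in values]


    def preprocess_datasets(preprocessor, datasets):
        """Hypothetical helper: fit on "train", then transform every dataset."""
        if "train" in datasets:
            preprocessor.fit(datasets["train"])
        return {
            name: preprocessor.transform(data) for name, data in datasets.items()
        }


    datasets = {"train": [1.0, 2.0, 3.0], "valid": [4.0]}
    prep = MeanCenterSketch()
    processed = preprocess_datasets(prep, datasets)
    # prep.calls is ["fit", "transform", "transform"]
    # processed is {"train": [-1.0, 0.0, 1.0], "valid": [2.0]}

The validation data is centered with the training mean, which is the point of fitting once and reusing the fitted state everywhere else.
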
.. note::
If you're passing a ``Preprocessor`` that is already fitted, it is refitted on the ``"train"`` ``Dataset``.
Adding the functionality to support passing in a fitted Preprocessor is being tracked
`here <https://github.com/ray-project/ray/issues/25299>`__.
.. TODO: Remove the note above once the issue is resolved.
If you're using ``Ray Tune`` for hyperparameter optimization, be aware that each ``Trial`` instantiates its own copy of
the ``Preprocessor`` and the fitting and transforming logic occur once per ``Trial``.
``Trainer.fit()`` returns a ``Result`` object which contains a ``Checkpoint``.
If a ``Preprocessor`` is passed into the ``Trainer``, then it is saved in the ``Checkpoint`` along with any fitted state.
As a sanity check, let's confirm the ``Preprocessor`` is available in the ``Checkpoint``. In practice, you don't need to check.
.. literalinclude:: doc_code/preprocessors.py
:language: python
:start-after: __checkpoint_start__
:end-before: __checkpoint_end__
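
Conceptually, saving a preprocessor "along with any fitted state" means the statistics computed during ``fit()`` survive a round trip through storage. The sketch below illustrates that idea with JSON and a hypothetical ``MaxAbsSketch`` class. The dictionary layout and key names are assumptions made for this example, not the actual ``Checkpoint`` format.

.. code-block:: python

    import json


    class MaxAbsSketch:
        """Hypothetical preprocessor whose fitted state is a single number."""

        def __init__(self, max_abs=None):
            self.max_abs = max_abs

        def fit(self, values):
            self.max_abs = max(abs(v) for v in values)
            return self

        def transform_batch(self, batch):
            return [v / self.max_abs for v in batch]


    fitted = MaxAbsSketch().fit([-4.0, 2.0, 1.0])

    # Store the fitted state next to a placeholder for the model artifact.
    checkpoint = json.dumps(
        {"model": "placeholder", "preprocessor": {"max_abs": fitted.max_abs}}
    )

    # Later, possibly in another process: rebuild the preprocessor from state.
    state = json.loads(checkpoint)["preprocessor"]
    restored = MaxAbsSketch(**state)
    restored_out = restored.transform_batch([2.0, -4.0])  # [0.5, -1.0]

The restored object never sees the training data again, yet it transforms new batches exactly as the original would.
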
A ``Predictor`` can be constructed from a saved ``Checkpoint``. If the ``Checkpoint`` contains a ``Preprocessor``,
then the ``Preprocessor`` calls ``transform_batch`` on input batches prior to performing inference.
In the following example, we show the Batch Predictor flow. The same logic applies to the :ref:`Online Inference flow <air-key-concepts-online-inference>`.
.. literalinclude:: doc_code/preprocessors.py
:language: python
:start-after: __predictor_start__
:end-before: __predictor_end__
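
The ordering described above, preprocessing first and inference second, can be sketched as follows. ``PredictorSketch``, ``HalveSketch``, and ``toy_model`` are hypothetical names used only for this illustration; the real ``Predictor`` classes wrap framework-specific models.

.. code-block:: python

    class HalveSketch:
        """Hypothetical stateless preprocessor."""

        def transform_batch(self, batch):
            return [x / 2 for x in batch]


    class PredictorSketch:
        """Hypothetical predictor that preprocesses before running the model."""

        def __init__(self, model, preprocessor=None):
            self.model = model
            self.preprocessor = preprocessor

        def predict(self, batch):
            if self.preprocessor is not None:
                batch = self.preprocessor.transform_batch(batch)
            return [self.model(x) for x in batch]


    def toy_model(x):
        return x + 1


    with_prep = PredictorSketch(toy_model, HalveSketch()).predict([2.0, 4.0])
    without_prep = PredictorSketch(toy_model).predict([2.0, 4.0])
    # with_prep is [2.0, 3.0]; without_prep is [3.0, 5.0]

The two results differ only because of the preprocessing step, which is why the preprocessor has to travel with the model.
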
Types of preprocessors
Built-in preprocessors
Ray AIR provides a handful of preprocessors out of the box.
**Generic preprocessors**
.. autosummary::
**Categorical encoders**
.. autosummary::
**Feature scalers**
.. autosummary::
**Text encoders**
.. autosummary::
.. autosummary::
Which preprocessor should you use?
The type of preprocessor you use depends on what your data looks like. This section
provides tips on handling common data formats.
General-purpose preprocessors
There are many general-purpose preprocessors you can employ to transform
your data. For example, you can chain preprocessors, fill in missing values, or
implement custom preprocessors.
Chaining preprocessors
If you need to apply more than one preprocessor, compose them together with
:class:`~ray.data.preprocessors.Chain`.
.. literalinclude:: doc_code/preprocessors.py
:language: python
:start-after: __chain_start__
:end-before: __chain_end__
.. tip::
:class:`~ray.data.preprocessors.Chain` applies ``fit`` and ``transform``
sequentially. For example, if you construct
``Chain(preprocessorA, preprocessorB)``, then ``preprocessorB.transform`` is applied
to the result of ``preprocessorA.transform``.
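
The sequential behavior can be illustrated with a small plain-Python sketch. ``ChainSketch`` and the two stage classes are hypothetical, and the sketch assumes that each stage is fitted on the output of the stage before it.

.. code-block:: python

    from statistics import mean


    class CenterSketch:
        def fit(self, values):
            self.mean_ = mean(values)
            return self

        def transform(self, values):
            return [v - self.mean_ for v in values]


    class MaxAbsSketch:
        def fit(self, values):
            self.max_abs = max(abs(v) for v in values)
            return self

        def transform(self, values):
            return [v / self.max_abs for v in values]


    class ChainSketch:
        """Hypothetical composition of preprocessors, applied in order."""

        def __init__(self, *stages):
            self.stages = stages

        def fit_transform(self, values):
            # Each stage is fitted on the previous stage's transformed output.
            for stage in self.stages:
                values = stage.fit(values).transform(values)
            return values

        def transform(self, values):
            for stage in self.stages:
                values = stage.transform(values)
            return values


    chain = ChainSketch(CenterSketch(), MaxAbsSketch())
    train_out = chain.fit_transform([2.0, 4.0, 9.0])  # [-0.75, -0.25, 1.0]
    new_out = chain.transform([13.0])                 # [2.0]

Note that the second stage learns a maximum absolute value of 4.0 (from the centered data), not 9.0 (from the raw data), so the order of the stages matters.
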
Filling in missing values
If your dataset contains missing values, replace them with
:class:`~ray.data.preprocessors.SimpleImputer`.
.. literalinclude:: doc_code/preprocessors.py
:language: python
:start-after: __simple_imputer_start__
:end-before: __simple_imputer_end__
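
As a rough illustration of what imputation does, the sketch below computes a fill value from the observed entries and then substitutes it for the missing ones. The function names and the strategy names (``"mean"``, ``"most_frequent"``, ``"constant"``) are assumptions chosen for this example, and ``None`` stands in for a missing value.

.. code-block:: python

    from collections import Counter
    from statistics import mean


    def fit_fill_value(values, strategy="mean", constant=None):
        """Hypothetical helper: derive a fill value from the observed entries."""
        present = [v for v in values if v is not None]
        if strategy == "mean":
            return mean(present)
        if strategy == "most_frequent":
            return Counter(present).most_common(1)[0][0]
        if strategy == "constant":
            return constant
        raise ValueError(f"unknown strategy: {strategy}")


    def impute(values, fill):
        return [fill if v is None else v for v in values]


    ages = [1.0, None, 3.0]
    ages_filled = impute(ages, fit_fill_value(ages))  # [1.0, 2.0, 3.0]

    pets = ["cat", "dog", "cat", None]
    pets_filled = impute(pets, fit_fill_value(pets, strategy="most_frequent"))
    # ["cat", "dog", "cat", "cat"]

As with other stateful preprocessors, the fill value is computed once on the training data and reused when transforming other datasets.
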
.. _air-custom-preprocessors:
Implementing custom preprocessors
If you want to implement a custom preprocessor that needs to be fit, extend the
:class:`~ray.data.preprocessor.Preprocessor` base class.
.. literalinclude:: doc_code/preprocessors.py
:language: python
:start-after: __custom_stateful_start__
:end-before: __custom_stateful_end__
If your preprocessor doesn't need to be fit, construct a
:class:`~ray.data.preprocessors.BatchMapper`. A ``BatchMapper`` can drop, add, or modify columns.
.. literalinclude:: doc_code/preprocessors.py
:language: python
:start-after: __custom_stateless_start__
:end-before: __custom_stateless_end__
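
The idea behind a stateless, batch-wise preprocessor is simply a function applied to one batch at a time. The following plain-Python sketch shows that pattern. ``map_batches_sketch`` and ``add_area`` are hypothetical names, and lists of dicts stand in for real batches.

.. code-block:: python

    def map_batches_sketch(rows, fn, batch_size):
        """Hypothetical helper: apply ``fn`` to consecutive slices of ``rows``."""
        out = []
        for start in range(0, len(rows), batch_size):
            out.extend(fn(rows[start:start + batch_size]))
        return out


    def add_area(batch):
        # Adds an "area" column and drops "height"; no fitted state is needed.
        return [
            {"width": row["width"], "area": row["width"] * row["height"]}
            for row in batch
        ]


    rows = [
        {"width": 2, "height": 3},
        {"width": 4, "height": 5},
        {"width": 1, "height": 1},
    ]
    mapped = map_batches_sketch(rows, add_area, batch_size=2)
    # [{"width": 2, "area": 6}, {"width": 4, "area": 20}, {"width": 1, "area": 1}]

Because the function depends only on the batch it receives, the result is the same for any batch size.
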
Categorical data
Most models expect numerical inputs. To represent your categorical data in a way your
model can understand, encode categories using one of the preprocessors described below.
.. list-table::
:header-rows: 1
* - Categorical Data Type
- Example
- Preprocessor
* - Labels
- ``"cat"``, ``"dog"``, ``"airplane"``
- :class:`~ray.data.preprocessors.LabelEncoder`
* - Ordered categories
- ``"bs"``, ``"md"``, ``"phd"``
- :class:`~ray.data.preprocessors.OrdinalEncoder`
* - Unordered categories
- ``"red"``, ``"green"``, ``"blue"``
- :class:`~ray.data.preprocessors.OneHotEncoder`
* - Lists of categories
- ``("sci-fi", "action")``, ``("action", "comedy", "animated")``
- :class:`~ray.data.preprocessors.MultiHotEncoder`
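
To show how these encodings differ, here is a plain-Python sketch applied to the examples from the table. The helper functions are hypothetical and assume that categories are indexed in sorted order; they illustrate the encodings themselves, not the behavior of the Ray classes. Label encoding is not shown separately because it applies the same integer mapping as ordinal encoding to the label column.

.. code-block:: python

    def fit_index(values):
        """Map each distinct category to an integer, in sorted order."""
        return {category: i for i, category in enumerate(sorted(set(values)))}


    def ordinal_encode(values, index):
        return [index[v] for v in values]


    def one_hot_encode(values, index):
        return [
            [1 if index[v] == i else 0 for i in range(len(index))]
            for v in values
        ]


    def multi_hot_encode(lists, index):
        return [
            [1 if category in items else 0 for category in index]
            for items in lists
        ]


    degrees = ["bs", "md", "phd", "md"]
    ordinal = ordinal_encode(degrees, fit_index(degrees))  # [0, 1, 2, 1]

    colors = ["red", "green", "blue"]
    one_hot = one_hot_encode(colors, fit_index(colors))
    # index is blue=0, green=1, red=2, so "red" becomes [0, 0, 1]

    genres = [("sci-fi", "action"), ("action", "comedy", "animated")]
    genre_index = fit_index([g for items in genres for g in items])
    multi_hot = multi_hot_encode(genres, genre_index)
    # columns are action, animated, comedy, sci-fi

Ordinal encoding implies an order between categories, while one-hot encoding gives every category its own column, which is why the latter suits unordered categories.
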
.. note::
If you're using LightGBM, you don't need to encode your categorical data. Instead,
use :class:`~ray.data.preprocessors.Categorizer` to convert your data to
``pandas.CategoricalDtype``.
Numerical data
To ensure your model behaves properly, normalize your numerical data. Reference the
table below to determine which preprocessor to use.
.. list-table::
:header-rows: 1
* - Data Property
- Preprocessor
* - Your data is approximately normal
- :class:`~ray.data.preprocessors.StandardScaler`
* - Your data is sparse
- :class:`~ray.data.preprocessors.MaxAbsScaler`
* - Your data contains many outliers
- :class:`~ray.data.preprocessors.RobustScaler`
* - Your data isn't normal, but you need it to be
- :class:`~ray.data.preprocessors.PowerTransformer`
* - You need unit-norm rows
- :class:`~ray.data.preprocessors.Normalizer`
* - You aren't sure what your data looks like
- :class:`~ray.data.preprocessors.MinMaxScaler`
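
The differences between several of these scalers are easiest to see on a small column with one outlier. The sketch below uses textbook formulas in plain Python. The function names are hypothetical, and details such as the population standard deviation and the inclusive quartile method are assumptions of this sketch that may differ from the Ray implementations.

.. code-block:: python

    import math
    from statistics import mean, median, pstdev, quantiles


    def standard_scale(values):
        mu, sigma = mean(values), pstdev(values)
        return [(v - mu) / sigma for v in values]


    def min_max_scale(values):
        low, high = min(values), max(values)
        return [(v - low) / (high - low) for v in values]


    def max_abs_scale(values):
        biggest = max(abs(v) for v in values)
        return [v / biggest for v in values]


    def robust_scale(values):
        # Center on the median and divide by the interquartile range.
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        mid = median(values)
        return [(v - mid) / (q3 - q1) for v in values]


    def unit_norm(row):
        # Operates on one row (across columns), not on a column.
        norm = math.sqrt(sum(v * v for v in row))
        return [v / norm for v in row]


    values = [1.0, 2.0, 3.0, 4.0, 100.0]

    min_max = min_max_scale(values)  # the outlier squeezes 1..4 close to 0
    robust = robust_scale(values)    # [-1.0, -0.5, 0.0, 0.5, 48.5]
    standard = standard_scale(values)
    max_abs = max_abs_scale(values)
    row = unit_norm([3.0, 4.0])      # approximately [0.6, 0.8]

With min-max scaling, the single outlier compresses the other four values into roughly the bottom 3% of the range, whereas robust scaling keeps them evenly spread around zero.
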
.. warning::
These preprocessors operate on numeric columns. If your dataset contains columns of
type :class:`~ray.air.util.tensor_extensions.pandas.TensorDtype`, you may need to
:ref:`implement a custom preprocessor <air-custom-preprocessors>`.
Additionally, if your model expects a tensor or ``ndarray``, create a tensor using
:class:`~ray.data.preprocessors.Concatenator`.
.. tip::
Built-in feature scalers like :class:`~ray.data.preprocessors.StandardScaler` don't
work on :class:`~ray.air.util.tensor_extensions.pandas.TensorDtype` columns, so apply
:class:`~ray.data.preprocessors.Concatenator` after feature scaling. Combine feature
scaling and concatenation into a single preprocessor with
:class:`~ray.data.preprocessors.Chain`.
.. literalinclude:: doc_code/preprocessors.py
:language: python
:start-after: __concatenate_start__
:end-before: __concatenate_end__
Text data
A `document-term matrix <https://en.wikipedia.org/wiki/Document-term_matrix>`_ is a
table that describes text data. It's useful for natural language processing.
To generate a document-term matrix from a collection of documents, use
:class:`~ray.data.preprocessors.HashingVectorizer` or
:class:`~ray.data.preprocessors.CountVectorizer`. If you already know the frequency of
tokens and want to store the data in a document-term matrix, use
:class:`~ray.data.preprocessors.FeatureHasher`.
.. list-table::
:header-rows: 1
* - Requirement
- Preprocessor
* - You care about memory efficiency
- :class:`~ray.data.preprocessors.HashingVectorizer`
* - You care about model interpretability
- :class:`~ray.data.preprocessors.CountVectorizer`
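
The trade-off in this table can be illustrated with a plain-Python sketch of both approaches. The function names, the whitespace tokenizer, and the choice of SHA-256 as the hash are assumptions of this example and do not describe how the Ray preprocessors are implemented.

.. code-block:: python

    import hashlib
    from collections import Counter

    docs = ["the cat sat", "the dog sat on the cat"]


    def count_vectorize(docs):
        """One column per distinct token; the vocabulary must be stored."""
        vocab = sorted({token for doc in docs for token in doc.split()})
        rows = []
        for doc in docs:
            counts = Counter(doc.split())
            rows.append([counts[token] for token in vocab])
        return vocab, rows


    def hashing_vectorize(docs, num_features):
        """Fixed number of columns; tokens are assigned to columns by hash."""
        rows = []
        for doc in docs:
            row = [0] * num_features
            for token in doc.split():
                digest = hashlib.sha256(token.encode("utf-8")).digest()
                row[int.from_bytes(digest[:8], "big") % num_features] += 1
            rows.append(row)
        return rows


    vocab, counted = count_vectorize(docs)
    # vocab is ["cat", "dog", "on", "sat", "the"]
    # counted is [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]

    hashed = hashing_vectorize(docs, num_features=4)
    # Always 4 columns, but a column can no longer be traced back to a token.

The count-based matrix keeps a readable vocabulary at the cost of one column per distinct token. The hashed matrix has a fixed width regardless of vocabulary size, but different tokens can collide in the same column.
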