mirror of
https://github.com/vale981/ray
synced 2025-03-06 02:21:39 -05:00
341 lines
12 KiB
ReStructuredText
341 lines
12 KiB
ReStructuredText
.. _air-preprocessors:
|
|
|
|
Using Preprocessors
|
|
===================
|
|
|
|
Data preprocessing is a common technique for transforming raw data into features for a machine learning model.
|
|
In general, you may want to apply the same preprocessing logic to your offline training data and online inference data.
|
|
Ray AIR provides several common preprocessors out of the box and interfaces to define your own custom logic.
|
|
|
|
.. https://docs.google.com/drawings/d/1ZIbsXv5vvwTVIEr2aooKxuYJ_VL7-8VMNlRinAiPaTI/edit
|
|
|
|
.. image:: images/preprocessors.svg
|
|
|
|
|
|
Overview
|
|
--------
|
|
|
|
The most common way of using a preprocessor is by passing it as an argument to the constructor of a :ref:`Trainer <air-trainers>` in conjunction with a :ref:`Ray Dataset <datasets>`.
|
|
For example, the following code trains a model with a preprocessor that normalizes the data.
|
|
|
|
.. literalinclude:: doc_code/preprocessors.py
|
|
:language: python
|
|
:start-after: __trainer_start__
|
|
:end-before: __trainer_end__
|
|
|
|
The ``Preprocessor`` class with four public methods that can we used separately from a trainer:
|
|
|
|
#. ``fit()``: Compute state information about a :class:`Dataset <ray.data.Dataset>` (e.g., the mean or standard deviation of a column)
|
|
and save it to the ``Preprocessor``. This information is used to perform ``transform()``, and the method is typically called on a
|
|
training dataset.
|
|
#. ``transform()``: Apply a transformation to a ``Dataset``.
|
|
If the ``Preprocessor`` is stateful, then ``fit()`` must be called first. This method is typically called on training,
|
|
validation, and test datasets.
|
|
#. ``transform_batch()``: Apply a transformation to a single :class:`batch <ray.train.predictor.DataBatchType>` of data. This method is typically called on online or offline inference data.
|
|
#. ``fit_transform()``: Syntactic sugar for calling both ``fit()`` and ``transform()`` on a ``Dataset``.
|
|
|
|
To show these methods in action, let's walk through a basic example. First, we'll set up two simple Ray ``Dataset``\s.
|
|
|
|
.. literalinclude:: doc_code/preprocessors.py
|
|
:language: python
|
|
:start-after: __preprocessor_setup_start__
|
|
:end-before: __preprocessor_setup_end__
|
|
|
|
Next, ``fit`` the ``Preprocessor`` on one ``Dataset``, and then ``transform`` both ``Dataset``\s with this fitted information.
|
|
|
|
.. literalinclude:: doc_code/preprocessors.py
|
|
:language: python
|
|
:start-after: __preprocessor_fit_transform_start__
|
|
:end-before: __preprocessor_fit_transform_end__
|
|
|
|
Finally, call ``transform_batch`` on a single batch of data.
|
|
|
|
.. literalinclude:: doc_code/preprocessors.py
|
|
:language: python
|
|
:start-after: __preprocessor_transform_batch_start__
|
|
:end-before: __preprocessor_transform_batch_end__
|
|
|
|
Life of an AIR preprocessor
|
|
---------------------------
|
|
|
|
Now that we've gone over the basics, let's dive into how ``Preprocessor``\s fit into an end-to-end application built with AIR.
|
|
The diagram below depicts an overview of the main steps of a ``Preprocessor``:
|
|
|
|
#. Passed into a ``Trainer`` to ``fit`` and ``transform`` input ``Dataset``\s
|
|
#. Saved as a ``Checkpoint``
|
|
#. Reconstructed in a ``Predictor`` to ``fit_batch`` on batches of data
|
|
|
|
.. figure:: images/air-preprocessor.svg
|
|
|
|
Throughout this section we'll go through this workflow in more detail, with code examples using XGBoost.
|
|
The same logic is applicable to other machine learning framework integrations as well.
|
|
|
|
Trainer
|
|
~~~~~~~
|
|
|
|
The journey of the ``Preprocessor`` starts with the :class:`Trainer <ray.train.trainer.BaseTrainer>`.
|
|
If the ``Trainer`` is instantiated with a ``Preprocessor``, then the following logic is executed when ``Trainer.fit()`` is called:
|
|
|
|
#. If a ``"train"`` ``Dataset`` is passed in, then the ``Preprocessor`` calls ``fit()`` on it.
|
|
#. The ``Preprocessor`` then calls ``transform()`` on all ``Dataset``\s, including the ``"train"`` ``Dataset``.
|
|
#. The ``Trainer`` then performs training on the preprocessed ``Dataset``\s.
|
|
|
|
.. literalinclude:: doc_code/preprocessors.py
|
|
:language: python
|
|
:start-after: __trainer_start__
|
|
:end-before: __trainer_end__
|
|
|
|
.. note::
|
|
|
|
If you're passing a ``Preprocessor`` that is already fitted, it is refitted on the ``"train"`` ``Dataset``.
|
|
Adding the functionality to support passing in a fitted Preprocessor is being tracked
|
|
`here <https://github.com/ray-project/ray/issues/25299>`__.
|
|
|
|
.. TODO: Remove the note above once the issue is resolved.
|
|
|
|
Tune
|
|
~~~~
|
|
|
|
If you're using ``Ray Tune`` for hyperparameter optimization, be aware that each ``Trial`` instantiates its own copy of
|
|
the ``Preprocessor`` and the fitting and transforming logic occur once per ``Trial``.
|
|
|
|
Checkpoint
|
|
~~~~~~~~~~
|
|
|
|
``Trainer.fit()`` returns a ``Result`` object which contains a ``Checkpoint``.
|
|
If a ``Preprocessor`` is passed into the ``Trainer``, then it is saved in the ``Checkpoint`` along with any fitted state.
|
|
|
|
As a sanity check, let's confirm the ``Preprocessor`` is available in the ``Checkpoint``. In practice, you don't need to check.
|
|
|
|
.. literalinclude:: doc_code/preprocessors.py
|
|
:language: python
|
|
:start-after: __checkpoint_start__
|
|
:end-before: __checkpoint_end__
|
|
|
|
|
|
Predictor
|
|
~~~~~~~~~
|
|
|
|
A ``Predictor`` can be constructed from a saved ``Checkpoint``. If the ``Checkpoint`` contains a ``Preprocessor``,
|
|
then the ``Preprocessor`` calls ``transform_batch`` on input batches prior to performing inference.
|
|
|
|
In the following example, we show the Batch Predictor flow. The same logic applies to the :ref:`Online Inference flow <air-key-concepts-online-inference>`.
|
|
|
|
.. literalinclude:: doc_code/preprocessors.py
|
|
:language: python
|
|
:start-after: __predictor_start__
|
|
:end-before: __predictor_end__
|
|
|
|
Types of preprocessors
|
|
----------------------
|
|
|
|
Built-in preprocessors
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Ray AIR provides a handful of preprocessors out of the box.
|
|
|
|
**Generic preprocessors**
|
|
|
|
.. autosummary::
|
|
:nosignatures:
|
|
|
|
ray.data.preprocessors.BatchMapper
|
|
ray.data.preprocessors.Chain
|
|
ray.data.preprocessors.Concatenator
|
|
ray.data.preprocessor.Preprocessor
|
|
ray.data.preprocessors.SimpleImputer
|
|
|
|
**Categorical encoders**
|
|
|
|
.. autosummary::
|
|
:nosignatures:
|
|
|
|
ray.data.preprocessors.Categorizer
|
|
ray.data.preprocessors.LabelEncoder
|
|
ray.data.preprocessors.MultiHotEncoder
|
|
ray.data.preprocessors.OneHotEncoder
|
|
ray.data.preprocessors.OrdinalEncoder
|
|
|
|
**Feature scalers**
|
|
|
|
.. autosummary::
|
|
:nosignatures:
|
|
|
|
ray.data.preprocessors.MaxAbsScaler
|
|
ray.data.preprocessors.MinMaxScaler
|
|
ray.data.preprocessors.Normalizer
|
|
ray.data.preprocessors.PowerTransformer
|
|
ray.data.preprocessors.RobustScaler
|
|
ray.data.preprocessors.StandardScaler
|
|
|
|
**Text encoders**
|
|
|
|
.. autosummary::
|
|
:nosignatures:
|
|
|
|
ray.data.preprocessors.CountVectorizer
|
|
ray.data.preprocessors.HashingVectorizer
|
|
ray.data.preprocessors.Tokenizer
|
|
ray.data.preprocessors.FeatureHasher
|
|
|
|
**Utilities**
|
|
|
|
.. autosummary::
|
|
:nosignatures:
|
|
|
|
ray.data.Dataset.train_test_split
|
|
|
|
Which preprocessor should you use?
|
|
----------------------------------
|
|
|
|
The type of preprocessor you use depends on what your data looks like. This section
|
|
provides tips on handling common data formats.
|
|
|
|
Categorical data
|
|
~~~~~~~~~~~~~~~~
|
|
|
|
Most models expect numerical inputs. To represent your categorical data in a way your
|
|
model can understand, encode categories using one of the preprocessors described below.
|
|
|
|
.. list-table::
|
|
:header-rows: 1
|
|
|
|
* - Categorical Data Type
|
|
- Example
|
|
- Preprocessor
|
|
* - Labels
|
|
- ``"cat"``, ``"dog"``, ``"airplane"``
|
|
- :class:`~ray.data.preprocessors.LabelEncoder`
|
|
* - Ordered categories
|
|
- ``"bs"``, ``"md"``, ``"phd"``
|
|
- :class:`~ray.data.preprocessors.OrdinalEncoder`
|
|
* - Unordered categories
|
|
- ``"red"``, ``"green"``, ``"blue"``
|
|
- :class:`~ray.data.preprocessors.OneHotEncoder`
|
|
* - Lists of categories
|
|
- ``("sci-fi", "action")``, ``("action", "comedy", "animated")``
|
|
- :class:`~ray.data.preprocessors.MultiHotEncoder`
|
|
|
|
.. note::
|
|
If you're using LightGBM, you don't need to encode your categorical data. Instead,
|
|
use :class:`~ray.data.preprocessors.Categorizer` to convert your data to
|
|
`pandas.CategoricalDtype`.
|
|
|
|
Numerical data
|
|
~~~~~~~~~~~~~~
|
|
|
|
To ensure your models behaves properly, normalize your numerical data. Reference the
|
|
table below to determine which preprocessor to use.
|
|
|
|
.. list-table::
|
|
:header-rows: 1
|
|
|
|
* - Data Property
|
|
- Preprocessor
|
|
* - Your data is approximately normal
|
|
- :class:`~ray.data.preprocessors.StandardScaler`
|
|
* - Your data is sparse
|
|
- :class:`~ray.data.preprocessors.MaxAbsScaler`
|
|
* - Your data contains many outliers
|
|
- :class:`~ray.data.preprocessors.RobustScaler`
|
|
* - Your data isn't normal, but you need it to be
|
|
- :class:`~ray.data.preprocessors.PowerTransformer`
|
|
* - You need unit-norm rows
|
|
- :class:`~ray.data.preprocessors.Normalizer`
|
|
* - You aren't sure what your data looks like
|
|
- :class:`~ray.data.preprocessors.MinMaxScaler`
|
|
|
|
.. warning::
|
|
These preprocessors operate on numeric columns. If your dataset contains columns of
|
|
type :class:`~ray.air.util.tensor_extensions.pandas.TensorDtype`, you may need to
|
|
:ref:`implement a custom preprocessor <air-custom-preprocessors>`.
|
|
|
|
Additionally, if your model expects a tensor or ``ndarray``, create a tensor using
|
|
:class:`~ray.data.preprocessors.Concatenator`.
|
|
|
|
.. tip::
|
|
Built-in feature scalers like :class:`~ray.data.preprocessors.StandardScaler` don't
|
|
work on :class:`~ray.air.util.tensor_extensions.pandas.TensorDtype` columns, so apply
|
|
:class:`~ray.data.preprocessors.Concatenator` after feature scaling. Combine feature
|
|
scaling and concatenation into a single preprocessor with
|
|
:class:`~ray.data.preprocessors.Chain`.
|
|
|
|
.. literalinclude:: doc_code/preprocessors.py
|
|
:language: python
|
|
:start-after: __concatenate_start__
|
|
:end-before: __concatenate_end__
|
|
|
|
Text data
|
|
~~~~~~~~~
|
|
|
|
A `document-term matrix <https://en.wikipedia.org/wiki/Document-term_matrix>`_ is a
|
|
table that describes text data, often used in natural language processing.
|
|
|
|
To generate a document-term matrix from a collection of documents, use
|
|
:class:`~ray.data.preprocessors.HashingVectorizer` or
|
|
:class:`~ray.data.preprocessors.CountVectorizer`. If you already know the frequency of
|
|
tokens and want to store the data in a document-term matrix, use
|
|
:class:`~ray.data.preprocessors.FeatureHasher`.
|
|
|
|
.. list-table::
|
|
:header-rows: 1
|
|
|
|
* - Requirement
|
|
- Preprocessor
|
|
* - You care about memory efficiency
|
|
- :class:`~ray.data.preprocessors.HashingVectorizer`
|
|
* - You care about model interpretability
|
|
- :class:`~ray.data.preprocessors.CountVectorizer`
|
|
|
|
|
|
Filling in missing values
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
If your dataset contains missing values, replace them with
|
|
:class:`~ray.data.preprocessors.SimpleImputer`.
|
|
|
|
.. literalinclude:: doc_code/preprocessors.py
|
|
:language: python
|
|
:start-after: __simple_imputer_start__
|
|
:end-before: __simple_imputer_end__
|
|
|
|
|
|
Chaining preprocessors
|
|
~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
If you need to apply more than one preprocessor, compose them together with
|
|
:class:`~ray.data.preprocessors.Chain`.
|
|
|
|
:class:`~ray.data.preprocessors.Chain` applies ``fit`` and ``transform``
|
|
sequentially. For example, if you construct
|
|
``Chain(preprocessorA, preprocessorB)``, then ``preprocessorB.transform`` is applied
|
|
to the result of ``preprocessorA.transform``.
|
|
|
|
|
|
.. literalinclude:: doc_code/preprocessors.py
|
|
:language: python
|
|
:start-after: __chain_start__
|
|
:end-before: __chain_end__
|
|
|
|
|
|
.. _air-custom-preprocessors:
|
|
|
|
Implementing custom preprocessors
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
If you want to implement a custom preprocessor that needs to be fit, extend the
|
|
:class:`~ray.data.preprocessor.Preprocessor` base class.
|
|
|
|
.. literalinclude:: doc_code/preprocessors.py
|
|
:language: python
|
|
:start-after: __custom_stateful_start__
|
|
:end-before: __custom_stateful_end__
|
|
|
|
If your preprocessor doesn't need to be fit, construct a
|
|
:class:`~ray.data.preprocessors.BatchMapper`.
|
|
:class:`~ray.data.preprocessors.BatchMapper` can drop, add, or modify columns.
|
|
|
|
.. literalinclude:: doc_code/preprocessors.py
|
|
:language: python
|
|
:start-after: __custom_stateless_start__
|
|
:end-before: __custom_stateless_end__
|