.. _air-preprocessors:

Using preprocessors
===================

Data preprocessing is a common technique for transforming raw data into features for a machine learning model.
In general, you may want to apply the same preprocessing logic to your offline training data and online inference data.
Ray AIR provides several common preprocessors out of the box and interfaces to define your own custom logic.

Overview
--------

The most common way of using a preprocessor is by passing it as an argument to the constructor of a :ref:`Trainer ` in conjunction with a :ref:`Ray Dataset `.
For example, the following code trains a model with a preprocessor that normalizes the data.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __trainer_start__
    :end-before: __trainer_end__

The ``Preprocessor`` class has four public methods that can be used separately from a trainer:

#. ``fit()``: Compute state information about a :class:`Dataset ` (e.g., the mean or standard deviation of a column)
   and save it to the ``Preprocessor``. This information is used to perform ``transform()``, and the method is
   typically called on a training dataset.
#. ``transform()``: Apply a transformation to a ``Dataset``. If the ``Preprocessor`` is stateful, then ``fit()`` must
   be called first. This method is typically called on training, validation, and test datasets.
#. ``transform_batch()``: Apply a transformation to a single :class:`batch ` of data. This method is typically called
   on online or offline inference data.
#. ``fit_transform()``: Syntactic sugar for calling both ``fit()`` and ``transform()`` on a ``Dataset``.

To show these methods in action, let's walk through a basic example. First, we'll set up two simple Ray ``Dataset``\s.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __preprocessor_setup_start__
    :end-before: __preprocessor_setup_end__

Next, ``fit`` the ``Preprocessor`` on one ``Dataset``, and then ``transform`` both ``Dataset``\s with this fitted information.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __preprocessor_fit_transform_start__
    :end-before: __preprocessor_fit_transform_end__

Finally, call ``transform_batch`` on a single batch of data.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __preprocessor_transform_batch_start__
    :end-before: __preprocessor_transform_batch_end__
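If you'd like to try the four methods without the ``doc_code`` files, here's a minimal self-contained sketch using
:class:`~ray.data.preprocessors.MinMaxScaler`. The ``value`` column and the data are made up for illustration.

.. code-block:: python

    import pandas as pd
    import ray
    from ray.data.preprocessors import MinMaxScaler

    # Two small datasets: one to fit on, one to transform with the fitted state.
    train_ds = ray.data.from_items([{"value": i} for i in range(4)])    # 0-3
    test_ds = ray.data.from_items([{"value": i} for i in range(4, 8)])  # 4-7

    preprocessor = MinMaxScaler(columns=["value"])

    # fit() computes the min and max of "value" on the training dataset only.
    preprocessor.fit(train_ds)

    # transform() rescales both datasets using the state fitted above.
    train_ds = preprocessor.transform(train_ds)
    test_ds = preprocessor.transform(test_ds)

    # transform_batch() applies the same rescaling to an in-memory batch,
    # e.g., at inference time.
    batch = pd.DataFrame({"value": [0.5, 1.5]})
    print(preprocessor.transform_batch(batch))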
Life of an AIR preprocessor
---------------------------

Now that we've gone over the basics, let's dive into how ``Preprocessor``\s fit into an end-to-end application built with AIR.
The diagram below depicts an overview of the main steps of a ``Preprocessor``:

#. Passed into a ``Trainer`` to ``fit`` and ``transform`` input ``Dataset``\s
#. Saved as a ``Checkpoint``
#. Reconstructed in a ``Predictor`` to ``transform_batch`` on batches of data

.. figure:: images/air-preprocessor.svg

Throughout this section we'll go through this workflow in more detail, with code examples using XGBoost.
The same logic is applicable to other machine learning framework integrations as well.

Trainer
~~~~~~~

The journey of the ``Preprocessor`` starts with the :class:`Trainer `.
If the ``Trainer`` is instantiated with a ``Preprocessor``, then the following logic is executed when ``Trainer.fit()`` is called:

#. If a ``"train"`` ``Dataset`` is passed in, then the ``Preprocessor`` calls ``fit()`` on it.
#. The ``Preprocessor`` then calls ``transform()`` on all ``Dataset``\s, including the ``"train"`` ``Dataset``.
#. The ``Trainer`` then performs training on the preprocessed ``Dataset``\s.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __trainer_start__
    :end-before: __trainer_end__

.. note::

    If you're passing a ``Preprocessor`` that is already fitted, it is refitted on the ``"train"`` ``Dataset``.
    Adding the functionality to support passing in a fitted Preprocessor is being tracked
    `here `__.

.. TODO: Remove the note above once the issue is resolved.

Tune
~~~~

If you're using ``Ray Tune`` for hyperparameter optimization, be aware that each ``Trial`` instantiates its own copy
of the ``Preprocessor``, and the fitting and transforming logic occur once per ``Trial``.

Checkpoint
~~~~~~~~~~

``Trainer.fit()`` returns a ``Result`` object which contains a ``Checkpoint``.
If a ``Preprocessor`` is passed into the ``Trainer``, then it is saved in the ``Checkpoint`` along with any fitted state.

As a sanity check, let's confirm the ``Preprocessor`` is available in the ``Checkpoint``. In practice, you don't need to check.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __checkpoint_start__
    :end-before: __checkpoint_end__

Predictor
~~~~~~~~~

A ``Predictor`` can be constructed from a saved ``Checkpoint``. If the ``Checkpoint`` contains a ``Preprocessor``,
then the ``Preprocessor`` calls ``transform_batch`` on input batches prior to performing inference.

In the following example, we show the Batch Predictor flow. The same logic applies to the :ref:`Online Inference flow `.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __predictor_start__
    :end-before: __predictor_end__

Types of preprocessors
----------------------

Built-in preprocessors
~~~~~~~~~~~~~~~~~~~~~~

Ray AIR provides a handful of preprocessors out of the box.

**Generic preprocessors**

.. autosummary::
    :nosignatures:

    ray.data.preprocessors.BatchMapper
    ray.data.preprocessors.Chain
    ray.data.preprocessors.Concatenator
    ray.data.preprocessor.Preprocessor
    ray.data.preprocessors.SimpleImputer

**Categorical encoders**

.. autosummary::
    :nosignatures:

    ray.data.preprocessors.Categorizer
    ray.data.preprocessors.LabelEncoder
    ray.data.preprocessors.MultiHotEncoder
    ray.data.preprocessors.OneHotEncoder
    ray.data.preprocessors.OrdinalEncoder

**Feature scalers**

.. autosummary::
    :nosignatures:

    ray.data.preprocessors.MaxAbsScaler
    ray.data.preprocessors.MinMaxScaler
    ray.data.preprocessors.Normalizer
    ray.data.preprocessors.PowerTransformer
    ray.data.preprocessors.RobustScaler
    ray.data.preprocessors.StandardScaler

**Text encoders**

.. autosummary::
    :nosignatures:

    ray.data.preprocessors.CountVectorizer
    ray.data.preprocessors.HashingVectorizer
    ray.data.preprocessors.Tokenizer
    ray.data.preprocessors.FeatureHasher

**Utilities**

.. autosummary::
    :nosignatures:

    ray.data.Dataset.train_test_split

Which preprocessor should you use?
----------------------------------

The type of preprocessor you use depends on what your data looks like. This section provides tips on handling common data formats.

General-purpose preprocessors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are many general-purpose preprocessors you can employ to transform your data. For example, you can chain
preprocessors, fill in missing values, or implement custom preprocessors.

Chaining preprocessors
^^^^^^^^^^^^^^^^^^^^^^

If you need to apply more than one preprocessor, compose them together with :class:`~ray.data.preprocessors.Chain`.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __chain_start__
    :end-before: __chain_end__

.. tip::

    :class:`~ray.data.preprocessors.Chain` applies ``fit`` and ``transform`` sequentially. For example, if you
    construct ``Chain(preprocessorA, preprocessorB)``, then ``preprocessorB.transform`` is applied to the result of
    ``preprocessorA.transform``.

Filling in missing values
^^^^^^^^^^^^^^^^^^^^^^^^^

If your dataset contains missing values, replace them with :class:`~ray.data.preprocessors.SimpleImputer`.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __simple_imputer_start__
    :end-before: __simple_imputer_end__

.. _air-custom-preprocessors:

Implementing custom preprocessors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you want to implement a custom preprocessor that needs to be fit, extend the
:class:`~ray.data.preprocessor.Preprocessor` base class.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __custom_stateful_start__
    :end-before: __custom_stateful_end__

If your preprocessor doesn't need to be fit, construct a :class:`~ray.data.preprocessors.BatchMapper`.
:class:`~ray.data.preprocessors.BatchMapper` can drop, add, or modify columns.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __custom_stateless_start__
    :end-before: __custom_stateless_end__
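To make the extension point concrete, here's a minimal sketch of a stateful custom preprocessor, assuming the
``_fit`` and ``_transform_pandas`` hooks exposed by the base class. The class name, column logic, and data are made
up for illustration.

.. code-block:: python

    import pandas as pd
    import ray
    from ray.data import Dataset
    from ray.data.preprocessor import Preprocessor


    class MeanCenterer(Preprocessor):  # Hypothetical example class.
        """Subtract each column's mean, computed during ``fit``."""

        def __init__(self, columns: list):
            self.columns = columns

        def _fit(self, dataset: Dataset) -> Preprocessor:
            # Compute and store the fitted state. By convention, fitted
            # attributes end with an underscore.
            self.stats_ = {col: dataset.mean(col) for col in self.columns}
            return self

        def _transform_pandas(self, df: pd.DataFrame) -> pd.DataFrame:
            for col in self.columns:
                df[col] = df[col] - self.stats_[col]
            return df


    ds = ray.data.from_items([{"value": i} for i in range(3)])
    centerer = MeanCenterer(columns=["value"])
    print(centerer.fit_transform(ds).take())  # "value" is now mean-centered.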
Categorical data
~~~~~~~~~~~~~~~~

Most models expect numerical inputs. To represent your categorical data in a way your model can understand, encode
categories using one of the preprocessors described below.

.. list-table::
    :header-rows: 1

    * - Categorical Data Type
      - Example
      - Preprocessor
    * - Labels
      - ``"cat"``, ``"dog"``, ``"airplane"``
      - :class:`~ray.data.preprocessors.LabelEncoder`
    * - Ordered categories
      - ``"bs"``, ``"md"``, ``"phd"``
      - :class:`~ray.data.preprocessors.OrdinalEncoder`
    * - Unordered categories
      - ``"red"``, ``"green"``, ``"blue"``
      - :class:`~ray.data.preprocessors.OneHotEncoder`
    * - Lists of categories
      - ``("sci-fi", "action")``, ``("action", "comedy", "animated")``
      - :class:`~ray.data.preprocessors.MultiHotEncoder`

.. note::

    If you're using LightGBM, you don't need to encode your categorical data. Instead, use
    :class:`~ray.data.preprocessors.Categorizer` to convert your data to `pandas.CategoricalDtype`.
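For example, one-hot encoding the unordered categories above might look like the following sketch. The ``color``
column and its values are made up for illustration.

.. code-block:: python

    import ray
    from ray.data.preprocessors import OneHotEncoder

    ds = ray.data.from_items(
        [{"color": "red"}, {"color": "green"}, {"color": "blue"}]
    )

    # fit() learns the set of category values; transform() replaces the
    # column with one indicator column per observed category.
    encoder = OneHotEncoder(columns=["color"])
    print(encoder.fit_transform(ds).take())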
Numerical data
~~~~~~~~~~~~~~

To ensure your model behaves properly, normalize your numerical data. Reference the table below to determine which
preprocessor to use.

.. list-table::
    :header-rows: 1

    * - Data Property
      - Preprocessor
    * - Your data is approximately normal
      - :class:`~ray.data.preprocessors.StandardScaler`
    * - Your data is sparse
      - :class:`~ray.data.preprocessors.MaxAbsScaler`
    * - Your data contains many outliers
      - :class:`~ray.data.preprocessors.RobustScaler`
    * - Your data isn't normal, but you need it to be
      - :class:`~ray.data.preprocessors.PowerTransformer`
    * - You need unit-norm rows
      - :class:`~ray.data.preprocessors.Normalizer`
    * - You aren't sure what your data looks like
      - :class:`~ray.data.preprocessors.MinMaxScaler`

.. warning::

    These preprocessors operate on numeric columns. If your dataset contains columns of type
    :class:`~ray.air.util.tensor_extensions.pandas.TensorDtype`, you may need to
    :ref:`implement a custom preprocessor <air-custom-preprocessors>`. Additionally, if your model expects a tensor
    or ``ndarray``, create a tensor using :class:`~ray.data.preprocessors.Concatenator`.

.. tip::

    Built-in feature scalers like :class:`~ray.data.preprocessors.StandardScaler` don't work on
    :class:`~ray.air.util.tensor_extensions.pandas.TensorDtype` columns, so apply
    :class:`~ray.data.preprocessors.Concatenator` after feature scaling. Combine feature scaling and concatenation
    into a single preprocessor with :class:`~ray.data.preprocessors.Chain`.

.. literalinclude:: doc_code/preprocessors.py
    :language: python
    :start-after: __concatenate_start__
    :end-before: __concatenate_end__

Text data
~~~~~~~~~

A `document-term matrix <https://en.wikipedia.org/wiki/Document-term_matrix>`_ is a table that describes text data.
It's useful for natural language processing.

To generate a document-term matrix from a collection of documents, use :class:`~ray.data.preprocessors.HashingVectorizer`
or :class:`~ray.data.preprocessors.CountVectorizer`. If you already know the frequency of tokens and want to store the
data in a document-term matrix, use :class:`~ray.data.preprocessors.FeatureHasher`.

.. list-table::
    :header-rows: 1

    * - Requirement
      - Preprocessor
    * - You care about memory efficiency
      - :class:`~ray.data.preprocessors.HashingVectorizer`
    * - You care about model interpretability
      - :class:`~ray.data.preprocessors.CountVectorizer`
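As a rough illustration, generating a document-term matrix with :class:`~ray.data.preprocessors.CountVectorizer`
could look like the following sketch. The ``text`` column and documents are made up, and default tokenization is
assumed.

.. code-block:: python

    import ray
    from ray.data.preprocessors import CountVectorizer

    ds = ray.data.from_items(
        [{"text": "the quick brown fox"}, {"text": "the lazy dog"}]
    )

    # fit() learns the vocabulary; transform() produces one count
    # column per token in the fitted vocabulary.
    vectorizer = CountVectorizer(columns=["text"])
    print(vectorizer.fit_transform(ds).take())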