Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Matthew Deng <matt@anyscale.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
This adds the following options to DatasetConfig, which can be used to enable streaming ingest.
```
# Whether the dataset should be streamed into memory using pipelined reads.
# When enabled, get_dataset_shard() returns DatasetPipeline instead of Dataset.
# The amount of memory to use is controlled by `stream_window_size`.
# False by default for all datasets.
use_stream_api: Optional[bool] = None
# Configure the streaming window size in bytes. A typical value is something like
# 20% of object store memory. If set to -1, then an infinite window size will be
# used (similar to bulk ingest). This only has an effect if use_stream_api is set.
# Set to 1.0 GiB by default.
stream_window_size: Optional[float] = None
# Whether to enable global shuffle (per pipeline window in streaming mode). Note
# that this is an expensive all-to-all operation, and most likely you want to use
# local shuffle instead.
# False by default for all datasets.
global_shuffle: Optional[bool] = None
```
This adds a per-dataset config object to DataParallelTrainer. These configs define how the Dataset should be read into the DataParallelTrainer. It configures the preprocessing, splitting, and ingest strategy per-dataset. DataParallelTrainers declare default DatasetConfigs for each dataset passed in the ``datasets`` argument. Users have the opportunity to selectively override these configs by passing the ``dataset_config`` argument. Trainers can also define user customizable values (e.g., XGBoostTrainer doesn't support streaming ingest).
This PR adds the minimal support for dataset configs. Future PRs will:
- Add support for streaming ingest
- Move this config from DataParallelTrainer to ml.Trainer
The package "ml" should be renamed to "air".
Main question: Keep a `ml.py` with `from ray.air import *` for some level of backwards compatibility?
I'd go for no to force people to use the new structure.