.. _creating_datasets:

=================
Creating Datasets
=================

You can get started by creating Datasets from synthetic data using
``ray.data.range()`` and ``ray.data.from_items()``. Datasets can hold either
plain Python objects (in which case their schema is a Python type) or Arrow
records (in which case their schema is an Arrow schema).

.. code-block:: python

    import ray

    # Create a Dataset of Python objects.
    ds = ray.data.range(10000)
    # -> Dataset(num_blocks=200, num_rows=10000, schema=<class 'int'>)

    ds.take(5)
    # -> [0, 1, 2, 3, 4]

    ds.count()
    # -> 10000

    # Create a Dataset of Arrow records.
    ds = ray.data.from_items([{"col1": i, "col2": str(i)} for i in range(10000)])
    # -> Dataset(num_blocks=200, num_rows=10000, schema={col1: int64, col2: string})

    ds.show(5)
    # -> {'col1': 0, 'col2': '0'}
    # -> {'col1': 1, 'col2': '1'}
    # -> {'col1': 2, 'col2': '2'}
    # -> {'col1': 3, 'col2': '3'}
    # -> {'col1': 4, 'col2': '4'}

    ds.schema()
    # -> col1: int64
    # -> col2: string

Datasets can also be created from files on local disk or from remote
datasources such as S3. Any filesystem
`supported by pyarrow <https://arrow.apache.org/docs/python/filesystems.html>`__
can be used to specify file locations; you can also pass an explicit pyarrow
filesystem object to the read APIs (see the sketch at the end of this page):

.. code-block:: python

    # Read a directory of files in remote storage.
    ds = ray.data.read_csv("s3://bucket/path")

    # Read multiple local files.
    ds = ray.data.read_csv(["/path/to/file1", "/path/to/file2"])

    # Read multiple directories.
    ds = ray.data.read_csv(["s3://bucket/path1", "s3://bucket/path2"])

Finally, you can create a ``Dataset`` from existing data in the Ray object
store or from Ray-compatible distributed DataFrames:

.. code-block:: python

    import pandas as pd
    import dask.dataframe as dd

    # Create a Dataset from a list of Pandas DataFrame objects.
    pdf = pd.DataFrame({"one": [1, 2, 3], "two": ["a", "b", "c"]})
    ds = ray.data.from_pandas([pdf])

    # Create a Dataset from a Dask-on-Ray DataFrame.
    dask_df = dd.from_pandas(pdf, npartitions=10)
    ds = ray.data.from_dask(dask_df)

From Torch/TensorFlow
---------------------

.. tabbed:: PyTorch

    If you already have a Torch dataset available, you can create a Ray Dataset
    using :py:class:`~ray.data.datasource.SimpleTorchDatasource`.

    .. warning::
        :py:class:`~ray.data.datasource.SimpleTorchDatasource` doesn't support
        parallel reads. You should only use this datasource for small datasets
        like MNIST or CIFAR.

    .. code-block:: python

        import ray.data
        from ray.data.datasource import SimpleTorchDatasource
        import torchvision

        dataset_factory = lambda: torchvision.datasets.MNIST("data", download=True)

        dataset = ray.data.read_datasource(
            SimpleTorchDatasource(),
            parallelism=1,
            dataset_factory=dataset_factory,
        )

        dataset.take(1)
        # (<PIL.Image.Image image mode=L size=28x28 at 0x...>, 5)

.. tabbed:: TensorFlow

    If you already have a TensorFlow dataset available, you can create a Ray
    Dataset using :py:class:`SimpleTensorFlowDatasource`.

    .. warning::
        :py:class:`SimpleTensorFlowDatasource` doesn't support parallel reads.
        You should only use this datasource for small datasets like MNIST or
        CIFAR.

    .. code-block:: python

        import ray.data
        from ray.data.datasource import SimpleTensorFlowDatasource
        import tensorflow_datasets as tfds

        def dataset_factory():
            return tfds.load("cifar10", split=["train"], as_supervised=True)[0]

        dataset = ray.data.read_datasource(
            SimpleTensorFlowDatasource(),
            parallelism=1,
            dataset_factory=dataset_factory,
        )

        features, label = dataset.take(1)[0]
        features.shape  # TensorShape([32, 32, 3])
        label  # <tf.Tensor: shape=(), dtype=int64, numpy=7>
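
As noted above, the file-based read APIs accept an explicit pyarrow
``filesystem`` argument in addition to URL-style paths. Below is a minimal
sketch of reading CSV files through a manually constructed
``pyarrow.fs.S3FileSystem``; the bucket path, region, and ``anonymous=True``
flag are placeholder assumptions for illustration.

.. code-block:: python

    import pyarrow.fs

    # Construct a pyarrow filesystem explicitly. This sketch assumes a public
    # bucket, so anonymous (unsigned) access is enabled; for a private bucket
    # you would configure credentials instead.
    fs = pyarrow.fs.S3FileSystem(region="us-west-2", anonymous=True)

    # When an explicit filesystem is given, paths are specified without the
    # "s3://" scheme prefix.
    ds = ray.data.read_csv("bucket/path", filesystem=fs)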