ray/doc/source/data/examples/nyc_taxi_basic_processing.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ed5b22a4",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "# flake8: noqa\n",
    "import warnings\n",
    "import os\n",
    "\n",
    "# Suppress noisy requests warnings.\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "os.environ[\"PYTHONWARNINGS\"] = \"ignore\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "94ccdbcf",
   "metadata": {},
   "source": [
    "# Processing NYC taxi data using Ray Datasets\n",
    "\n",
    "The [NYC Taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) is a popular tabular dataset.  In this example, we demonstrate some basic data processing on this dataset using Ray Datasets.\n",
    "\n",
    "## Overview\n",
    "\n",
    "This tutorial will cover:\n",
    " - Reading Parquet data\n",
    " - Inspecting the metadata and first few rows of a large Ray {class}`Dataset <ray.data.Dataset>`\n",
    " - Calculating some common global and grouped statistics on the dataset\n",
    " - Dropping columns and rows\n",
    " - Adding a derived column\n",
    " - Shuffling the dataset\n",
    " - Sharding the dataset and feeding it to parallel consumers (trainers)\n",
    " - Applying batch (offline) inference to the data\n",
    "\n",
    "## Walkthrough\n",
    "\n",
    "Let's start by importing Ray and initializing a local Ray cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "366de039",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-05-18 18:37:54,818\tINFO services.py:1484 -- View the Ray dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8266\u001b[39m\u001b[22m\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "RayContext(dashboard_url='127.0.0.1:8266', python_version='3.7.13', ray_version='2.0.0.dev0', ray_commit='{{RAY_COMMIT_SHA}}', address_info={'node_ip_address': '172.31.46.244', 'raylet_ip_address': '172.31.46.244', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-05-18_18-37-50_553007_794791/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-05-18_18-37-50_553007_794791/sockets/raylet', 'webui_url': '127.0.0.1:8266', 'session_dir': '/tmp/ray/session_2022-05-18_18-37-50_553007_794791', 'metrics_export_port': 49419, 'gcs_address': '172.31.46.244:58837', 'address': '172.31.46.244:58837', 'node_id': '6ef10d33a5b9227b41e857b3a9488bcb958a092fef0538798a800e97'})"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Import ray and initialize a local Ray cluster.\n",
    "import ray\n",
    "ray.init()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "efb202d0",
   "metadata": {},
   "source": [
    "### Reading and Inspecting the Data\n",
    "\n",
    "Next, we read a few of the files from the dataset. This read is semi-lazy, where reading of the first file is eagerly executed, but reading of all other files is delayed until the underlying data is needed by downstream operations (e.g. consuming the data with {meth}`ds.take() <ray.data.Dataset.take>`, or transforming the data with {meth}`ds.map_batches() <ray.data.Dataset.map_batches>`).\n",
    "\n",
    "We could process the entire Dataset in a streaming fashion using {ref}`pipelining <dataset_pipeline_concept>` or all of it in parallel using a multi-node Ray cluster, but we'll save that for our large-scale examples. :)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "269da3de",
   "metadata": {
    "scrolled": false,
    "tags": [
     "remove-output"
    ]
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": []
    }
   ],
   "source": [
    "# Read two Parquet files in parallel.\n",
    "ds = ray.data.read_parquet([\n",
    "    \"s3://ursa-labs-taxi-data/2009/01/data.parquet\",\n",
    "    \"s3://ursa-labs-taxi-data/2009/02/data.parquet\",\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26eb4950",
   "metadata": {},
   "source": [
    "We can easily inspect the schema of this dataset. For Parquet files, we don't even have to read the actual data to get the schema; we can read it from the lightweight Parquet metadata!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "c6c2f47d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "vendor_id: string\n",
       "pickup_at: timestamp[us]\n",
       "dropoff_at: timestamp[us]\n",
       "passenger_count: int8\n",
       "trip_distance: float\n",
       "pickup_longitude: float\n",
       "pickup_latitude: float\n",
       "rate_code_id: null\n",
       "store_and_fwd_flag: string\n",
       "dropoff_longitude: float\n",
       "dropoff_latitude: float\n",
       "payment_type: string\n",
       "fare_amount: float\n",
       "extra: float\n",
       "mta_tax: float\n",
       "tip_amount: float\n",
       "tolls_amount: float\n",
       "total_amount: float\n",
       "-- schema metadata --\n",
       "pandas: '{\"index_columns\": [{\"kind\": \"range\", \"name\": null, \"start\": 0, \"' + 2527"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Fetch the schema from the underlying Parquet metadata.\n",
    "ds.schema()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fceebe4d",
   "metadata": {},
   "source": [
    "Parquet even stores the number of rows per file in the Parquet metadata, so we can get the number of rows in ``ds`` without triggering a full data read."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "5812dacf",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "27472535"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cad044a7",
   "metadata": {},
   "source": [
    "We can get a nice, cheap summary of the ``Dataset`` by leveraging it's informative repr:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "63894b9c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset(num_blocks=2, num_rows=27472535, schema={vendor_id: string, pickup_at: timestamp[us], dropoff_at: timestamp[us], passenger_count: int8, trip_distance: float, pickup_longitude: float, pickup_latitude: float, rate_code_id: null, store_and_fwd_flag: string, dropoff_longitude: float, dropoff_latitude: float, payment_type: string, fare_amount: float, extra: float, mta_tax: float, tip_amount: float, tolls_amount: float, total_amount: float})"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display some metadata about the dataset.\n",
    "ds"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7e0f9d4",
   "metadata": {},
   "source": [
    "We can also poke at the actual data, taking a peek at a single row. Since this is only returning a row from the first file, reading of the second file is **not** triggered yet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "6e653e63",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[ArrowRow({'vendor_id': 'VTS',\n",
       "           'pickup_at': datetime.datetime(2009, 1, 4, 2, 52),\n",
       "           'dropoff_at': datetime.datetime(2009, 1, 4, 3, 2),\n",
       "           'passenger_count': 1,\n",
       "           'trip_distance': 2.630000114440918,\n",
       "           'pickup_longitude': -73.99195861816406,\n",
       "           'pickup_latitude': 40.72156524658203,\n",
       "           'rate_code_id': None,\n",
       "           'store_and_fwd_flag': None,\n",
       "           'dropoff_longitude': -73.99380493164062,\n",
       "           'dropoff_latitude': 40.6959228515625,\n",
       "           'payment_type': 'CASH',\n",
       "           'fare_amount': 8.899999618530273,\n",
       "           'extra': 0.5,\n",
       "           'mta_tax': None,\n",
       "           'tip_amount': 0.0,\n",
       "           'tolls_amount': 0.0,\n",
       "           'total_amount': 9.399999618530273})]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds.take(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5792f41a",
   "metadata": {},
   "source": [
    "To get a better sense of the data size, we can calculate the size in bytes of the full dataset. Note that for Parquet files, this size-in-bytes will be pulled from the Parquet metadata (not triggering a data read) and will therefore be the on-disk size of the data; this might be significantly smaller than the in-memory size!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "7f0b8702",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "897130464"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds.size_bytes()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a94d596",
   "metadata": {},
   "source": [
    "In order to get the in-memory size, we can trigger full reading of the dataset and inspect the size in bytes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "f2d46bfa",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Read progress: 100%|██████████| 2/2 [00:04<00:00,  2.25s/it]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "2263031675"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds.fully_executed().size_bytes()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1b895fd",
   "metadata": {},
   "source": [
    "#### Advanced Aside - Reading Partitioned Parquet Datasets\n",
    "\n",
    "In addition to being able to read lists of individual files, {func}`ray.data.read_parquet() <ray.data.read_parquet>` (as well as other ``ray.data.read_*()`` APIs) can read directories containing multiple Parquet files. For Parquet in particular, reading Parquet datasets partitioned by a particular column is supported, allowing for path-based (zero-read) partition filtering and (optionally) including the partition column value specified in the file paths directly in the read table data.\n",
    "\n",
    "For the NYC taxi dataset, instead of reading individual per-month Parquet files, we can read the entire 2009 directory.\n",
    "\n",
    "```{warning}\n",
    "This will be a lot of data (~5.6 GB on disk, ~14 GB in memory), so be careful trigger full reads on a limited-memory machine! This is one place where Datasets' semi-lazy reading comes in handy: Datasets will only read one file eagerly, which allows us to inspect a subset of the data without having to read the entire dataset.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "9cc641b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read all Parquet data for the year 2009.\n",
    "year_ds = ray.data.read_parquet(\"s3://ursa-labs-taxi-data/2009\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "789b4a20",
   "metadata": {},
   "source": [
    "The metadata that Datasets prints in its repr is guaranteed to not trigger reads of all files; data such as the row count and the schema is pulled directly from the Parquet metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "6383b4a2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset(num_blocks=12, num_rows=170896055, schema={vendor_id: string, pickup_at: timestamp[us], dropoff_at: timestamp[us], passenger_count: int8, trip_distance: float, pickup_longitude: float, pickup_latitude: float, rate_code_id: null, store_and_fwd_flag: string, dropoff_longitude: float, dropoff_latitude: float, payment_type: string, fare_amount: float, extra: float, mta_tax: float, tip_amount: float, tolls_amount: float, total_amount: float})"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "year_ds"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42056adb",
   "metadata": {},
   "source": [
    "That's a lot of rows! Since we're not going to use this full-year dataset, let's now delete this dataset to free up some memory in our Ray cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "9703a5dd",
   "metadata": {},
   "outputs": [],
   "source": [
    "del year_ds"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3972f20a",
   "metadata": {},
   "source": [
    "### Data Exploration and Cleaning\n",
    "\n",
    "Let's calculate some stats to get a better picture of our data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "8a403bb4",
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Read: 100%|██████████| 2/2 [00:06<00:00,  3.13s/it]\n",
      "Shuffle Map: 100%|██████████| 2/2 [00:00<00:00,  6.27it/s]\n",
      "Shuffle Reduce: 100%|██████████| 1/1 [00:00<00:00, 63.47it/s]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "ArrowRow({'max(trip_distance)': 50.0,\n",
       "          'max(tip_amount)': 100.0,\n",
       "          'max(passenger_count)': 113})"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# What's the longets trip distance, largest tip amount, and most number of passengers?\n",
    "ds.max([\"trip_distance\", \"tip_amount\", \"passenger_count\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b72a3363",
   "metadata": {},
   "source": [
    "Whoa, there was a trip with 113 people in the taxi!? Let's check out these kind of many-passenger records by filtering to just these records using our {meth}`ds.map_batches() <ray.data.Dataset.map_batches>` batch mapping API.\n",
    "\n",
    ":::{note}\n",
    "Our filtering UDF receives a Pandas DataFrame, which is the default batch format for tabular data, and returns a Pandas DataFrame, which keeps the Dataset in a tabular format.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "fa539237",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Read->Map_Batches: 100%|██████████| 2/2 [00:15<00:00,  7.80s/it]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[PandasRow({'vendor_id': 'VTS',\n",
       "            'pickup_at': Timestamp('2009-01-22 11:47:00'),\n",
       "            'dropoff_at': Timestamp('2009-01-22 12:00:00'),\n",
       "            'passenger_count': 113,\n",
       "            'trip_distance': 0.0,\n",
       "            'pickup_longitude': 3555.912841796875,\n",
       "            'pickup_latitude': 935.5253295898438,\n",
       "            'rate_code_id': None,\n",
       "            'store_and_fwd_flag': None,\n",
       "            'dropoff_longitude': -74.01129913330078,\n",
       "            'dropoff_latitude': 1809.957763671875,\n",
       "            'payment_type': 'CASH',\n",
       "            'fare_amount': 13.300000190734863,\n",
       "            'extra': 0.0,\n",
       "            'mta_tax': nan,\n",
       "            'tip_amount': 0.0,\n",
       "            'tolls_amount': 0.0,\n",
       "            'total_amount': 13.300000190734863})]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Whoa, 113 passengers? I need to see this record and other ones with lots of passengers.\n",
    "ds.map_batches(lambda df: df[df[\"passenger_count\"] > 10]).take()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3bc0bad",
   "metadata": {},
   "source": [
    "That seems weird, probably bad data, or at least data points that I'm not interested in. We should filter these out!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "b9fb839a",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Read->Map_Batches: 100%|██████████| 2/2 [00:49<00:00, 24.63s/it]\n"
     ]
    }
   ],
   "source": [
    "# Filter out all records with over 10 passengers.\n",
    "ds = ds.map_batches(lambda df: df[df[\"passenger_count\"] <= 10])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9cabece7",
   "metadata": {},
   "source": [
    "We don't have any use for the ``store_and_fwd_flag`` or ``mta_tax`` columns, so let's drop those."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "67f9565b",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Map_Batches: 100%|██████████| 2/2 [00:47<00:00, 23.77s/it]\n"
     ]
    }
   ],
   "source": [
    "# Drop some columns.\n",
    "ds = ds.drop_columns([\"store_and_fwd_flag\", \"mta_tax\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b89eba4b",
   "metadata": {},
   "source": [
    "Let's say we want to know how many trips there are for each passenger count."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "9f2de4f5",
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Sort Sample: 100%|██████████| 2/2 [00:04<00:00,  2.15s/it]\n",
      "Shuffle Map: 100%|██████████| 2/2 [03:36<00:00, 108.13s/it]\n",
      "Shuffle Reduce: 100%|██████████| 2/2 [00:00<00:00, 112.32it/s]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[ArrowRow({'passenger_count': -127, 'count()': 2}),\n",
       " ArrowRow({'passenger_count': -48, 'count()': 45}),\n",
       " ArrowRow({'passenger_count': 0, 'count()': 794}),\n",
       " ArrowRow({'passenger_count': 1, 'count()': 18634337}),\n",
       " ArrowRow({'passenger_count': 2, 'count()': 4503747}),\n",
       " ArrowRow({'passenger_count': 3, 'count()': 1196381}),\n",
       " ArrowRow({'passenger_count': 4, 'count()': 559279}),\n",
       " ArrowRow({'passenger_count': 5, 'count()': 2452176}),\n",
       " ArrowRow({'passenger_count': 6, 'count()': 125773})]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds.groupby(\"passenger_count\").count().take()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5678a4fc",
   "metadata": {},
   "source": [
    "Again, it looks like there are some more nonsensical passenger counts, i.e. the negative ones. Let's filter those out too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "3a0a9567",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Map_Batches: 100%|██████████| 2/2 [00:47<00:00, 23.69s/it]\n"
     ]
    }
   ],
   "source": [
    "# Filter our records with negative passenger counts.\n",
    "ds = ds.map_batches(lambda df: df[df[\"passenger_count\"] > 0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2353b44",
   "metadata": {},
   "source": [
    "#### Advanced Aside - Projection and Filter Pushdown\n",
    "\n",
    "Note that Ray Datasets' Parquet reader supports projection (column selection) and row filter pushdown, where we can push the above column selection and the row-based filter to the Parquet read. If we specify column selection at Parquet read time, the unselected columns won't even be read from disk!\n",
    "\n",
    "The row-based filter is specified via\n",
    "[Arrow's dataset field expressions](https://arrow.apache.org/docs/6.0/python/generated/pyarrow.dataset.Expression.html#pyarrow.dataset.Expression). See the {ref}`feature guide for reading Parquet data <dataset_supported_file_formats>` for more information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "baa016b7",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Read progress: 100%|██████████| 2/2 [00:00<00:00,  2.76it/s]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Dataset(num_blocks=2, num_rows=27471693, schema={passenger_count: int8, trip_distance: float})"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Only read the passenger_count and trip_distance columns.\n",
    "import pyarrow as pa\n",
    "filter_expr = (\n",
    "    (pa.dataset.field(\"passenger_count\") <= 10)\n",
    "    & (pa.dataset.field(\"passenger_count\") > 0)\n",
    ")\n",
    "\n",
    "pushdown_ds = ray.data.read_parquet(\n",
    "    [\n",
    "        \"s3://ursa-labs-taxi-data/2009/01/data.parquet\",\n",
    "        \"s3://ursa-labs-taxi-data/2009/02/data.parquet\",\n",
    "    ],\n",
    "    columns=[\"passenger_count\", \"trip_distance\"],\n",
    "    filter=filter_expr,\n",
    ")\n",
    "\n",
    "# Force full execution of both of the file reads.\n",
    "pushdown_ds = pushdown_ds.fully_executed()\n",
    "pushdown_ds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "d1847d7d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Delete the pushdown dataset. Deleting the Dataset object\n",
    "# will release the underlying memory in the cluster.\n",
    "del pushdown_ds"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11df0946",
   "metadata": {},
   "source": [
    "Do the passenger counts influences the typical trip distance?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "245a8a97",
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Sort Sample: 100%|██████████| 2/2 [00:04<00:00,  2.24s/it]\n",
      "Shuffle Map: 100%|██████████| 2/2 [03:28<00:00, 104.24s/it]\n",
      "Shuffle Reduce: 100%|██████████| 2/2 [00:00<00:00, 123.35it/s]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[ArrowRow({'passenger_count': 1, 'mean(trip_distance)': 2.5442271984282017}),\n",
       " ArrowRow({'passenger_count': 2, 'mean(trip_distance)': 2.701997813992574}),\n",
       " ArrowRow({'passenger_count': 3, 'mean(trip_distance)': 2.624621515664268}),\n",
       " ArrowRow({'passenger_count': 4, 'mean(trip_distance)': 2.6351745332066048}),\n",
       " ArrowRow({'passenger_count': 5, 'mean(trip_distance)': 2.628660744359485}),\n",
       " ArrowRow({'passenger_count': 6, 'mean(trip_distance)': 2.5804354108726586})]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Mean trip distance grouped by passenger count.\n",
    "ds.groupby(\"passenger_count\").mean(\"trip_distance\").take()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5172b080",
   "metadata": {},
   "source": [
    "See the feature guides for {ref}`transforming data <transforming_datasets>` and {ref}`ML preprocessing <datasets-ml-preprocessing>` for more information on how we can process our data with Ray Datasets."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c2e28bf",
   "metadata": {},
   "source": [
    "### Ingesting into Model Trainers\n",
    "\n",
    "Now that we've learned more about our data and we have cleaned up our dataset a bit, we now look at how we can feed this dataset into some dummy model trainers.\n",
    "\n",
    "First, let's do a full global random shuffle of the dataset to decorrelate these samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "643acc6f",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Shuffle Map: 100%|██████████| 2/2 [00:21<00:00, 10.92s/it]\n",
      "Shuffle Reduce: 100%|██████████| 2/2 [00:35<00:00, 17.59s/it]\n"
     ]
    }
   ],
   "source": [
    "ds = ds.random_shuffle()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22c74175",
   "metadata": {},
   "source": [
    "We define a dummy ``Trainer`` actor, where each trainer will consume a dataset shard in batches and simulate model training.\n",
    "\n",
    ":::{note}\n",
    "In a real training workflow, we would feed ``ds`` to {ref}`Ray Train <train-docs>`, which would do this sharding and creation of training actors for us, under the hood.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "c192e4d7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Actor(Trainer, 6d81e32e1d1582f89ca75e3c01000000),\n",
       " Actor(Trainer, 84887785bc1a9d5b697728be01000000),\n",
       " Actor(Trainer, b57750338c40513819fe4d8301000000),\n",
       " Actor(Trainer, a393b1c25a8a1b42754959cf01000000)]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@ray.remote\n",
    "class Trainer:\n",
    "    def __init__(self, rank: int):\n",
    "        pass\n",
    "\n",
    "    def train(self, shard: ray.data.Dataset) -> int:\n",
    "        for batch in shard.iter_batches(batch_size=256):\n",
    "            pass\n",
    "        return shard.count()\n",
    "\n",
    "trainers = [Trainer.remote(i) for i in range(4)]\n",
    "trainers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b10fc64",
   "metadata": {},
   "source": [
    "Next, we split the dataset into ``len(trainers)`` shards, ensuring that the shards are of equal size."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "d175439a",
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Dataset(num_blocks=1, num_rows=6867923, schema={vendor_id: object, pickup_at: datetime64[ns], dropoff_at: datetime64[ns], passenger_count: int8, trip_distance: float32, pickup_longitude: float32, pickup_latitude: float32, rate_code_id: object, dropoff_longitude: float32, dropoff_latitude: float32, payment_type: object, fare_amount: float32, extra: float32, tip_amount: float32, tolls_amount: float32, total_amount: float32}),\n",
       " Dataset(num_blocks=1, num_rows=6867923, schema={vendor_id: object, pickup_at: datetime64[ns], dropoff_at: datetime64[ns], passenger_count: int8, trip_distance: float32, pickup_longitude: float32, pickup_latitude: float32, rate_code_id: object, dropoff_longitude: float32, dropoff_latitude: float32, payment_type: object, fare_amount: float32, extra: float32, tip_amount: float32, tolls_amount: float32, total_amount: float32}),\n",
       " Dataset(num_blocks=1, num_rows=6867923, schema={vendor_id: object, pickup_at: datetime64[ns], dropoff_at: datetime64[ns], passenger_count: int8, trip_distance: float32, pickup_longitude: float32, pickup_latitude: float32, rate_code_id: object, dropoff_longitude: float32, dropoff_latitude: float32, payment_type: object, fare_amount: float32, extra: float32, tip_amount: float32, tolls_amount: float32, total_amount: float32}),\n",
       " Dataset(num_blocks=1, num_rows=6867923, schema={vendor_id: object, pickup_at: datetime64[ns], dropoff_at: datetime64[ns], passenger_count: int8, trip_distance: float32, pickup_longitude: float32, pickup_latitude: float32, rate_code_id: object, dropoff_longitude: float32, dropoff_latitude: float32, payment_type: object, fare_amount: float32, extra: float32, tip_amount: float32, tolls_amount: float32, total_amount: float32})]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "shards = ds.split(n=len(trainers), equal=True)\n",
    "shards"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97c35aae",
   "metadata": {},
   "source": [
    "Finally, we simulate training, passing each shard to the corresponding trainer. The number of rows per shard is returned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "d60d0e0d",
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[6867923, 6867923, 6867923, 6867923]"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ray.get([w.train.remote(s) for w, s in zip(trainers, shards)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "b1ae3f38",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Delete trainer actor handle references, which should terminate the actors.\n",
    "del trainers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60c90def",
   "metadata": {},
   "source": [
    "### Parallel Batch Inference\n",
    "\n",
    "After we've trained a model, we may want to perform batch (offline) inference on such a tabular dataset. With Ray Datasets, this is as easy as a {meth}`ds.map_batches() <ray.data.Dataset.map_batches>` call!\n",
    "\n",
    "First, we define a callable class that will cache the loading of the model in its constructor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "de681909",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "def load_model():\n",
    "    # A dummy model.\n",
    "    def model(batch: pd.DataFrame) -> pd.DataFrame:\n",
    "        return pd.DataFrame({\"score\": batch[\"passenger_count\"] % 2 == 0})\n",
    "    \n",
    "    return model\n",
    "\n",
    "class BatchInferModel:\n",
    "    def __init__(self):\n",
    "        self.model = load_model()\n",
    "    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:\n",
    "        return self.model(batch)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c866e637",
   "metadata": {},
   "source": [
    "``BatchInferModel``'s constructor will only be called once per actor worker when using the actor pool compute strategy in {meth}`ds.map_batches() <ray.data.Dataset.map_batches>`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "04fac86d",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Map Progress (8 actors 2 pending):  50%|█████     | 1/2 [00:14<00:14, 14.75s/it]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Map Progress (8 actors 2 pending): 100%|██████████| 2/2 [00:28<00:00, 14.36s/it]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[PandasRow({'score': True}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': True}),\n",
       " PandasRow({'score': True}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': True}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False})]"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds.map_batches(BatchInferModel, batch_size=2048, compute=\"actors\").take()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01f3aa42",
   "metadata": {},
   "source": [
    "If wanting to perform batch inference on GPUs, simply specify the number of GPUs you wish to provision for each batch inference worker.\n",
    "\n",
    ":::{warning}\n",
    "This will only run successfully if your cluster has nodes with GPUs!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "0c7365b7",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Map Progress (8 actors 2 pending):   0%|          | 0/2 [00:06<?, ?it/s]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "Map Progress (8 actors 2 pending):  50%|█████     | 1/2 [01:31<01:31, 91.10s/it]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Map Progress (8 actors 2 pending): 100%|██████████| 2/2 [03:00<00:00, 90.33s/it]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[PandasRow({'score': True}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': True}),\n",
       " PandasRow({'score': True}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': True}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False})]"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds.map_batches(\n",
    "    BatchInferModel,\n",
    "    batch_size=256,\n",
    "    #num_gpus=1,  # Uncomment this to run this on GPUs!\n",
    "    compute=\"actors\",\n",
    ").take()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6806ab7",
   "metadata": {},
   "source": [
    "We can also configure the autoscaling actor pool that this inference stage uses, setting upper and lower bounds on the actor pool size, and even tweak the batch prefetching vs. inference task queueing tradeoff."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "c382fc0a",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Map Progress (8 actors 0 pending): 100%|██████████| 2/2 [02:56<00:00, 88.40s/it]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[PandasRow({'score': True}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': True}),\n",
       " PandasRow({'score': True}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': True}),\n",
       " PandasRow({'score': False}),\n",
       " PandasRow({'score': False})]"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from ray.data import ActorPoolStrategy\n",
    "\n",
    "# The actor pool will have at least 2 workers and at most 8 workers.\n",
    "strategy = ActorPoolStrategy(min_size=2, max_size=8)\n",
    "\n",
    "ds.map_batches(\n",
    "    BatchInferModel,\n",
    "    batch_size=256,\n",
    "    #num_gpus=1,  # Uncomment this to run this on GPUs!\n",
    "    compute=strategy,\n",
    ").take()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a0db5ba1",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}