{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Integrate Ray AIR with Feast feature store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install feast==0.20.1 ray[air]>=1.13 xgboost_ray"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "INyNIaeB1Kza"
   },
   "source": [
    "In this example, we showcase how to use Ray AIR with Feast feature store, leveraging both historical features for training a model and online features for inference.\n",
    "\n",
    "The task is adapted from [Feast credit scoring tutorial](https://github.com/feast-dev/feast-aws-credit-scoring-tutorial). In this example, we train a xgboost model and run some prediction on an incoming loan request to see if it is approved or rejected. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sBC9CCrpzQLF"
   },
   "source": [
    "Let's first set up our workspace and prepare the data to work with."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "id": "DcPIskZlzSal"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "WORKING_DIR = os.path.expanduser(\"~/ray-air-feast-example/\")\n",
    "%env WORKING_DIR=$WORKING_DIR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "BcyCKjV3zTCK",
    "outputId": "afdfa24d-e5ce-49db-c904-e961e1eb910c"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2022-06-02 14:22:50--  https://github.com/ray-project/air-sample-data/raw/main/air-feast-example.zip\n",
      "Resolving github.com (github.com)... 192.30.255.113\n",
      "Connecting to github.com (github.com)|192.30.255.113|:443... connected.\n",
      "HTTP request sent, awaiting response... 302 Found\n",
      "Location: https://raw.githubusercontent.com/ray-project/air-sample-data/main/air-feast-example.zip [following]\n",
      "--2022-06-02 14:22:50--  https://raw.githubusercontent.com/ray-project/air-sample-data/main/air-feast-example.zip\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 23715107 (23M) [application/zip]\n",
      "Saving to: ‘air-feast-example.zip’\n",
      "\n",
      "air-feast-example.z 100%[===================>]  22.62M   114MB/s    in 0.2s    \n",
      "\n",
      "2022-06-02 14:22:51 (114 MB/s) - ‘air-feast-example.zip’ saved [23715107/23715107]\n",
      "\n",
      "Archive:  air-feast-example.zip\n",
      "   creating: air-feast-example/\n",
      "   creating: air-feast-example/feature_repo/\n",
      "  inflating: air-feast-example/feature_repo/.DS_Store  \n",
      " extracting: air-feast-example/feature_repo/__init__.py  \n",
      "  inflating: air-feast-example/feature_repo/features.py  \n",
      "   creating: air-feast-example/feature_repo/data/\n",
      "  inflating: air-feast-example/feature_repo/data/.DS_Store  \n",
      "  inflating: air-feast-example/feature_repo/data/credit_history_sample.csv  \n",
      "  inflating: air-feast-example/feature_repo/data/zipcode_table_sample.csv  \n",
      "  inflating: air-feast-example/feature_repo/data/credit_history.parquet  \n",
      "  inflating: air-feast-example/feature_repo/data/zipcode_table.parquet  \n",
      "  inflating: air-feast-example/feature_repo/feature_store.yaml  \n",
      "  inflating: air-feast-example/.DS_Store  \n",
      "   creating: air-feast-example/data/\n",
      "  inflating: air-feast-example/data/loan_table.parquet  \n",
      "  inflating: air-feast-example/data/loan_table_sample.csv  \n"
     ]
    }
   ],
   "source": [
    "! mkdir -p $WORKING_DIR\n",
    "! wget --no-check-certificate https://github.com/ray-project/air-sample-data/raw/main/air-feast-example.zip\n",
    "! unzip air-feast-example.zip \n",
    "! mv air-feast-example/* $WORKING_DIR\n",
    "%cd $WORKING_DIR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "iNbC-Qqi3Lq_",
    "outputId": "99576086-12dd-4f96-fb51-de40b77b15ce"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "data  feature_repo\n"
     ]
    }
   ],
   "source": [
    "! ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "c_3wlEus4dYO"
   },
   "source": [
    "There is already a feature repository set up in `feature_repo/`. It isn't necessary to create a new feature repository, but it can be done using the following command: `feast init -t local feature_repo`.\n",
    "\n",
    "Now let's take a look at the schema in Feast feature store, which is defined by `feature_repo/features.py`. There are mainly two features: zipcode_feature and credit_history, both are generated from parquet files - `feature_repo/data/zipcode_table.parquet` and `feature_repo/data/credit_history.parquet`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "5VGLhPLLzlGW",
    "outputId": "a3f3499e-c140-4ceb-a66d-2f1a6b8a2142"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36mdatetime\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m timedelta\n",
      "\n",
      "\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36mfeast\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m (Entity, Field, FeatureView, FileSource, ValueType)\n",
      "\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36mfeast\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mtypes\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m Float32, Int64, String\n",
      "\n",
      "\n",
      "zipcode = Entity(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mzipcode\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, value_type=Int64)\n",
      "\n",
      "zipcode_source = FileSource(\n",
      "    path=\u001b[33m\"\u001b[39;49;00m\u001b[33mfeature_repo/data/zipcode_table.parquet\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      "    timestamp_field=\u001b[33m\"\u001b[39;49;00m\u001b[33mevent_timestamp\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      "    created_timestamp_column=\u001b[33m\"\u001b[39;49;00m\u001b[33mcreated_timestamp\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      ")\n",
      "\n",
      "zipcode_features = FeatureView(\n",
      "    name=\u001b[33m\"\u001b[39;49;00m\u001b[33mzipcode_features\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      "    entities=[\u001b[33m\"\u001b[39;49;00m\u001b[33mzipcode\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m],\n",
      "    ttl=timedelta(days=\u001b[34m3650\u001b[39;49;00m),\n",
      "    schema=[\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mcity\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=String),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mstate\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=String),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mlocation_type\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=String),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mtax_returns_filed\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mpopulation\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mtotal_wages\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "    ],\n",
      "    source=zipcode_source,\n",
      ")\n",
      "\n",
      "dob_ssn = Entity(\n",
      "    name=\u001b[33m\"\u001b[39;49;00m\u001b[33mdob_ssn\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      "    value_type=ValueType.STRING,\n",
      "    description=\u001b[33m\"\u001b[39;49;00m\u001b[33mDate of birth and last four digits of social security number\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      ")\n",
      "\n",
      "credit_history_source = FileSource(\n",
      "    path=\u001b[33m\"\u001b[39;49;00m\u001b[33mfeature_repo/data/credit_history.parquet\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      "    timestamp_field=\u001b[33m\"\u001b[39;49;00m\u001b[33mevent_timestamp\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      "    created_timestamp_column=\u001b[33m\"\u001b[39;49;00m\u001b[33mcreated_timestamp\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      ")\n",
      "\n",
      "credit_history = FeatureView(\n",
      "    name=\u001b[33m\"\u001b[39;49;00m\u001b[33mcredit_history\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      "    entities=[\u001b[33m\"\u001b[39;49;00m\u001b[33mdob_ssn\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m],\n",
      "    ttl=timedelta(days=\u001b[34m90\u001b[39;49;00m),\n",
      "    schema=[\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mcredit_card_due\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mmortgage_due\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mstudent_loan_due\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mvehicle_loan_due\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mhard_pulls\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mmissed_payments_2y\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mmissed_payments_1y\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mmissed_payments_6m\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mbankruptcies\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "    ],\n",
      "    source=credit_history_source,\n",
      ")\n"
     ]
    }
   ],
   "source": [
    "!pygmentize feature_repo/features.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HQmrfEV33_SM"
   },
   "source": [
    "Deploy the above defined feature store by running `apply` from within the feature_repo/ folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "SbL_EbMC2MFS",
    "outputId": "13b07f1f-d52a-4c4e-a73f-f5478c0304de"
   },
   "outputs": [],
   "source": [
    "! cd feature_repo && feast apply"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "D-9Kr-kdzg1r",
    "outputId": "beabd2f7-c1a6-4fe3-b087-b35b7b9ab56b"
   },
   "outputs": [],
   "source": [
    "import feast\n",
    "fs = feast.FeatureStore(repo_path=\"feature_repo\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5nR8uNE8z-YQ"
   },
   "source": [
    "## Generate training data\n",
    "On top of the features in Feast, we also have labeled training data at `data/loan_table.parquet`. At the time of training, loan table will be passed into Feast as an entity dataframe for training data generation. Feast will intelligently join credit_history and zipcode_feature tables to create relevant feature vectors to augment the training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 424
    },
    "id": "twBCJMzVzV0X",
    "outputId": "efb41c7a-2802-4169-906e-7db1c37d8c8e"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>loan_id</th>\n",
       "      <th>dob_ssn</th>\n",
       "      <th>zipcode</th>\n",
       "      <th>person_age</th>\n",
       "      <th>person_income</th>\n",
       "      <th>person_home_ownership</th>\n",
       "      <th>person_emp_length</th>\n",
       "      <th>loan_intent</th>\n",
       "      <th>loan_amnt</th>\n",
       "      <th>loan_int_rate</th>\n",
       "      <th>loan_status</th>\n",
       "      <th>event_timestamp</th>\n",
       "      <th>created_timestamp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10000</td>\n",
       "      <td>19530219_5179</td>\n",
       "      <td>76104</td>\n",
       "      <td>22</td>\n",
       "      <td>59000</td>\n",
       "      <td>RENT</td>\n",
       "      <td>123.0</td>\n",
       "      <td>PERSONAL</td>\n",
       "      <td>35000</td>\n",
       "      <td>16.02</td>\n",
       "      <td>1</td>\n",
       "      <td>2021-08-25 20:34:41.361000+00:00</td>\n",
       "      <td>2021-08-25 20:34:41.361000+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>10001</td>\n",
       "      <td>19520816_8737</td>\n",
       "      <td>70380</td>\n",
       "      <td>21</td>\n",
       "      <td>9600</td>\n",
       "      <td>OWN</td>\n",
       "      <td>5.0</td>\n",
       "      <td>EDUCATION</td>\n",
       "      <td>1000</td>\n",
       "      <td>11.14</td>\n",
       "      <td>0</td>\n",
       "      <td>2021-08-25 20:16:20.128000+00:00</td>\n",
       "      <td>2021-08-25 20:16:20.128000+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>10002</td>\n",
       "      <td>19860413_2537</td>\n",
       "      <td>97039</td>\n",
       "      <td>25</td>\n",
       "      <td>9600</td>\n",
       "      <td>MORTGAGE</td>\n",
       "      <td>1.0</td>\n",
       "      <td>MEDICAL</td>\n",
       "      <td>5500</td>\n",
       "      <td>12.87</td>\n",
       "      <td>1</td>\n",
       "      <td>2021-08-25 19:57:58.896000+00:00</td>\n",
       "      <td>2021-08-25 19:57:58.896000+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>10003</td>\n",
       "      <td>19760701_8090</td>\n",
       "      <td>63785</td>\n",
       "      <td>23</td>\n",
       "      <td>65500</td>\n",
       "      <td>RENT</td>\n",
       "      <td>4.0</td>\n",
       "      <td>MEDICAL</td>\n",
       "      <td>35000</td>\n",
       "      <td>15.23</td>\n",
       "      <td>1</td>\n",
       "      <td>2021-08-25 19:39:37.663000+00:00</td>\n",
       "      <td>2021-08-25 19:39:37.663000+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>10004</td>\n",
       "      <td>19830125_8297</td>\n",
       "      <td>82223</td>\n",
       "      <td>24</td>\n",
       "      <td>54400</td>\n",
       "      <td>RENT</td>\n",
       "      <td>8.0</td>\n",
       "      <td>MEDICAL</td>\n",
       "      <td>35000</td>\n",
       "      <td>14.27</td>\n",
       "      <td>1</td>\n",
       "      <td>2021-08-25 19:21:16.430000+00:00</td>\n",
       "      <td>2021-08-25 19:21:16.430000+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28633</th>\n",
       "      <td>38633</td>\n",
       "      <td>19491126_1487</td>\n",
       "      <td>43205</td>\n",
       "      <td>57</td>\n",
       "      <td>53000</td>\n",
       "      <td>MORTGAGE</td>\n",
       "      <td>1.0</td>\n",
       "      <td>PERSONAL</td>\n",
       "      <td>5800</td>\n",
       "      <td>13.16</td>\n",
       "      <td>0</td>\n",
       "      <td>2020-08-25 21:48:06.292000+00:00</td>\n",
       "      <td>2020-08-25 21:48:06.292000+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28634</th>\n",
       "      <td>38634</td>\n",
       "      <td>19681208_6537</td>\n",
       "      <td>24872</td>\n",
       "      <td>54</td>\n",
       "      <td>120000</td>\n",
       "      <td>MORTGAGE</td>\n",
       "      <td>4.0</td>\n",
       "      <td>PERSONAL</td>\n",
       "      <td>17625</td>\n",
       "      <td>7.49</td>\n",
       "      <td>0</td>\n",
       "      <td>2020-08-25 21:29:45.059000+00:00</td>\n",
       "      <td>2020-08-25 21:29:45.059000+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28635</th>\n",
       "      <td>38635</td>\n",
       "      <td>19880422_2592</td>\n",
       "      <td>68826</td>\n",
       "      <td>65</td>\n",
       "      <td>76000</td>\n",
       "      <td>RENT</td>\n",
       "      <td>3.0</td>\n",
       "      <td>HOMEIMPROVEMENT</td>\n",
       "      <td>35000</td>\n",
       "      <td>10.99</td>\n",
       "      <td>1</td>\n",
       "      <td>2020-08-25 21:11:23.826000+00:00</td>\n",
       "      <td>2020-08-25 21:11:23.826000+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28636</th>\n",
       "      <td>38636</td>\n",
       "      <td>19901017_6108</td>\n",
       "      <td>92014</td>\n",
       "      <td>56</td>\n",
       "      <td>150000</td>\n",
       "      <td>MORTGAGE</td>\n",
       "      <td>5.0</td>\n",
       "      <td>PERSONAL</td>\n",
       "      <td>15000</td>\n",
       "      <td>11.48</td>\n",
       "      <td>0</td>\n",
       "      <td>2020-08-25 20:53:02.594000+00:00</td>\n",
       "      <td>2020-08-25 20:53:02.594000+00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28637</th>\n",
       "      <td>38637</td>\n",
       "      <td>19960703_3449</td>\n",
       "      <td>69033</td>\n",
       "      <td>66</td>\n",
       "      <td>42000</td>\n",
       "      <td>RENT</td>\n",
       "      <td>2.0</td>\n",
       "      <td>MEDICAL</td>\n",
       "      <td>6475</td>\n",
       "      <td>9.99</td>\n",
       "      <td>0</td>\n",
       "      <td>2020-08-25 20:34:41.361000+00:00</td>\n",
       "      <td>2020-08-25 20:34:41.361000+00:00</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>28638 rows × 13 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       loan_id        dob_ssn  zipcode  person_age  person_income  \\\n",
       "0        10000  19530219_5179    76104          22          59000   \n",
       "1        10001  19520816_8737    70380          21           9600   \n",
       "2        10002  19860413_2537    97039          25           9600   \n",
       "3        10003  19760701_8090    63785          23          65500   \n",
       "4        10004  19830125_8297    82223          24          54400   \n",
       "...        ...            ...      ...         ...            ...   \n",
       "28633    38633  19491126_1487    43205          57          53000   \n",
       "28634    38634  19681208_6537    24872          54         120000   \n",
       "28635    38635  19880422_2592    68826          65          76000   \n",
       "28636    38636  19901017_6108    92014          56         150000   \n",
       "28637    38637  19960703_3449    69033          66          42000   \n",
       "\n",
       "      person_home_ownership  person_emp_length      loan_intent  loan_amnt  \\\n",
       "0                      RENT              123.0         PERSONAL      35000   \n",
       "1                       OWN                5.0        EDUCATION       1000   \n",
       "2                  MORTGAGE                1.0          MEDICAL       5500   \n",
       "3                      RENT                4.0          MEDICAL      35000   \n",
       "4                      RENT                8.0          MEDICAL      35000   \n",
       "...                     ...                ...              ...        ...   \n",
       "28633              MORTGAGE                1.0         PERSONAL       5800   \n",
       "28634              MORTGAGE                4.0         PERSONAL      17625   \n",
       "28635                  RENT                3.0  HOMEIMPROVEMENT      35000   \n",
       "28636              MORTGAGE                5.0         PERSONAL      15000   \n",
       "28637                  RENT                2.0          MEDICAL       6475   \n",
       "\n",
       "       loan_int_rate  loan_status                  event_timestamp  \\\n",
       "0              16.02            1 2021-08-25 20:34:41.361000+00:00   \n",
       "1              11.14            0 2021-08-25 20:16:20.128000+00:00   \n",
       "2              12.87            1 2021-08-25 19:57:58.896000+00:00   \n",
       "3              15.23            1 2021-08-25 19:39:37.663000+00:00   \n",
       "4              14.27            1 2021-08-25 19:21:16.430000+00:00   \n",
       "...              ...          ...                              ...   \n",
       "28633          13.16            0 2020-08-25 21:48:06.292000+00:00   \n",
       "28634           7.49            0 2020-08-25 21:29:45.059000+00:00   \n",
       "28635          10.99            1 2020-08-25 21:11:23.826000+00:00   \n",
       "28636          11.48            0 2020-08-25 20:53:02.594000+00:00   \n",
       "28637           9.99            0 2020-08-25 20:34:41.361000+00:00   \n",
       "\n",
       "                     created_timestamp  \n",
       "0     2021-08-25 20:34:41.361000+00:00  \n",
       "1     2021-08-25 20:16:20.128000+00:00  \n",
       "2     2021-08-25 19:57:58.896000+00:00  \n",
       "3     2021-08-25 19:39:37.663000+00:00  \n",
       "4     2021-08-25 19:21:16.430000+00:00  \n",
       "...                                ...  \n",
       "28633 2020-08-25 21:48:06.292000+00:00  \n",
       "28634 2020-08-25 21:29:45.059000+00:00  \n",
       "28635 2020-08-25 21:11:23.826000+00:00  \n",
       "28636 2020-08-25 20:53:02.594000+00:00  \n",
       "28637 2020-08-25 20:34:41.361000+00:00  \n",
       "\n",
       "[28638 rows x 13 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "loan_df = pd.read_parquet(\"data/loan_table.parquet\")\n",
    "display(loan_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "id": "hAHAb2nS6ClR"
   },
   "outputs": [],
   "source": [
    "feast_features = [\n",
    "    \"zipcode_features:city\",\n",
    "    \"zipcode_features:state\",\n",
    "    \"zipcode_features:location_type\",\n",
    "    \"zipcode_features:tax_returns_filed\",\n",
    "    \"zipcode_features:population\",\n",
    "    \"zipcode_features:total_wages\",\n",
    "    \"credit_history:credit_card_due\",\n",
    "    \"credit_history:mortgage_due\",\n",
    "    \"credit_history:student_loan_due\",\n",
    "    \"credit_history:vehicle_loan_due\",\n",
    "    \"credit_history:hard_pulls\",\n",
    "    \"credit_history:missed_payments_2y\",\n",
    "    \"credit_history:missed_payments_1y\",\n",
    "    \"credit_history:missed_payments_6m\",\n",
    "    \"credit_history:bankruptcies\",\n",
    "]\n",
    "\n",
    "loan_w_offline_feature = fs.get_historical_features(\n",
    "    entity_df=loan_df, features=feast_features\n",
    ").to_df()\n",
    "\n",
    "# Drop some unnecessary columns for simplicity\n",
    "loan_w_offline_feature = loan_w_offline_feature.drop([\"event_timestamp\", \"created_timestamp__\", \"loan_id\", \"zipcode\", \"dob_ssn\"], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TxDFT1776XgP"
   },
   "source": [
    "Now let's take a look at the training data as it is augmented by Feast."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 540
    },
    "id": "s-w2696D6h78",
    "outputId": "f98c1dbe-1916-41bc-d543-590bac08caf5"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>person_age</th>\n",
       "      <th>person_income</th>\n",
       "      <th>person_home_ownership</th>\n",
       "      <th>person_emp_length</th>\n",
       "      <th>loan_intent</th>\n",
       "      <th>loan_amnt</th>\n",
       "      <th>loan_int_rate</th>\n",
       "      <th>loan_status</th>\n",
       "      <th>city</th>\n",
       "      <th>state</th>\n",
       "      <th>...</th>\n",
       "      <th>total_wages</th>\n",
       "      <th>credit_card_due</th>\n",
       "      <th>mortgage_due</th>\n",
       "      <th>student_loan_due</th>\n",
       "      <th>vehicle_loan_due</th>\n",
       "      <th>hard_pulls</th>\n",
       "      <th>missed_payments_2y</th>\n",
       "      <th>missed_payments_1y</th>\n",
       "      <th>missed_payments_6m</th>\n",
       "      <th>bankruptcies</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1358886</th>\n",
       "      <td>55</td>\n",
       "      <td>24543</td>\n",
       "      <td>RENT</td>\n",
       "      <td>3.0</td>\n",
       "      <td>VENTURE</td>\n",
       "      <td>4000</td>\n",
       "      <td>13.92</td>\n",
       "      <td>0</td>\n",
       "      <td>SLIDELL</td>\n",
       "      <td>LA</td>\n",
       "      <td>...</td>\n",
       "      <td>315061217</td>\n",
       "      <td>1777</td>\n",
       "      <td>690650</td>\n",
       "      <td>46372</td>\n",
       "      <td>10439</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1358815</th>\n",
       "      <td>58</td>\n",
       "      <td>20000</td>\n",
       "      <td>RENT</td>\n",
       "      <td>0.0</td>\n",
       "      <td>EDUCATION</td>\n",
       "      <td>4000</td>\n",
       "      <td>9.99</td>\n",
       "      <td>0</td>\n",
       "      <td>CHOUTEAU</td>\n",
       "      <td>OK</td>\n",
       "      <td>...</td>\n",
       "      <td>59412230</td>\n",
       "      <td>1791</td>\n",
       "      <td>462670</td>\n",
       "      <td>19421</td>\n",
       "      <td>3583</td>\n",
       "      <td>8</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1353348</th>\n",
       "      <td>64</td>\n",
       "      <td>24000</td>\n",
       "      <td>RENT</td>\n",
       "      <td>1.0</td>\n",
       "      <td>MEDICAL</td>\n",
       "      <td>3000</td>\n",
       "      <td>6.99</td>\n",
       "      <td>0</td>\n",
       "      <td>BISMARCK</td>\n",
       "      <td>ND</td>\n",
       "      <td>...</td>\n",
       "      <td>469621263</td>\n",
       "      <td>5917</td>\n",
       "      <td>1780959</td>\n",
       "      <td>11835</td>\n",
       "      <td>27910</td>\n",
       "      <td>8</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1354200</th>\n",
       "      <td>55</td>\n",
       "      <td>34000</td>\n",
       "      <td>RENT</td>\n",
       "      <td>0.0</td>\n",
       "      <td>DEBTCONSOLIDATION</td>\n",
       "      <td>12000</td>\n",
       "      <td>6.92</td>\n",
       "      <td>1</td>\n",
       "      <td>SANTA BARBARA</td>\n",
       "      <td>CA</td>\n",
       "      <td>...</td>\n",
       "      <td>24537583</td>\n",
       "      <td>8091</td>\n",
       "      <td>364271</td>\n",
       "      <td>30248</td>\n",
       "      <td>22640</td>\n",
       "      <td>2</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1354271</th>\n",
       "      <td>51</td>\n",
       "      <td>74628</td>\n",
       "      <td>MORTGAGE</td>\n",
       "      <td>3.0</td>\n",
       "      <td>PERSONAL</td>\n",
       "      <td>3000</td>\n",
       "      <td>13.49</td>\n",
       "      <td>0</td>\n",
       "      <td>HUNTINGTON BEACH</td>\n",
       "      <td>CA</td>\n",
       "      <td>...</td>\n",
       "      <td>19749601</td>\n",
       "      <td>3679</td>\n",
       "      <td>1659968</td>\n",
       "      <td>37582</td>\n",
       "      <td>20284</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>674285</th>\n",
       "      <td>23</td>\n",
       "      <td>74000</td>\n",
       "      <td>RENT</td>\n",
       "      <td>3.0</td>\n",
       "      <td>MEDICAL</td>\n",
       "      <td>25000</td>\n",
       "      <td>10.36</td>\n",
       "      <td>1</td>\n",
       "      <td>MANSFIELD</td>\n",
       "      <td>MO</td>\n",
       "      <td>...</td>\n",
       "      <td>33180988</td>\n",
       "      <td>5176</td>\n",
       "      <td>1089963</td>\n",
       "      <td>44642</td>\n",
       "      <td>2877</td>\n",
       "      <td>1</td>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>668250</th>\n",
       "      <td>21</td>\n",
       "      <td>200000</td>\n",
       "      <td>MORTGAGE</td>\n",
       "      <td>2.0</td>\n",
       "      <td>DEBTCONSOLIDATION</td>\n",
       "      <td>25000</td>\n",
       "      <td>13.99</td>\n",
       "      <td>0</td>\n",
       "      <td>SALISBURY</td>\n",
       "      <td>MD</td>\n",
       "      <td>...</td>\n",
       "      <td>470634058</td>\n",
       "      <td>5297</td>\n",
       "      <td>1288915</td>\n",
       "      <td>22471</td>\n",
       "      <td>22630</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>668321</th>\n",
       "      <td>24</td>\n",
       "      <td>200000</td>\n",
       "      <td>MORTGAGE</td>\n",
       "      <td>3.0</td>\n",
       "      <td>VENTURE</td>\n",
       "      <td>24000</td>\n",
       "      <td>7.49</td>\n",
       "      <td>0</td>\n",
       "      <td>STRUNK</td>\n",
       "      <td>KY</td>\n",
       "      <td>...</td>\n",
       "      <td>10067358</td>\n",
       "      <td>6549</td>\n",
       "      <td>22399</td>\n",
       "      <td>11806</td>\n",
       "      <td>13005</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>670025</th>\n",
       "      <td>23</td>\n",
       "      <td>215000</td>\n",
       "      <td>MORTGAGE</td>\n",
       "      <td>7.0</td>\n",
       "      <td>MEDICAL</td>\n",
       "      <td>35000</td>\n",
       "      <td>14.79</td>\n",
       "      <td>0</td>\n",
       "      <td>HAWTHORN</td>\n",
       "      <td>PA</td>\n",
       "      <td>...</td>\n",
       "      <td>5956835</td>\n",
       "      <td>9079</td>\n",
       "      <td>876038</td>\n",
       "      <td>4556</td>\n",
       "      <td>21588</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2034006</th>\n",
       "      <td>22</td>\n",
       "      <td>59000</td>\n",
       "      <td>RENT</td>\n",
       "      <td>123.0</td>\n",
       "      <td>PERSONAL</td>\n",
       "      <td>35000</td>\n",
       "      <td>16.02</td>\n",
       "      <td>1</td>\n",
       "      <td>FORT WORTH</td>\n",
       "      <td>TX</td>\n",
       "      <td>...</td>\n",
       "      <td>142325465</td>\n",
       "      <td>8419</td>\n",
       "      <td>91803</td>\n",
       "      <td>22328</td>\n",
       "      <td>15078</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>28638 rows × 23 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         person_age  person_income person_home_ownership  person_emp_length  \\\n",
       "1358886          55          24543                  RENT                3.0   \n",
       "1358815          58          20000                  RENT                0.0   \n",
       "1353348          64          24000                  RENT                1.0   \n",
       "1354200          55          34000                  RENT                0.0   \n",
       "1354271          51          74628              MORTGAGE                3.0   \n",
       "...             ...            ...                   ...                ...   \n",
       "674285           23          74000                  RENT                3.0   \n",
       "668250           21         200000              MORTGAGE                2.0   \n",
       "668321           24         200000              MORTGAGE                3.0   \n",
       "670025           23         215000              MORTGAGE                7.0   \n",
       "2034006          22          59000                  RENT              123.0   \n",
       "\n",
       "               loan_intent  loan_amnt  loan_int_rate  loan_status  \\\n",
       "1358886            VENTURE       4000          13.92            0   \n",
       "1358815          EDUCATION       4000           9.99            0   \n",
       "1353348            MEDICAL       3000           6.99            0   \n",
       "1354200  DEBTCONSOLIDATION      12000           6.92            1   \n",
       "1354271           PERSONAL       3000          13.49            0   \n",
       "...                    ...        ...            ...          ...   \n",
       "674285             MEDICAL      25000          10.36            1   \n",
       "668250   DEBTCONSOLIDATION      25000          13.99            0   \n",
       "668321             VENTURE      24000           7.49            0   \n",
       "670025             MEDICAL      35000          14.79            0   \n",
       "2034006           PERSONAL      35000          16.02            1   \n",
       "\n",
       "                     city state  ... total_wages  credit_card_due  \\\n",
       "1358886           SLIDELL    LA  ...   315061217             1777   \n",
       "1358815          CHOUTEAU    OK  ...    59412230             1791   \n",
       "1353348          BISMARCK    ND  ...   469621263             5917   \n",
       "1354200     SANTA BARBARA    CA  ...    24537583             8091   \n",
       "1354271  HUNTINGTON BEACH    CA  ...    19749601             3679   \n",
       "...                   ...   ...  ...         ...              ...   \n",
       "674285          MANSFIELD    MO  ...    33180988             5176   \n",
       "668250          SALISBURY    MD  ...   470634058             5297   \n",
       "668321             STRUNK    KY  ...    10067358             6549   \n",
       "670025           HAWTHORN    PA  ...     5956835             9079   \n",
       "2034006        FORT WORTH    TX  ...   142325465             8419   \n",
       "\n",
       "         mortgage_due  student_loan_due  vehicle_loan_due  hard_pulls  \\\n",
       "1358886        690650             46372             10439           5   \n",
       "1358815        462670             19421              3583           8   \n",
       "1353348       1780959             11835             27910           8   \n",
       "1354200        364271             30248             22640           2   \n",
       "1354271       1659968             37582             20284           0   \n",
       "...               ...               ...               ...         ...   \n",
       "674285        1089963             44642              2877           1   \n",
       "668250        1288915             22471             22630           0   \n",
       "668321          22399             11806             13005           0   \n",
       "670025         876038              4556             21588           0   \n",
       "2034006         91803             22328             15078           0   \n",
       "\n",
       "         missed_payments_2y  missed_payments_1y  missed_payments_6m  \\\n",
       "1358886                   1                   2                   1   \n",
       "1358815                   7                   1                   0   \n",
       "1353348                   3                   2                   1   \n",
       "1354200                   7                   3                   0   \n",
       "1354271                   1                   0                   0   \n",
       "...                     ...                 ...                 ...   \n",
       "674285                    6                   1                   0   \n",
       "668250                    5                   2                   1   \n",
       "668321                    1                   0                   0   \n",
       "670025                    1                   0                   0   \n",
       "2034006                   1                   0                   0   \n",
       "\n",
       "         bankruptcies  \n",
       "1358886             0  \n",
       "1358815             2  \n",
       "1353348             0  \n",
       "1354200             0  \n",
       "1354271             0  \n",
       "...               ...  \n",
       "674285              0  \n",
       "668250              0  \n",
       "668321              0  \n",
       "670025              0  \n",
       "2034006             0  \n",
       "\n",
       "[28638 rows x 23 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(loan_w_offline_feature)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "KpiOm00789Rd"
   },
   "outputs": [],
   "source": [
    "# Convert the pandas DataFrame into a Ray Dataset, then split it\n",
    "# 80/20 into training and validation datasets.\n",
    "import ray\n",
    "\n",
    "loan_ds = ray.data.from_pandas(loan_w_offline_feature)\n",
    "train_ds, validation_ds = loan_ds.split_proportionately([0.8])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uFNp9sUCEGc2"
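   },
   "source": [
    "## Define Preprocessors\n",
    "\n",
    "A [Preprocessor](https://docs.ray.io/en/latest/ray-air/getting-started.html#preprocessors) performs last-mile processing on Ray Datasets before they are fed into the model for training. Here we chain two of them: an imputer that fills missing categorical values with the most frequent value, and an encoder that maps each category to an integer."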
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "id": "qravnrt9EBuW"
   },
   "outputs": [],
   "source": [
    "from ray.ml.preprocessors import Chain, OrdinalEncoder, SimpleImputer\n",
    "\n",
    "categorical_features = [\n",
    "    \"person_home_ownership\",\n",
    "    \"loan_intent\",\n",
    "    \"city\",\n",
    "    \"state\",\n",
    "    \"location_type\",\n",
    "]\n",
    "\n",
    "# Fill missing categorical values with the most frequent value per column.\n",
    "imputer = SimpleImputer(categorical_features, strategy=\"most_frequent\")\n",
    "# Encode each category as an integer so that XGBoost can consume it.\n",
    "encoder = OrdinalEncoder(columns=categorical_features)\n",
    "# Apply the imputer first, then the encoder.\n",
    "chained_preprocessor = Chain(imputer, encoder)"
   ]
  },
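  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the two preprocessing steps concrete, here is a rough pandas-only sketch of what most-frequent imputation followed by ordinal encoding does to a single column. It is an illustration under assumptions, not the Ray implementation: the toy `df` and its values are made up, and mapping categories to indices in sorted order is assumed for the example.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy column with one missing value.\n",
    "df = pd.DataFrame({\"loan_intent\": [\"MEDICAL\", None, \"VENTURE\", \"MEDICAL\"]})\n",
    "\n",
    "# Step 1: fill missing values with the most frequent category.\n",
    "most_frequent = df[\"loan_intent\"].mode()[0]\n",
    "df[\"loan_intent\"] = df[\"loan_intent\"].fillna(most_frequent)\n",
    "\n",
    "# Step 2: map each category to an integer index (sorted order assumed here).\n",
    "categories = sorted(df[\"loan_intent\"].unique())\n",
    "mapping = {name: index for index, name in enumerate(categories)}\n",
    "df[\"loan_intent\"] = df[\"loan_intent\"].map(mapping)\n",
    "\n",
    "print(df)\n",
    "```\n",
    "\n",
    "The order matters: imputing first means the encoder never sees a missing value, which is why the imputer comes before the encoder in the chain."
   ]
  },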
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SqPGGd6bEn0x"
   },
   "source": [
    "## Train XGBoost model using Ray AIR Trainer\n",
    "Ray AIR provides a variety of [Trainers](https://docs.ray.io/en/latest/ray-air/getting-started.html#trainer) that integrate with popular machine learning frameworks. Calling `trainer.fit()` trains the model in a distributed fashion on Ray. The result contains a Ray AIR [Checkpoint](https://docs.ray.io/en/latest/ray-air/getting-started.html#module-ray.ml.checkpoint), which lets us carry the trained model from training over to prediction. Let's take a look!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 670
    },
    "id": "995W14MdFmxl",
    "outputId": "417c1188-edf6-4310-dba8-366b71d77806"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/ray/anaconda3/lib/python3.8/site-packages/xgboost_ray/main.py:131: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n",
      "  XGBOOST_LOOSE_VERSION = LooseVersion(xgboost_version)\n",
      "E0602 14:26:17.861773834    4511 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers\n",
      "/home/ray/anaconda3/lib/python3.8/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "== Status ==<br>Current time: 2022-06-02 14:26:33 (running for 00:00:14.95)<br>Memory usage on this node: 3.5/30.6 GiB<br>Using FIFO scheduling algorithm.<br>Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/18.04 GiB heap, 0.0/9.02 GiB objects<br>Result logdir: /home/ray/ray_results/XGBoostTrainer_2022-06-02_14-26-17<br>Number of trials: 1/1 (1 TERMINATED)<br><table>\n",
       "<thead>\n",
       "<tr><th>Trial name                </th><th>status    </th><th>loc               </th><th style=\"text-align: right;\">  iter</th><th style=\"text-align: right;\">  total time (s)</th><th style=\"text-align: right;\">  train-logloss</th><th style=\"text-align: right;\">  train-error</th><th style=\"text-align: right;\">  validation-logloss</th></tr>\n",
       "</thead>\n",
       "<tbody>\n",
       "<tr><td>XGBoostTrainer_a3a2c_00000</td><td>TERMINATED</td><td>172.31.71.98:12634</td><td style=\"text-align: right;\">   100</td><td style=\"text-align: right;\">         11.9561</td><td style=\"text-align: right;\">      0.0578837</td><td style=\"text-align: right;\">    0.0127019</td><td style=\"text-align: right;\">            0.225994</td></tr>\n",
       "</tbody>\n",
       "</table><br><br>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "(GBDTTrainable pid=12634) 2022-06-02 14:26:23,018\tINFO main.py:980 -- [RayXGBoost] Created 1 new actors (1 total actors). Waiting until actors are ready for training.\n",
      "(GBDTTrainable pid=12634) 2022-06-02 14:26:25,230\tINFO main.py:1025 -- [RayXGBoost] Starting XGBoost training.\n",
      "(GBDTTrainable pid=12634) E0602 14:26:25.231635524   12691 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers\n",
      "(_RemoteRayXGBoostActor pid=12769) [14:26:25] task [xgboost.ray]:139712002042896 got new rank 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Result for XGBoostTrainer_a3a2c_00000:\n",
      "  date: 2022-06-02_14-26-26\n",
      "  done: false\n",
      "  experiment_id: 14b63d641b8e4583b3551b1af113ec6d\n",
      "  hostname: ip-172-31-71-98\n",
      "  iterations_since_restore: 1\n",
      "  node_ip: 172.31.71.98\n",
      "  pid: 12634\n",
      "  should_checkpoint: true\n",
      "  time_since_restore: 5.432286262512207\n",
      "  time_this_iter_s: 5.432286262512207\n",
      "  time_total_s: 5.432286262512207\n",
      "  timestamp: 1654205186\n",
      "  timesteps_since_restore: 0\n",
      "  train-error: 0.09502400698384984\n",
      "  train-logloss: 0.5147884634112437\n",
      "  training_iteration: 1\n",
      "  trial_id: a3a2c_00000\n",
      "  validation-error: 0.1627094972067039\n",
      "  validation-logloss: 0.5557328870197414\n",
      "  warmup_time: 0.004194974899291992\n",
      "  \n",
      "Result for XGBoostTrainer_a3a2c_00000:\n",
      "  date: 2022-06-02_14-26-31\n",
      "  done: false\n",
      "  experiment_id: 14b63d641b8e4583b3551b1af113ec6d\n",
      "  hostname: ip-172-31-71-98\n",
      "  iterations_since_restore: 81\n",
      "  node_ip: 172.31.71.98\n",
      "  pid: 12634\n",
      "  should_checkpoint: true\n",
      "  time_since_restore: 10.460175275802612\n",
      "  time_this_iter_s: 0.03925156593322754\n",
      "  time_total_s: 10.460175275802612\n",
      "  timestamp: 1654205191\n",
      "  timesteps_since_restore: 0\n",
      "  train-error: 0.01802706241815801\n",
      "  train-logloss: 0.07058723671627426\n",
      "  training_iteration: 81\n",
      "  trial_id: a3a2c_00000\n",
      "  validation-error: 0.0824022346368715\n",
      "  validation-logloss: 0.22556984905196217\n",
      "  warmup_time: 0.004194974899291992\n",
      "  \n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "(GBDTTrainable pid=12634) 2022-06-02 14:26:32,801\tINFO main.py:1516 -- [RayXGBoost] Finished XGBoost training on training data with total N=22,910 in 9.81 seconds (7.57 pure XGBoost training time).\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Result for XGBoostTrainer_a3a2c_00000:\n",
      "  date: 2022-06-02_14-26-32\n",
      "  done: true\n",
      "  experiment_id: 14b63d641b8e4583b3551b1af113ec6d\n",
      "  experiment_tag: '0'\n",
      "  hostname: ip-172-31-71-98\n",
      "  iterations_since_restore: 100\n",
      "  node_ip: 172.31.71.98\n",
      "  pid: 12634\n",
      "  should_checkpoint: true\n",
      "  time_since_restore: 11.956074953079224\n",
      "  time_this_iter_s: 0.022900104522705078\n",
      "  time_total_s: 11.956074953079224\n",
      "  timestamp: 1654205192\n",
      "  timesteps_since_restore: 0\n",
      "  train-error: 0.01270187690964644\n",
      "  train-logloss: 0.05788368908741939\n",
      "  training_iteration: 100\n",
      "  trial_id: a3a2c_00000\n",
      "  validation-error: 0.0825768156424581\n",
      "  validation-logloss: 0.22599449588356768\n",
      "  warmup_time: 0.004194974899291992\n",
      "  \n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-06-02 14:26:33,481\tINFO tune.py:752 -- Total run time: 15.62 seconds (14.94 seconds for the tuning loop).\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'checkpoint'"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from ray.ml.train.integrations.xgboost import XGBoostTrainer\n",
    "from ray.air.config import ScalingConfig\n",
    "\n",
    "LABEL = \"loan_status\"\n",
    "CHECKPOINT_PATH = \"checkpoint\"\n",
    "NUM_WORKERS = 1  # Change this based on the resources in the cluster.\n",
    "\n",
    "# XGBoost parameters for binary classification.\n",
    "params = {\n",
    "    \"tree_method\": \"approx\",\n",
    "    \"objective\": \"binary:logistic\",\n",
    "    \"eval_metric\": [\"logloss\", \"error\"],\n",
    "}\n",
    "\n",
    "trainer = XGBoostTrainer(\n",
    "    scaling_config=ScalingConfig(\n",
    "        num_workers=NUM_WORKERS,\n",
    "        use_gpu=False,\n",
    "    ),\n",
    "    label_column=LABEL,\n",
    "    params=params,\n",
    "    datasets={\"train\": train_ds, \"validation\": validation_ds},\n",
    "    preprocessor=chained_preprocessor,\n",
    "    num_boost_round=100,\n",
    ")\n",
    "checkpoint = trainer.fit().checkpoint\n",
    "# Save the checkpoint to disk so it can be reloaded for inference.\n",
    "checkpoint.to_directory(CHECKPOINT_PATH)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QJr73gOvKBza"
   },
   "source": [
    "## Inference\n",
    "From the Checkpoint object obtained in the previous section, we can construct a Ray AIR [Predictor](https://docs.ray.io/en/latest/ray-air/getting-started.html#predictors) that encapsulates everything needed for inference.\n",
    "\n",
    "Using the Predictor is straightforward: call `Predictor.predict()` on the input data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "id": "wE8ielQlKYSi"
   },
   "outputs": [],
   "source": [
    "from ray.ml.checkpoint import Checkpoint\n",
    "from ray.ml.predictors.integrations.xgboost import XGBoostPredictor\n",
    "predictor = XGBoostPredictor.from_checkpoint(Checkpoint.from_directory(CHECKPOINT_PATH))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "id": "K9Y9UiD3KqSW"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_4511/2153939661.py:23: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.\n",
      "Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n",
      "  loan_request_df = pd.DataFrame.from_dict(loan_request_dict, dtype=np.float)\n",
      "/tmp/ipykernel_4511/2153939661.py:23: FutureWarning: Could not cast to float64, falling back to object. This behavior is deprecated. In a future version, when a dtype is passed to 'DataFrame', either all columns will be cast to that dtype, or a TypeError will be raised.\n",
      "  loan_request_df = pd.DataFrame.from_dict(loan_request_dict, dtype=np.float)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>person_age</th>\n",
       "      <th>person_income</th>\n",
       "      <th>person_home_ownership</th>\n",
       "      <th>person_emp_length</th>\n",
       "      <th>loan_intent</th>\n",
       "      <th>loan_amnt</th>\n",
       "      <th>loan_int_rate</th>\n",
       "      <th>state</th>\n",
       "      <th>population</th>\n",
       "      <th>location_type</th>\n",
       "      <th>...</th>\n",
       "      <th>tax_returns_filed</th>\n",
       "      <th>student_loan_due</th>\n",
       "      <th>missed_payments_1y</th>\n",
       "      <th>hard_pulls</th>\n",
       "      <th>mortgage_due</th>\n",
       "      <th>bankruptcies</th>\n",
       "      <th>credit_card_due</th>\n",
       "      <th>missed_payments_2y</th>\n",
       "      <th>missed_payments_6m</th>\n",
       "      <th>vehicle_loan_due</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>133.0</td>\n",
       "      <td>59000.0</td>\n",
       "      <td>RENT</td>\n",
       "      <td>123.0</td>\n",
       "      <td>PERSONAL</td>\n",
       "      <td>35000.0</td>\n",
       "      <td>16.02</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1 rows × 22 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   person_age  person_income person_home_ownership  person_emp_length  \\\n",
       "0       133.0        59000.0                  RENT              123.0   \n",
       "\n",
       "  loan_intent  loan_amnt  loan_int_rate  state  population  location_type  \\\n",
       "0    PERSONAL    35000.0          16.02    NaN         NaN            NaN   \n",
       "\n",
       "   ...  tax_returns_filed  student_loan_due  missed_payments_1y  hard_pulls  \\\n",
       "0  ...                NaN               NaN                 NaN         NaN   \n",
       "\n",
       "   mortgage_due  bankruptcies  credit_card_due  missed_payments_2y  \\\n",
       "0           NaN           NaN              NaN                 NaN   \n",
       "\n",
       "   missed_payments_6m  vehicle_loan_due  \n",
       "0                 NaN               NaN  \n",
       "\n",
       "[1 rows x 22 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Now let's do some prediction.\n",
    "loan_request_dict = {\n",
    "    \"zipcode\": [76104],\n",
    "    \"dob_ssn\": [\"19630621_4278\"],\n",
    "    \"person_age\": [133],\n",
    "    \"person_income\": [59000],\n",
    "    \"person_home_ownership\": [\"RENT\"],\n",
    "    \"person_emp_length\": [123.0],\n",
    "    \"loan_intent\": [\"PERSONAL\"],\n",
    "    \"loan_amnt\": [35000],\n",
    "    \"loan_int_rate\": [16.02],\n",
    "}\n",
    "\n",
    "# Augment the request with online features from the feature store.\n",
    "zipcode = loan_request_dict[\"zipcode\"][0]\n",
    "dob_ssn = loan_request_dict[\"dob_ssn\"][0]\n",
    "online_features = fs.get_online_features(\n",
    "    entity_rows=[{\"zipcode\": zipcode, \"dob_ssn\": dob_ssn}],\n",
    "    features=feast_features,\n",
    ").to_dict()\n",
    "loan_request_dict.update(online_features)\n",
    "# Use the builtin `float`; the `np.float` alias is deprecated since NumPy 1.20.\n",
    "loan_request_df = pd.DataFrame.from_dict(loan_request_dict, dtype=float)\n",
    "# Drop the entity keys, which are only needed for the feature lookup.\n",
    "loan_request_df = loan_request_df.drop([\"zipcode\", \"dob_ssn\"], axis=1)\n",
    "display(loan_request_df)"
   ]
  },
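  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the `binary:logistic` objective, XGBoost outputs a score between 0 and 1 that can be read as the probability of `loan_status == 1`. The sketch below shows how such scores turn into approve/reject decisions through a 0.5 cut-off, which is the same decision `np.round` makes. The `scores` array and the `THRESHOLD` name are made up for illustration; they are not outputs of the model trained above.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Hypothetical model outputs standing in for the \"predictions\" column.\n",
    "scores = np.array([0.12, 0.57, 0.93])\n",
    "THRESHOLD = 0.5  # assumed cut-off; matches rounding for scores in [0, 1]\n",
    "\n",
    "# A score above the threshold is labelled 1, otherwise 0.\n",
    "labels = (scores > THRESHOLD).astype(int)\n",
    "for score, label in zip(scores, labels):\n",
    "    decision = \"rejected\" if label == 1 else \"approved\"\n",
    "    print(f\"score={score:.2f} -> loan {decision}\")\n",
    "```\n",
    "\n",
    "Keeping the threshold explicit makes it easy to tighten or relax the approval rule later without retraining the model."
   ]
  },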
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "id": "eS7_n1GPLL1e"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loan rejected!\n"
     ]
    }
   ],
   "source": [
    "# Run the request through our predictor using the `Predictor.predict()` API.\n",
    "# Rounding turns the predicted score into a 0/1 label.\n",
    "loan_result = np.round(predictor.predict(loan_request_df)[\"predictions\"][0])\n",
    "\n",
    "if loan_result == 0:\n",
    "    print(\"Loan approved!\")\n",
    "elif loan_result == 1:\n",
    "    print(\"Loan rejected!\")"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "air + feast",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}