ray/doc/source/ray-air/examples/feast_example.ipynb
2022-06-07 14:55:42 -07:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Integrate Ray AIR with the Feast feature store"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# !pip install feast==0.20.1 \"ray[air]>=1.13\" xgboost_ray"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "INyNIaeB1Kza"
},
"source": [
"This example shows how to use Ray AIR with the Feast feature store, using historical features to train a model and online features for inference.\n",
"\n",
"The task is adapted from the [Feast credit scoring tutorial](https://github.com/feast-dev/feast-aws-credit-scoring-tutorial). We train an XGBoost model and then run prediction on an incoming loan request to see whether it is approved or rejected."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sBC9CCrpzQLF"
},
"source": [
"Let's first set up our workspace and prepare the data to work with."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "DcPIskZlzSal"
},
"outputs": [],
"source": [
"import os\n",
"WORKING_DIR = os.path.expanduser(\"~/ray-air-feast-example/\")\n",
"%env WORKING_DIR=$WORKING_DIR"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "BcyCKjV3zTCK",
"outputId": "afdfa24d-e5ce-49db-c904-e961e1eb910c"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2022-06-02 14:22:50-- https://github.com/ray-project/air-sample-data/raw/main/air-feast-example.zip\n",
"Resolving github.com (github.com)... 192.30.255.113\n",
"Connecting to github.com (github.com)|192.30.255.113|:443... connected.\n",
"HTTP request sent, awaiting response... 302 Found\n",
"Location: https://raw.githubusercontent.com/ray-project/air-sample-data/main/air-feast-example.zip [following]\n",
"--2022-06-02 14:22:50-- https://raw.githubusercontent.com/ray-project/air-sample-data/main/air-feast-example.zip\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 23715107 (23M) [application/zip]\n",
"Saving to: air-feast-example.zip\n",
"\n",
"air-feast-example.z 100%[===================>] 22.62M 114MB/s in 0.2s \n",
"\n",
"2022-06-02 14:22:51 (114 MB/s) - air-feast-example.zip saved [23715107/23715107]\n",
"\n",
"Archive: air-feast-example.zip\n",
" creating: air-feast-example/\n",
" creating: air-feast-example/feature_repo/\n",
" inflating: air-feast-example/feature_repo/.DS_Store \n",
" extracting: air-feast-example/feature_repo/__init__.py \n",
" inflating: air-feast-example/feature_repo/features.py \n",
" creating: air-feast-example/feature_repo/data/\n",
" inflating: air-feast-example/feature_repo/data/.DS_Store \n",
" inflating: air-feast-example/feature_repo/data/credit_history_sample.csv \n",
" inflating: air-feast-example/feature_repo/data/zipcode_table_sample.csv \n",
" inflating: air-feast-example/feature_repo/data/credit_history.parquet \n",
" inflating: air-feast-example/feature_repo/data/zipcode_table.parquet \n",
" inflating: air-feast-example/feature_repo/feature_store.yaml \n",
" inflating: air-feast-example/.DS_Store \n",
" creating: air-feast-example/data/\n",
" inflating: air-feast-example/data/loan_table.parquet \n",
" inflating: air-feast-example/data/loan_table_sample.csv \n"
]
}
],
"source": [
"! mkdir -p $WORKING_DIR\n",
"! wget --no-check-certificate https://github.com/ray-project/air-sample-data/raw/main/air-feast-example.zip\n",
"! unzip air-feast-example.zip \n",
"! mv air-feast-example/* $WORKING_DIR\n",
"%cd $WORKING_DIR"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "iNbC-Qqi3Lq_",
"outputId": "99576086-12dd-4f96-fb51-de40b77b15ce"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"data feature_repo\n"
]
}
],
"source": [
"! ls"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "c_3wlEus4dYO"
},
"source": [
"A feature repository is already set up in `feature_repo/`, so there is no need to create one. For reference, a new repository can be created with `feast init -t local feature_repo`.\n",
"\n",
"Now let's take a look at the schema in the Feast feature store, which is defined by `feature_repo/features.py`. It defines two feature views, `zipcode_features` and `credit_history`, both backed by Parquet files: `feature_repo/data/zipcode_table.parquet` and `feature_repo/data/credit_history.parquet`."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "5VGLhPLLzlGW",
"outputId": "a3f3499e-c140-4ceb-a66d-2f1a6b8a2142"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"from datetime import timedelta\n",
"\n",
"from feast import (Entity, Field, FeatureView, FileSource, ValueType)\n",
"from feast.types import Float32, Int64, String\n",
"\n",
"\n",
"zipcode = Entity(name=\"zipcode\", value_type=Int64)\n",
"\n",
"zipcode_source = FileSource(\n",
"    path=\"feature_repo/data/zipcode_table.parquet\",\n",
"    timestamp_field=\"event_timestamp\",\n",
"    created_timestamp_column=\"created_timestamp\",\n",
")\n",
"\n",
"zipcode_features = FeatureView(\n",
"    name=\"zipcode_features\",\n",
"    entities=[\"zipcode\"],\n",
"    ttl=timedelta(days=3650),\n",
"    schema=[\n",
"        Field(name=\"city\", dtype=String),\n",
"        Field(name=\"state\", dtype=String),\n",
"        Field(name=\"location_type\", dtype=String),\n",
"        Field(name=\"tax_returns_filed\", dtype=Int64),\n",
"        Field(name=\"population\", dtype=Int64),\n",
"        Field(name=\"total_wages\", dtype=Int64),\n",
"    ],\n",
"    source=zipcode_source,\n",
")\n",
"\n",
"dob_ssn = Entity(\n",
"    name=\"dob_ssn\",\n",
"    value_type=ValueType.STRING,\n",
"    description=\"Date of birth and last four digits of social security number\",\n",
")\n",
"\n",
"credit_history_source = FileSource(\n",
"    path=\"feature_repo/data/credit_history.parquet\",\n",
"    timestamp_field=\"event_timestamp\",\n",
"    created_timestamp_column=\"created_timestamp\",\n",
")\n",
"\n",
"credit_history = FeatureView(\n",
"    name=\"credit_history\",\n",
"    entities=[\"dob_ssn\"],\n",
"    ttl=timedelta(days=90),\n",
"    schema=[\n",
"        Field(name=\"credit_card_due\", dtype=Int64),\n",
"        Field(name=\"mortgage_due\", dtype=Int64),\n",
"        Field(name=\"student_loan_due\", dtype=Int64),\n",
"        Field(name=\"vehicle_loan_due\", dtype=Int64),\n",
"        Field(name=\"hard_pulls\", dtype=Int64),\n",
"        Field(name=\"missed_payments_2y\", dtype=Int64),\n",
"        Field(name=\"missed_payments_1y\", dtype=Int64),\n",
"        Field(name=\"missed_payments_6m\", dtype=Int64),\n",
"        Field(name=\"bankruptcies\", dtype=Int64),\n",
"    ],\n",
"    source=credit_history_source,\n",
")\n"
]
}
],
"source": [
"!pygmentize feature_repo/features.py"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HQmrfEV33_SM"
},
"source": [
"Deploy the feature store defined above by running `feast apply` from within the `feature_repo/` folder."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "SbL_EbMC2MFS",
"outputId": "13b07f1f-d52a-4c4e-a73f-f5478c0304de"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Feast is an open source project that collects anonymized error reporting and usage statistics. To opt out or learn more see https://docs.feast.dev/reference/usage\n",
"Created entity dob_ssn\n",
"Created entity zipcode\n",
"Created feature view credit_history\n",
"Created feature view zipcode_features\n",
"\n",
"Created sqlite table feature_repo_credit_history\n",
"Created sqlite table feature_repo_zipcode_features\n",
"\n"
]
}
],
"source": [
"! (cd feature_repo && feast apply)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "D-9Kr-kdzg1r",
"outputId": "beabd2f7-c1a6-4fe3-b087-b35b7b9ab56b"
},
"outputs": [],
"source": [
"import feast\n",
"fs = feast.FeatureStore(repo_path=\"feature_repo\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5nR8uNE8z-YQ"
},
"source": [
"## Generate training data\n",
"In addition to the features in Feast, we have labeled training data at `data/loan_table.parquet`. At training time, the loan table is passed to Feast as an entity dataframe. Feast performs a point-in-time join against the `credit_history` and `zipcode_features` feature views, so each loan row is augmented with the feature values that were valid at its `event_timestamp`."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
},
"id": "twBCJMzVzV0X",
"outputId": "efb41c7a-2802-4169-906e-7db1c37d8c8e"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>loan_id</th>\n",
" <th>dob_ssn</th>\n",
" <th>zipcode</th>\n",
" <th>person_age</th>\n",
" <th>person_income</th>\n",
" <th>person_home_ownership</th>\n",
" <th>person_emp_length</th>\n",
" <th>loan_intent</th>\n",
" <th>loan_amnt</th>\n",
" <th>loan_int_rate</th>\n",
" <th>loan_status</th>\n",
" <th>event_timestamp</th>\n",
" <th>created_timestamp</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>10000</td>\n",
" <td>19530219_5179</td>\n",
" <td>76104</td>\n",
" <td>22</td>\n",
" <td>59000</td>\n",
" <td>RENT</td>\n",
" <td>123.0</td>\n",
" <td>PERSONAL</td>\n",
" <td>35000</td>\n",
" <td>16.02</td>\n",
" <td>1</td>\n",
" <td>2021-08-25 20:34:41.361000+00:00</td>\n",
" <td>2021-08-25 20:34:41.361000+00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>10001</td>\n",
" <td>19520816_8737</td>\n",
" <td>70380</td>\n",
" <td>21</td>\n",
" <td>9600</td>\n",
" <td>OWN</td>\n",
" <td>5.0</td>\n",
" <td>EDUCATION</td>\n",
" <td>1000</td>\n",
" <td>11.14</td>\n",
" <td>0</td>\n",
" <td>2021-08-25 20:16:20.128000+00:00</td>\n",
" <td>2021-08-25 20:16:20.128000+00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>10002</td>\n",
" <td>19860413_2537</td>\n",
" <td>97039</td>\n",
" <td>25</td>\n",
" <td>9600</td>\n",
" <td>MORTGAGE</td>\n",
" <td>1.0</td>\n",
" <td>MEDICAL</td>\n",
" <td>5500</td>\n",
" <td>12.87</td>\n",
" <td>1</td>\n",
" <td>2021-08-25 19:57:58.896000+00:00</td>\n",
" <td>2021-08-25 19:57:58.896000+00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>10003</td>\n",
" <td>19760701_8090</td>\n",
" <td>63785</td>\n",
" <td>23</td>\n",
" <td>65500</td>\n",
" <td>RENT</td>\n",
" <td>4.0</td>\n",
" <td>MEDICAL</td>\n",
" <td>35000</td>\n",
" <td>15.23</td>\n",
" <td>1</td>\n",
" <td>2021-08-25 19:39:37.663000+00:00</td>\n",
" <td>2021-08-25 19:39:37.663000+00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>10004</td>\n",
" <td>19830125_8297</td>\n",
" <td>82223</td>\n",
" <td>24</td>\n",
" <td>54400</td>\n",
" <td>RENT</td>\n",
" <td>8.0</td>\n",
" <td>MEDICAL</td>\n",
" <td>35000</td>\n",
" <td>14.27</td>\n",
" <td>1</td>\n",
" <td>2021-08-25 19:21:16.430000+00:00</td>\n",
" <td>2021-08-25 19:21:16.430000+00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28633</th>\n",
" <td>38633</td>\n",
" <td>19491126_1487</td>\n",
" <td>43205</td>\n",
" <td>57</td>\n",
" <td>53000</td>\n",
" <td>MORTGAGE</td>\n",
" <td>1.0</td>\n",
" <td>PERSONAL</td>\n",
" <td>5800</td>\n",
" <td>13.16</td>\n",
" <td>0</td>\n",
" <td>2020-08-25 21:48:06.292000+00:00</td>\n",
" <td>2020-08-25 21:48:06.292000+00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28634</th>\n",
" <td>38634</td>\n",
" <td>19681208_6537</td>\n",
" <td>24872</td>\n",
" <td>54</td>\n",
" <td>120000</td>\n",
" <td>MORTGAGE</td>\n",
" <td>4.0</td>\n",
" <td>PERSONAL</td>\n",
" <td>17625</td>\n",
" <td>7.49</td>\n",
" <td>0</td>\n",
" <td>2020-08-25 21:29:45.059000+00:00</td>\n",
" <td>2020-08-25 21:29:45.059000+00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28635</th>\n",
" <td>38635</td>\n",
" <td>19880422_2592</td>\n",
" <td>68826</td>\n",
" <td>65</td>\n",
" <td>76000</td>\n",
" <td>RENT</td>\n",
" <td>3.0</td>\n",
" <td>HOMEIMPROVEMENT</td>\n",
" <td>35000</td>\n",
" <td>10.99</td>\n",
" <td>1</td>\n",
" <td>2020-08-25 21:11:23.826000+00:00</td>\n",
" <td>2020-08-25 21:11:23.826000+00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28636</th>\n",
" <td>38636</td>\n",
" <td>19901017_6108</td>\n",
" <td>92014</td>\n",
" <td>56</td>\n",
" <td>150000</td>\n",
" <td>MORTGAGE</td>\n",
" <td>5.0</td>\n",
" <td>PERSONAL</td>\n",
" <td>15000</td>\n",
" <td>11.48</td>\n",
" <td>0</td>\n",
" <td>2020-08-25 20:53:02.594000+00:00</td>\n",
" <td>2020-08-25 20:53:02.594000+00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28637</th>\n",
" <td>38637</td>\n",
" <td>19960703_3449</td>\n",
" <td>69033</td>\n",
" <td>66</td>\n",
" <td>42000</td>\n",
" <td>RENT</td>\n",
" <td>2.0</td>\n",
" <td>MEDICAL</td>\n",
" <td>6475</td>\n",
" <td>9.99</td>\n",
" <td>0</td>\n",
" <td>2020-08-25 20:34:41.361000+00:00</td>\n",
" <td>2020-08-25 20:34:41.361000+00:00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>28638 rows × 13 columns</p>\n",
"</div>"
],
"text/plain": [
" loan_id dob_ssn zipcode person_age person_income \\\n",
"0 10000 19530219_5179 76104 22 59000 \n",
"1 10001 19520816_8737 70380 21 9600 \n",
"2 10002 19860413_2537 97039 25 9600 \n",
"3 10003 19760701_8090 63785 23 65500 \n",
"4 10004 19830125_8297 82223 24 54400 \n",
"... ... ... ... ... ... \n",
"28633 38633 19491126_1487 43205 57 53000 \n",
"28634 38634 19681208_6537 24872 54 120000 \n",
"28635 38635 19880422_2592 68826 65 76000 \n",
"28636 38636 19901017_6108 92014 56 150000 \n",
"28637 38637 19960703_3449 69033 66 42000 \n",
"\n",
" person_home_ownership person_emp_length loan_intent loan_amnt \\\n",
"0 RENT 123.0 PERSONAL 35000 \n",
"1 OWN 5.0 EDUCATION 1000 \n",
"2 MORTGAGE 1.0 MEDICAL 5500 \n",
"3 RENT 4.0 MEDICAL 35000 \n",
"4 RENT 8.0 MEDICAL 35000 \n",
"... ... ... ... ... \n",
"28633 MORTGAGE 1.0 PERSONAL 5800 \n",
"28634 MORTGAGE 4.0 PERSONAL 17625 \n",
"28635 RENT 3.0 HOMEIMPROVEMENT 35000 \n",
"28636 MORTGAGE 5.0 PERSONAL 15000 \n",
"28637 RENT 2.0 MEDICAL 6475 \n",
"\n",
" loan_int_rate loan_status event_timestamp \\\n",
"0 16.02 1 2021-08-25 20:34:41.361000+00:00 \n",
"1 11.14 0 2021-08-25 20:16:20.128000+00:00 \n",
"2 12.87 1 2021-08-25 19:57:58.896000+00:00 \n",
"3 15.23 1 2021-08-25 19:39:37.663000+00:00 \n",
"4 14.27 1 2021-08-25 19:21:16.430000+00:00 \n",
"... ... ... ... \n",
"28633 13.16 0 2020-08-25 21:48:06.292000+00:00 \n",
"28634 7.49 0 2020-08-25 21:29:45.059000+00:00 \n",
"28635 10.99 1 2020-08-25 21:11:23.826000+00:00 \n",
"28636 11.48 0 2020-08-25 20:53:02.594000+00:00 \n",
"28637 9.99 0 2020-08-25 20:34:41.361000+00:00 \n",
"\n",
" created_timestamp \n",
"0 2021-08-25 20:34:41.361000+00:00 \n",
"1 2021-08-25 20:16:20.128000+00:00 \n",
"2 2021-08-25 19:57:58.896000+00:00 \n",
"3 2021-08-25 19:39:37.663000+00:00 \n",
"4 2021-08-25 19:21:16.430000+00:00 \n",
"... ... \n",
"28633 2020-08-25 21:48:06.292000+00:00 \n",
"28634 2020-08-25 21:29:45.059000+00:00 \n",
"28635 2020-08-25 21:11:23.826000+00:00 \n",
"28636 2020-08-25 20:53:02.594000+00:00 \n",
"28637 2020-08-25 20:34:41.361000+00:00 \n",
"\n",
"[28638 rows x 13 columns]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"loan_df = pd.read_parquet(\"data/loan_table.parquet\")\n",
"display(loan_df)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"id": "hAHAb2nS6ClR"
},
"outputs": [],
"source": [
"feast_features = [\n",
" \"zipcode_features:city\",\n",
" \"zipcode_features:state\",\n",
" \"zipcode_features:location_type\",\n",
" \"zipcode_features:tax_returns_filed\",\n",
" \"zipcode_features:population\",\n",
" \"zipcode_features:total_wages\",\n",
" \"credit_history:credit_card_due\",\n",
" \"credit_history:mortgage_due\",\n",
" \"credit_history:student_loan_due\",\n",
" \"credit_history:vehicle_loan_due\",\n",
" \"credit_history:hard_pulls\",\n",
" \"credit_history:missed_payments_2y\",\n",
" \"credit_history:missed_payments_1y\",\n",
" \"credit_history:missed_payments_6m\",\n",
" \"credit_history:bankruptcies\",\n",
"]\n",
"\n",
"loan_w_offline_feature = fs.get_historical_features(\n",
" entity_df=loan_df, features=feast_features\n",
").to_df()\n",
"\n",
"# Drop some unnecessary columns for simplicity\n",
"loan_w_offline_feature = loan_w_offline_feature.drop([\"event_timestamp\", \"created_timestamp__\", \"loan_id\", \"zipcode\", \"dob_ssn\"], axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TxDFT1776XgP"
},
"source": [
"Now let's take a look at the training data as it is augmented by Feast."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 540
},
"id": "s-w2696D6h78",
"outputId": "f98c1dbe-1916-41bc-d543-590bac08caf5"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>person_age</th>\n",
" <th>person_income</th>\n",
" <th>person_home_ownership</th>\n",
" <th>person_emp_length</th>\n",
" <th>loan_intent</th>\n",
" <th>loan_amnt</th>\n",
" <th>loan_int_rate</th>\n",
" <th>loan_status</th>\n",
" <th>city</th>\n",
" <th>state</th>\n",
" <th>...</th>\n",
" <th>total_wages</th>\n",
" <th>credit_card_due</th>\n",
" <th>mortgage_due</th>\n",
" <th>student_loan_due</th>\n",
" <th>vehicle_loan_due</th>\n",
" <th>hard_pulls</th>\n",
" <th>missed_payments_2y</th>\n",
" <th>missed_payments_1y</th>\n",
" <th>missed_payments_6m</th>\n",
" <th>bankruptcies</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1358886</th>\n",
" <td>55</td>\n",
" <td>24543</td>\n",
" <td>RENT</td>\n",
" <td>3.0</td>\n",
" <td>VENTURE</td>\n",
" <td>4000</td>\n",
" <td>13.92</td>\n",
" <td>0</td>\n",
" <td>SLIDELL</td>\n",
" <td>LA</td>\n",
" <td>...</td>\n",
" <td>315061217</td>\n",
" <td>1777</td>\n",
" <td>690650</td>\n",
" <td>46372</td>\n",
" <td>10439</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1358815</th>\n",
" <td>58</td>\n",
" <td>20000</td>\n",
" <td>RENT</td>\n",
" <td>0.0</td>\n",
" <td>EDUCATION</td>\n",
" <td>4000</td>\n",
" <td>9.99</td>\n",
" <td>0</td>\n",
" <td>CHOUTEAU</td>\n",
" <td>OK</td>\n",
" <td>...</td>\n",
" <td>59412230</td>\n",
" <td>1791</td>\n",
" <td>462670</td>\n",
" <td>19421</td>\n",
" <td>3583</td>\n",
" <td>8</td>\n",
" <td>7</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1353348</th>\n",
" <td>64</td>\n",
" <td>24000</td>\n",
" <td>RENT</td>\n",
" <td>1.0</td>\n",
" <td>MEDICAL</td>\n",
" <td>3000</td>\n",
" <td>6.99</td>\n",
" <td>0</td>\n",
" <td>BISMARCK</td>\n",
" <td>ND</td>\n",
" <td>...</td>\n",
" <td>469621263</td>\n",
" <td>5917</td>\n",
" <td>1780959</td>\n",
" <td>11835</td>\n",
" <td>27910</td>\n",
" <td>8</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1354200</th>\n",
" <td>55</td>\n",
" <td>34000</td>\n",
" <td>RENT</td>\n",
" <td>0.0</td>\n",
" <td>DEBTCONSOLIDATION</td>\n",
" <td>12000</td>\n",
" <td>6.92</td>\n",
" <td>1</td>\n",
" <td>SANTA BARBARA</td>\n",
" <td>CA</td>\n",
" <td>...</td>\n",
" <td>24537583</td>\n",
" <td>8091</td>\n",
" <td>364271</td>\n",
" <td>30248</td>\n",
" <td>22640</td>\n",
" <td>2</td>\n",
" <td>7</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1354271</th>\n",
" <td>51</td>\n",
" <td>74628</td>\n",
" <td>MORTGAGE</td>\n",
" <td>3.0</td>\n",
" <td>PERSONAL</td>\n",
" <td>3000</td>\n",
" <td>13.49</td>\n",
" <td>0</td>\n",
" <td>HUNTINGTON BEACH</td>\n",
" <td>CA</td>\n",
" <td>...</td>\n",
" <td>19749601</td>\n",
" <td>3679</td>\n",
" <td>1659968</td>\n",
" <td>37582</td>\n",
" <td>20284</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>674285</th>\n",
" <td>23</td>\n",
" <td>74000</td>\n",
" <td>RENT</td>\n",
" <td>3.0</td>\n",
" <td>MEDICAL</td>\n",
" <td>25000</td>\n",
" <td>10.36</td>\n",
" <td>1</td>\n",
" <td>MANSFIELD</td>\n",
" <td>MO</td>\n",
" <td>...</td>\n",
" <td>33180988</td>\n",
" <td>5176</td>\n",
" <td>1089963</td>\n",
" <td>44642</td>\n",
" <td>2877</td>\n",
" <td>1</td>\n",
" <td>6</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>668250</th>\n",
" <td>21</td>\n",
" <td>200000</td>\n",
" <td>MORTGAGE</td>\n",
" <td>2.0</td>\n",
" <td>DEBTCONSOLIDATION</td>\n",
" <td>25000</td>\n",
" <td>13.99</td>\n",
" <td>0</td>\n",
" <td>SALISBURY</td>\n",
" <td>MD</td>\n",
" <td>...</td>\n",
" <td>470634058</td>\n",
" <td>5297</td>\n",
" <td>1288915</td>\n",
" <td>22471</td>\n",
" <td>22630</td>\n",
" <td>0</td>\n",
" <td>5</td>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>668321</th>\n",
" <td>24</td>\n",
" <td>200000</td>\n",
" <td>MORTGAGE</td>\n",
" <td>3.0</td>\n",
" <td>VENTURE</td>\n",
" <td>24000</td>\n",
" <td>7.49</td>\n",
" <td>0</td>\n",
" <td>STRUNK</td>\n",
" <td>KY</td>\n",
" <td>...</td>\n",
" <td>10067358</td>\n",
" <td>6549</td>\n",
" <td>22399</td>\n",
" <td>11806</td>\n",
" <td>13005</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>670025</th>\n",
" <td>23</td>\n",
" <td>215000</td>\n",
" <td>MORTGAGE</td>\n",
" <td>7.0</td>\n",
" <td>MEDICAL</td>\n",
" <td>35000</td>\n",
" <td>14.79</td>\n",
" <td>0</td>\n",
" <td>HAWTHORN</td>\n",
" <td>PA</td>\n",
" <td>...</td>\n",
" <td>5956835</td>\n",
" <td>9079</td>\n",
" <td>876038</td>\n",
" <td>4556</td>\n",
" <td>21588</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2034006</th>\n",
" <td>22</td>\n",
" <td>59000</td>\n",
" <td>RENT</td>\n",
" <td>123.0</td>\n",
" <td>PERSONAL</td>\n",
" <td>35000</td>\n",
" <td>16.02</td>\n",
" <td>1</td>\n",
" <td>FORT WORTH</td>\n",
" <td>TX</td>\n",
" <td>...</td>\n",
" <td>142325465</td>\n",
" <td>8419</td>\n",
" <td>91803</td>\n",
" <td>22328</td>\n",
" <td>15078</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>28638 rows × 23 columns</p>\n",
"</div>"
],
"text/plain": [
" person_age person_income person_home_ownership person_emp_length \\\n",
"1358886 55 24543 RENT 3.0 \n",
"1358815 58 20000 RENT 0.0 \n",
"1353348 64 24000 RENT 1.0 \n",
"1354200 55 34000 RENT 0.0 \n",
"1354271 51 74628 MORTGAGE 3.0 \n",
"... ... ... ... ... \n",
"674285 23 74000 RENT 3.0 \n",
"668250 21 200000 MORTGAGE 2.0 \n",
"668321 24 200000 MORTGAGE 3.0 \n",
"670025 23 215000 MORTGAGE 7.0 \n",
"2034006 22 59000 RENT 123.0 \n",
"\n",
" loan_intent loan_amnt loan_int_rate loan_status \\\n",
"1358886 VENTURE 4000 13.92 0 \n",
"1358815 EDUCATION 4000 9.99 0 \n",
"1353348 MEDICAL 3000 6.99 0 \n",
"1354200 DEBTCONSOLIDATION 12000 6.92 1 \n",
"1354271 PERSONAL 3000 13.49 0 \n",
"... ... ... ... ... \n",
"674285 MEDICAL 25000 10.36 1 \n",
"668250 DEBTCONSOLIDATION 25000 13.99 0 \n",
"668321 VENTURE 24000 7.49 0 \n",
"670025 MEDICAL 35000 14.79 0 \n",
"2034006 PERSONAL 35000 16.02 1 \n",
"\n",
" city state ... total_wages credit_card_due \\\n",
"1358886 SLIDELL LA ... 315061217 1777 \n",
"1358815 CHOUTEAU OK ... 59412230 1791 \n",
"1353348 BISMARCK ND ... 469621263 5917 \n",
"1354200 SANTA BARBARA CA ... 24537583 8091 \n",
"1354271 HUNTINGTON BEACH CA ... 19749601 3679 \n",
"... ... ... ... ... ... \n",
"674285 MANSFIELD MO ... 33180988 5176 \n",
"668250 SALISBURY MD ... 470634058 5297 \n",
"668321 STRUNK KY ... 10067358 6549 \n",
"670025 HAWTHORN PA ... 5956835 9079 \n",
"2034006 FORT WORTH TX ... 142325465 8419 \n",
"\n",
" mortgage_due student_loan_due vehicle_loan_due hard_pulls \\\n",
"1358886 690650 46372 10439 5 \n",
"1358815 462670 19421 3583 8 \n",
"1353348 1780959 11835 27910 8 \n",
"1354200 364271 30248 22640 2 \n",
"1354271 1659968 37582 20284 0 \n",
"... ... ... ... ... \n",
"674285 1089963 44642 2877 1 \n",
"668250 1288915 22471 22630 0 \n",
"668321 22399 11806 13005 0 \n",
"670025 876038 4556 21588 0 \n",
"2034006 91803 22328 15078 0 \n",
"\n",
" missed_payments_2y missed_payments_1y missed_payments_6m \\\n",
"1358886 1 2 1 \n",
"1358815 7 1 0 \n",
"1353348 3 2 1 \n",
"1354200 7 3 0 \n",
"1354271 1 0 0 \n",
"... ... ... ... \n",
"674285 6 1 0 \n",
"668250 5 2 1 \n",
"668321 1 0 0 \n",
"670025 1 0 0 \n",
"2034006 1 0 0 \n",
"\n",
" bankruptcies \n",
"1358886 0 \n",
"1358815 2 \n",
"1353348 0 \n",
"1354200 0 \n",
"1354271 0 \n",
"... ... \n",
"674285 0 \n",
"668250 0 \n",
"668321 0 \n",
"670025 0 \n",
"2034006 0 \n",
"\n",
"[28638 rows x 23 columns]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(loan_w_offline_feature)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "KpiOm00789Rd"
},
"outputs": [],
"source": [
"# Convert into Train and Validation datasets.\n",
"import ray\n",
"\n",
"loan_ds = ray.data.from_pandas(loan_w_offline_feature)\n",
"train_ds, validation_ds = loan_ds.split_proportionately([0.8])\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uFNp9sUCEGc2"
},
"source": [
"## Define Preprocessors\n",
"\n",
"A [Preprocessor](https://docs.ray.io/en/latest/ray-air/getting-started.html#preprocessors) performs last-mile processing on Ray Datasets before they are fed into the training model."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"id": "qravnrt9EBuW"
},
"outputs": [],
"source": [
"categorical_features = [\n",
" \"person_home_ownership\",\n",
" \"loan_intent\",\n",
" \"city\",\n",
" \"state\",\n",
" \"location_type\",\n",
"]\n",
"\n",
"from ray.ml.preprocessors import Chain, OrdinalEncoder, SimpleImputer\n",
"\n",
"imputer = SimpleImputer(categorical_features, strategy=\"most_frequent\")\n",
"encoder = OrdinalEncoder(columns=categorical_features)\n",
"chained_preprocessor = Chain(imputer, encoder)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SqPGGd6bEn0x"
},
"source": [
"## Train an XGBoost model using a Ray AIR Trainer\n",
"Ray AIR provides a variety of [Trainers](https://docs.ray.io/en/latest/ray-air/getting-started.html#trainer) that are integrated with popular machine learning frameworks. You can train a distributed model at scale with Ray by calling `trainer.fit()`. The output is a Ray AIR [Checkpoint](https://docs.ray.io/en/latest/ray-air/getting-started.html#module-ray.ml.checkpoint), which carries the trained model and fitted preprocessor from training to prediction. Let's take a look!"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 670
},
"id": "995W14MdFmxl",
"outputId": "417c1188-edf6-4310-dba8-366b71d77806"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ray/anaconda3/lib/python3.8/site-packages/xgboost_ray/main.py:131: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n",
" XGBOOST_LOOSE_VERSION = LooseVersion(xgboost_version)\n",
"E0602 14:26:17.861773834 4511 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers\n",
"/home/ray/anaconda3/lib/python3.8/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
},
{
"data": {
"text/html": [
"== Status ==<br>Current time: 2022-06-02 14:26:33 (running for 00:00:14.95)<br>Memory usage on this node: 3.5/30.6 GiB<br>Using FIFO scheduling algorithm.<br>Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/18.04 GiB heap, 0.0/9.02 GiB objects<br>Result logdir: /home/ray/ray_results/XGBoostTrainer_2022-06-02_14-26-17<br>Number of trials: 1/1 (1 TERMINATED)<br><table>\n",
"<thead>\n",
"<tr><th>Trial name </th><th>status </th><th>loc </th><th style=\"text-align: right;\"> iter</th><th style=\"text-align: right;\"> total time (s)</th><th style=\"text-align: right;\"> train-logloss</th><th style=\"text-align: right;\"> train-error</th><th style=\"text-align: right;\"> validation-logloss</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>XGBoostTrainer_a3a2c_00000</td><td>TERMINATED</td><td>172.31.71.98:12634</td><td style=\"text-align: right;\"> 100</td><td style=\"text-align: right;\"> 11.9561</td><td style=\"text-align: right;\"> 0.0578837</td><td style=\"text-align: right;\"> 0.0127019</td><td style=\"text-align: right;\"> 0.225994</td></tr>\n",
"</tbody>\n",
"</table><br><br>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(GBDTTrainable pid=12634) 2022-06-02 14:26:23,018\tINFO main.py:980 -- [RayXGBoost] Created 1 new actors (1 total actors). Waiting until actors are ready for training.\n",
"(GBDTTrainable pid=12634) 2022-06-02 14:26:25,230\tINFO main.py:1025 -- [RayXGBoost] Starting XGBoost training.\n",
"(GBDTTrainable pid=12634) E0602 14:26:25.231635524 12691 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers\n",
"(_RemoteRayXGBoostActor pid=12769) [14:26:25] task [xgboost.ray]:139712002042896 got new rank 0\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Result for XGBoostTrainer_a3a2c_00000:\n",
" date: 2022-06-02_14-26-26\n",
" done: false\n",
" experiment_id: 14b63d641b8e4583b3551b1af113ec6d\n",
" hostname: ip-172-31-71-98\n",
" iterations_since_restore: 1\n",
" node_ip: 172.31.71.98\n",
" pid: 12634\n",
" should_checkpoint: true\n",
" time_since_restore: 5.432286262512207\n",
" time_this_iter_s: 5.432286262512207\n",
" time_total_s: 5.432286262512207\n",
" timestamp: 1654205186\n",
" timesteps_since_restore: 0\n",
" train-error: 0.09502400698384984\n",
" train-logloss: 0.5147884634112437\n",
" training_iteration: 1\n",
" trial_id: a3a2c_00000\n",
" validation-error: 0.1627094972067039\n",
" validation-logloss: 0.5557328870197414\n",
" warmup_time: 0.004194974899291992\n",
" \n",
"Result for XGBoostTrainer_a3a2c_00000:\n",
" date: 2022-06-02_14-26-31\n",
" done: false\n",
" experiment_id: 14b63d641b8e4583b3551b1af113ec6d\n",
" hostname: ip-172-31-71-98\n",
" iterations_since_restore: 81\n",
" node_ip: 172.31.71.98\n",
" pid: 12634\n",
" should_checkpoint: true\n",
" time_since_restore: 10.460175275802612\n",
" time_this_iter_s: 0.03925156593322754\n",
" time_total_s: 10.460175275802612\n",
" timestamp: 1654205191\n",
" timesteps_since_restore: 0\n",
" train-error: 0.01802706241815801\n",
" train-logloss: 0.07058723671627426\n",
" training_iteration: 81\n",
" trial_id: a3a2c_00000\n",
" validation-error: 0.0824022346368715\n",
" validation-logloss: 0.22556984905196217\n",
" warmup_time: 0.004194974899291992\n",
" \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"(GBDTTrainable pid=12634) 2022-06-02 14:26:32,801\tINFO main.py:1516 -- [RayXGBoost] Finished XGBoost training on training data with total N=22,910 in 9.81 seconds (7.57 pure XGBoost training time).\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Result for XGBoostTrainer_a3a2c_00000:\n",
" date: 2022-06-02_14-26-32\n",
" done: true\n",
" experiment_id: 14b63d641b8e4583b3551b1af113ec6d\n",
" experiment_tag: '0'\n",
" hostname: ip-172-31-71-98\n",
" iterations_since_restore: 100\n",
" node_ip: 172.31.71.98\n",
" pid: 12634\n",
" should_checkpoint: true\n",
" time_since_restore: 11.956074953079224\n",
" time_this_iter_s: 0.022900104522705078\n",
" time_total_s: 11.956074953079224\n",
" timestamp: 1654205192\n",
" timesteps_since_restore: 0\n",
" train-error: 0.01270187690964644\n",
" train-logloss: 0.05788368908741939\n",
" training_iteration: 100\n",
" trial_id: a3a2c_00000\n",
" validation-error: 0.0825768156424581\n",
" validation-logloss: 0.22599449588356768\n",
" warmup_time: 0.004194974899291992\n",
" \n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-06-02 14:26:33,481\tINFO tune.py:752 -- Total run time: 15.62 seconds (14.94 seconds for the tuning loop).\n"
]
},
{
"data": {
"text/plain": [
"'checkpoint'"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"LABEL = \"loan_status\"\n",
"CHECKPOINT_PATH = \"checkpoint\"\n",
"NUM_WORKERS = 1 # Change this based on the resources in the cluster.\n",
"\n",
"\n",
"from ray.ml.train.integrations.xgboost import XGBoostTrainer\n",
"params = {\n",
" \"tree_method\": \"approx\",\n",
" \"objective\": \"binary:logistic\",\n",
" \"eval_metric\": [\"logloss\", \"error\"],\n",
"}\n",
"\n",
"trainer = XGBoostTrainer(\n",
" scaling_config={\n",
" \"num_workers\": NUM_WORKERS,\n",
" \"use_gpu\": False,\n",
" },\n",
" label_column=LABEL,\n",
" params=params,\n",
" datasets={\"train\": train_ds, \"validation\": validation_ds},\n",
" preprocessor=chained_preprocessor,\n",
" num_boost_round=100,\n",
")\n",
"checkpoint = trainer.fit().checkpoint\n",
"# This saves the checkpoint onto disk\n",
"checkpoint.to_directory(CHECKPOINT_PATH)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QJr73gOvKBza"
},
"source": [
"## Inference\n",
"Now, from the Checkpoint object we obtained in the previous section, we can construct a Ray Air [Predictor](https://docs.ray.io/en/latest/ray-air/getting-started.html#predictors) that encapsulates everything needed for inference.\n",
"\n",
"The Predictor API is just as simple: call `Predictor.predict()` on the input data."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"id": "wE8ielQlKYSi"
},
"outputs": [],
"source": [
"from ray.ml.checkpoint import Checkpoint\n",
"from ray.ml.predictors.integrations.xgboost import XGBoostPredictor\n",
"predictor = XGBoostPredictor.from_checkpoint(Checkpoint.from_directory(CHECKPOINT_PATH))"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"id": "K9Y9UiD3KqSW"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_4511/2153939661.py:23: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.\n",
"Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n",
" loan_request_df = pd.DataFrame.from_dict(loan_request_dict, dtype=np.float)\n",
"/tmp/ipykernel_4511/2153939661.py:23: FutureWarning: Could not cast to float64, falling back to object. This behavior is deprecated. In a future version, when a dtype is passed to 'DataFrame', either all columns will be cast to that dtype, or a TypeError will be raised.\n",
" loan_request_df = pd.DataFrame.from_dict(loan_request_dict, dtype=np.float)\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>person_age</th>\n",
" <th>person_income</th>\n",
" <th>person_home_ownership</th>\n",
" <th>person_emp_length</th>\n",
" <th>loan_intent</th>\n",
" <th>loan_amnt</th>\n",
" <th>loan_int_rate</th>\n",
" <th>state</th>\n",
" <th>population</th>\n",
" <th>location_type</th>\n",
" <th>...</th>\n",
" <th>tax_returns_filed</th>\n",
" <th>student_loan_due</th>\n",
" <th>missed_payments_1y</th>\n",
" <th>hard_pulls</th>\n",
" <th>mortgage_due</th>\n",
" <th>bankruptcies</th>\n",
" <th>credit_card_due</th>\n",
" <th>missed_payments_2y</th>\n",
" <th>missed_payments_6m</th>\n",
" <th>vehicle_loan_due</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>133.0</td>\n",
" <td>59000.0</td>\n",
" <td>RENT</td>\n",
" <td>123.0</td>\n",
" <td>PERSONAL</td>\n",
" <td>35000.0</td>\n",
" <td>16.02</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1 rows × 22 columns</p>\n",
"</div>"
],
"text/plain": [
" person_age person_income person_home_ownership person_emp_length \\\n",
"0 133.0 59000.0 RENT 123.0 \n",
"\n",
" loan_intent loan_amnt loan_int_rate state population location_type \\\n",
"0 PERSONAL 35000.0 16.02 NaN NaN NaN \n",
"\n",
" ... tax_returns_filed student_loan_due missed_payments_1y hard_pulls \\\n",
"0 ... NaN NaN NaN NaN \n",
"\n",
" mortgage_due bankruptcies credit_card_due missed_payments_2y \\\n",
"0 NaN NaN NaN NaN \n",
"\n",
" missed_payments_6m vehicle_loan_due \n",
"0 NaN NaN \n",
"\n",
"[1 rows x 22 columns]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import numpy as np\n",
"# Now let's run a prediction.\n",
"loan_request_dict = {\n",
" \"zipcode\": [76104],\n",
" \"dob_ssn\": [\"19630621_4278\"],\n",
" \"person_age\": [133],\n",
" \"person_income\": [59000],\n",
" \"person_home_ownership\": [\"RENT\"],\n",
" \"person_emp_length\": [123.0],\n",
" \"loan_intent\": [\"PERSONAL\"],\n",
" \"loan_amnt\": [35000],\n",
" \"loan_int_rate\": [16.02],\n",
"}\n",
"\n",
"# Now augment the request with online features.\n",
"zipcode = loan_request_dict[\"zipcode\"][0]\n",
"dob_ssn = loan_request_dict[\"dob_ssn\"][0]\n",
"online_features = fs.get_online_features(\n",
" entity_rows=[{\"zipcode\": zipcode, \"dob_ssn\": dob_ssn}],\n",
" features=feast_features,\n",
").to_dict()\n",
"loan_request_dict.update(online_features)\n",
"# `np.float` is a deprecated alias for the builtin `float`; use `float` directly.\n",
"loan_request_df = pd.DataFrame.from_dict(loan_request_dict, dtype=float)\n",
"loan_request_df = loan_request_df.drop([\"zipcode\", \"dob_ssn\"], axis=1)\n",
"display(loan_request_df)"
]
},
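{
"cell_type": "markdown",
"metadata": {},
"source": [
"The merge step in the previous cell works because both the request and the online feature response map column names to lists of values. Here is a small sketch of that merge in plain Python. The `online` dictionary is a hypothetical stand-in for the feature store response (its feature values are made up), and no feature store is contacted.\n",
"\n",
"```python\n",
"request = {'zipcode': [76104], 'dob_ssn': ['19630621_4278'], 'loan_amnt': [35000]}\n",
"\n",
"# Hypothetical stand-in for the dictionary returned by the feature store.\n",
"online = {\n",
"    'zipcode': [76104],\n",
"    'dob_ssn': ['19630621_4278'],\n",
"    'city': ['FORT WORTH'],\n",
"    'hard_pulls': [1],\n",
"}\n",
"\n",
"merged = dict(request)\n",
"# Keys present in both dictionaries take the online value.\n",
"merged.update(online)\n",
"\n",
"# The entity keys are only used for the lookup, so drop them before prediction.\n",
"model_input = {k: v for k, v in merged.items() if k not in ('zipcode', 'dob_ssn')}\n",
"print(sorted(model_input))\n",
"```\n",
"\n",
"Each value stays a one-element list, so the merged dictionary converts to a DataFrame with a single row."
]
},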
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"id": "eS7_n1GPLL1e"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loan rejected!\n"
]
}
],
"source": [
"# Run the request through our predictor using the `Predictor.predict()` API.\n",
"loan_result = np.round(predictor.predict(loan_request_df)[\"predictions\"][0])\n",
"\n",
"if loan_result == 0:\n",
" print(\"Loan approved!\")\n",
"elif loan_result == 1:\n",
" print(\"Loan rejected!\")"
]
}
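,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rounding the prediction turns a score between 0 and 1 into a 0/1 label, which amounts to a cutoff at 0.5. The sketch below makes that cutoff explicit so it can be tuned. The names `decide` and `threshold` are hypothetical, and reading the score as the probability of class 1 is an assumption based on the `binary:logistic` objective used in training.\n",
"\n",
"```python\n",
"def decide(probability, threshold=0.5):\n",
"    # Label 1 maps to rejection and label 0 to approval, as in the cell above.\n",
"    if probability > threshold:\n",
"        return 'Loan rejected!'\n",
"    return 'Loan approved!'\n",
"\n",
"\n",
"# Hypothetical scores, for illustration only.\n",
"for p in (0.12, 0.87):\n",
"    print(p, decide(p))\n",
"```\n",
"\n",
"Raising `threshold` approves more requests, and lowering it rejects more; the right value depends on the relative cost of each kind of mistake."
]
}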
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "air + feast",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}