Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Integrate Ray AIR with Feast feature store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install feast==0.20.1 \"ray[air]>=1.13\" xgboost_ray"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "INyNIaeB1Kza"
   },
   "source": [
    "This example shows how to use Ray AIR with the Feast feature store, using historical features to train a model and online features for inference.\n",
    "\n",
    "The task is adapted from the [Feast credit scoring tutorial](https://github.com/feast-dev/feast-aws-credit-scoring-tutorial). We train an XGBoost model and run prediction on an incoming loan request to see whether it is approved or rejected."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sBC9CCrpzQLF"
   },
   "source": [
    "Let's first set up our workspace and prepare the data to work with."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "id": "DcPIskZlzSal"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "WORKING_DIR = os.path.expanduser(\"~/ray-air-feast-example/\")\n",
    "%env WORKING_DIR=$WORKING_DIR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "BcyCKjV3zTCK",
    "outputId": "afdfa24d-e5ce-49db-c904-e961e1eb910c"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2022-06-02 14:22:50-- https://github.com/ray-project/air-sample-data/raw/main/air-feast-example.zip\n",
      "Resolving github.com (github.com)... 192.30.255.113\n",
      "Connecting to github.com (github.com)|192.30.255.113|:443... connected.\n",
      "HTTP request sent, awaiting response... 302 Found\n",
      "Location: https://raw.githubusercontent.com/ray-project/air-sample-data/main/air-feast-example.zip [following]\n",
      "--2022-06-02 14:22:50-- https://raw.githubusercontent.com/ray-project/air-sample-data/main/air-feast-example.zip\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 23715107 (23M) [application/zip]\n",
      "Saving to: ‘air-feast-example.zip’\n",
      "\n",
      "air-feast-example.z 100%[===================>] 22.62M 114MB/s in 0.2s \n",
      "\n",
      "2022-06-02 14:22:51 (114 MB/s) - ‘air-feast-example.zip’ saved [23715107/23715107]\n",
      "\n",
      "Archive: air-feast-example.zip\n",
      " creating: air-feast-example/\n",
      " creating: air-feast-example/feature_repo/\n",
      " inflating: air-feast-example/feature_repo/.DS_Store \n",
      " extracting: air-feast-example/feature_repo/__init__.py \n",
      " inflating: air-feast-example/feature_repo/features.py \n",
      " creating: air-feast-example/feature_repo/data/\n",
      " inflating: air-feast-example/feature_repo/data/.DS_Store \n",
      " inflating: air-feast-example/feature_repo/data/credit_history_sample.csv \n",
      " inflating: air-feast-example/feature_repo/data/zipcode_table_sample.csv \n",
      " inflating: air-feast-example/feature_repo/data/credit_history.parquet \n",
      " inflating: air-feast-example/feature_repo/data/zipcode_table.parquet \n",
      " inflating: air-feast-example/feature_repo/feature_store.yaml \n",
      " inflating: air-feast-example/.DS_Store \n",
      " creating: air-feast-example/data/\n",
      " inflating: air-feast-example/data/loan_table.parquet \n",
      " inflating: air-feast-example/data/loan_table_sample.csv \n"
     ]
    }
   ],
   "source": [
    "! mkdir -p $WORKING_DIR\n",
    "! wget --no-check-certificate https://github.com/ray-project/air-sample-data/raw/main/air-feast-example.zip\n",
    "! unzip air-feast-example.zip \n",
    "! mv air-feast-example/* $WORKING_DIR\n",
    "%cd $WORKING_DIR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "iNbC-Qqi3Lq_",
    "outputId": "99576086-12dd-4f96-fb51-de40b77b15ce"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "data feature_repo\n"
     ]
    }
   ],
   "source": [
    "! ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "c_3wlEus4dYO"
   },
   "source": [
    "A feature repository is already set up in `feature_repo/`. You don't need to create a new one, but if you want to, run `feast init -t local feature_repo`.\n",
    "\n",
    "Now let's take a look at the schema in the Feast feature store, which is defined by `feature_repo/features.py`. It contains two feature views, `zipcode_features` and `credit_history`, both generated from Parquet files: `feature_repo/data/zipcode_table.parquet` and `feature_repo/data/credit_history.parquet`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "5VGLhPLLzlGW",
    "outputId": "a3f3499e-c140-4ceb-a66d-2f1a6b8a2142"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36mdatetime\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m timedelta\n",
      "\n",
      "\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36mfeast\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m (Entity, Field, FeatureView, FileSource, ValueType)\n",
      "\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36mfeast\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mtypes\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m Float32, Int64, String\n",
      "\n",
      "\n",
      "zipcode = Entity(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mzipcode\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, value_type=Int64)\n",
      "\n",
      "zipcode_source = FileSource(\n",
      "    path=\u001b[33m\"\u001b[39;49;00m\u001b[33mfeature_repo/data/zipcode_table.parquet\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      "    timestamp_field=\u001b[33m\"\u001b[39;49;00m\u001b[33mevent_timestamp\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      "    created_timestamp_column=\u001b[33m\"\u001b[39;49;00m\u001b[33mcreated_timestamp\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      ")\n",
      "\n",
      "zipcode_features = FeatureView(\n",
      "    name=\u001b[33m\"\u001b[39;49;00m\u001b[33mzipcode_features\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      "    entities=[\u001b[33m\"\u001b[39;49;00m\u001b[33mzipcode\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m],\n",
      "    ttl=timedelta(days=\u001b[34m3650\u001b[39;49;00m),\n",
      "    schema=[\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mcity\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=String),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mstate\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=String),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mlocation_type\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=String),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mtax_returns_filed\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mpopulation\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mtotal_wages\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "    ],\n",
      "    source=zipcode_source,\n",
      ")\n",
      "\n",
      "dob_ssn = Entity(\n",
      "    name=\u001b[33m\"\u001b[39;49;00m\u001b[33mdob_ssn\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      "    value_type=ValueType.STRING,\n",
      "    description=\u001b[33m\"\u001b[39;49;00m\u001b[33mDate of birth and last four digits of social security number\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      ")\n",
      "\n",
      "credit_history_source = FileSource(\n",
      "    path=\u001b[33m\"\u001b[39;49;00m\u001b[33mfeature_repo/data/credit_history.parquet\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      "    timestamp_field=\u001b[33m\"\u001b[39;49;00m\u001b[33mevent_timestamp\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      "    created_timestamp_column=\u001b[33m\"\u001b[39;49;00m\u001b[33mcreated_timestamp\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      ")\n",
      "\n",
      "credit_history = FeatureView(\n",
      "    name=\u001b[33m\"\u001b[39;49;00m\u001b[33mcredit_history\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m,\n",
      "    entities=[\u001b[33m\"\u001b[39;49;00m\u001b[33mdob_ssn\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m],\n",
      "    ttl=timedelta(days=\u001b[34m90\u001b[39;49;00m),\n",
      "    schema=[\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mcredit_card_due\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mmortgage_due\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mstudent_loan_due\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mvehicle_loan_due\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mhard_pulls\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mmissed_payments_2y\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mmissed_payments_1y\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mmissed_payments_6m\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "        Field(name=\u001b[33m\"\u001b[39;49;00m\u001b[33mbankruptcies\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, dtype=Int64),\n",
      "    ],\n",
      "    source=credit_history_source,\n",
      ")\n"
     ]
    }
   ],
   "source": [
    "!pygmentize feature_repo/features.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HQmrfEV33_SM"
   },
   "source": [
    "Deploy the feature store defined above by running `feast apply` from within the `feature_repo/` folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "SbL_EbMC2MFS",
    "outputId": "13b07f1f-d52a-4c4e-a73f-f5478c0304de"
   },
   "outputs": [],
   "source": [
    "! cd feature_repo && feast apply"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "D-9Kr-kdzg1r",
    "outputId": "beabd2f7-c1a6-4fe3-b087-b35b7b9ab56b"
   },
   "outputs": [],
   "source": [
    "import feast\n",
    "fs = feast.FeatureStore(repo_path=\"feature_repo\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5nR8uNE8z-YQ"
   },
   "source": [
    "## Generate training data\n",
    "On top of the features in Feast, we also have labeled training data at `data/loan_table.parquet`. At training time, the loan table is passed to Feast as an entity dataframe for training data generation. Feast performs a point-in-time join of the `credit_history` and `zipcode_features` feature views onto it, producing feature vectors that augment the training data."
   ]
  },
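  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The join Feast performs on the entity dataframe is point-in-time: each loan row should only see feature values recorded at or before its own `event_timestamp`. As a rough illustration of that idea (not Feast's implementation), the sketch below uses `pandas.merge_asof`. The toy frames and their values are made up; only the column names mirror this tutorial, and the 90-day tolerance is an assumption that loosely stands in for the `ttl` of the `credit_history` feature view.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data; values are invented for illustration only.\n",
    "loans = pd.DataFrame(\n",
    "    {\n",
    "        \"dob_ssn\": [\"a\", \"a\", \"b\"],\n",
    "        \"event_timestamp\": pd.to_datetime(\n",
    "            [\"2021-01-10\", \"2021-03-01\", \"2021-02-01\"], utc=True\n",
    "        ),\n",
    "        \"loan_amnt\": [1000, 2000, 3000],\n",
    "    }\n",
    ")\n",
    "credit = pd.DataFrame(\n",
    "    {\n",
    "        \"dob_ssn\": [\"a\", \"a\", \"b\"],\n",
    "        \"event_timestamp\": pd.to_datetime(\n",
    "            [\"2021-01-01\", \"2021-02-15\", \"2021-03-01\"], utc=True\n",
    "        ),\n",
    "        \"hard_pulls\": [1, 2, 9],\n",
    "    }\n",
    ")\n",
    "\n",
    "# merge_asof requires both frames to be sorted by the \"on\" key.\n",
    "joined = pd.merge_asof(\n",
    "    loans.sort_values(\"event_timestamp\"),\n",
    "    credit.sort_values(\"event_timestamp\"),\n",
    "    on=\"event_timestamp\",\n",
    "    by=\"dob_ssn\",  # match rows for the same entity only\n",
    "    direction=\"backward\",  # latest feature row at or before the loan\n",
    "    tolerance=pd.Timedelta(days=90),  # ignore feature rows older than this\n",
    ")\n",
    "print(joined)\n",
    "```\n",
    "\n",
    "In this toy data the loan for entity `b` gets no `hard_pulls` value, because its only credit record is timestamped after the loan. Leaving it empty instead of using the later record is what keeps future information out of the training set."
   ]
  },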
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 424
    },
    "id": "twBCJMzVzV0X",
    "outputId": "efb41c7a-2802-4169-906e-7db1c37d8c8e"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       " .dataframe tbody tr th:only-of-type {\n",
       " vertical-align: middle;\n",
       " }\n",
       "\n",
       " .dataframe tbody tr th {\n",
       " vertical-align: top;\n",
       " }\n",
       "\n",
       " .dataframe thead th {\n",
       " text-align: right;\n",
       " }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       " <thead>\n",
       " <tr style=\"text-align: right;\">\n",
       " <th></th>\n",
       " <th>loan_id</th>\n",
       " <th>dob_ssn</th>\n",
       " <th>zipcode</th>\n",
       " <th>person_age</th>\n",
       " <th>person_income</th>\n",
       " <th>person_home_ownership</th>\n",
       " <th>person_emp_length</th>\n",
       " <th>loan_intent</th>\n",
       " <th>loan_amnt</th>\n",
       " <th>loan_int_rate</th>\n",
       " <th>loan_status</th>\n",
       " <th>event_timestamp</th>\n",
       " <th>created_timestamp</th>\n",
       " </tr>\n",
       " </thead>\n",
       " <tbody>\n",
       " <tr>\n",
       " <th>0</th>\n",
       " <td>10000</td>\n",
       " <td>19530219_5179</td>\n",
       " <td>76104</td>\n",
       " <td>22</td>\n",
       " <td>59000</td>\n",
       " <td>RENT</td>\n",
       " <td>123.0</td>\n",
       " <td>PERSONAL</td>\n",
       " <td>35000</td>\n",
       " <td>16.02</td>\n",
       " <td>1</td>\n",
       " <td>2021-08-25 20:34:41.361000+00:00</td>\n",
       " <td>2021-08-25 20:34:41.361000+00:00</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>1</th>\n",
       " <td>10001</td>\n",
       " <td>19520816_8737</td>\n",
       " <td>70380</td>\n",
       " <td>21</td>\n",
       " <td>9600</td>\n",
       " <td>OWN</td>\n",
       " <td>5.0</td>\n",
       " <td>EDUCATION</td>\n",
       " <td>1000</td>\n",
       " <td>11.14</td>\n",
       " <td>0</td>\n",
       " <td>2021-08-25 20:16:20.128000+00:00</td>\n",
       " <td>2021-08-25 20:16:20.128000+00:00</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>2</th>\n",
       " <td>10002</td>\n",
       " <td>19860413_2537</td>\n",
       " <td>97039</td>\n",
       " <td>25</td>\n",
       " <td>9600</td>\n",
       " <td>MORTGAGE</td>\n",
       " <td>1.0</td>\n",
       " <td>MEDICAL</td>\n",
       " <td>5500</td>\n",
       " <td>12.87</td>\n",
       " <td>1</td>\n",
       " <td>2021-08-25 19:57:58.896000+00:00</td>\n",
       " <td>2021-08-25 19:57:58.896000+00:00</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>3</th>\n",
       " <td>10003</td>\n",
       " <td>19760701_8090</td>\n",
       " <td>63785</td>\n",
       " <td>23</td>\n",
       " <td>65500</td>\n",
       " <td>RENT</td>\n",
       " <td>4.0</td>\n",
       " <td>MEDICAL</td>\n",
       " <td>35000</td>\n",
       " <td>15.23</td>\n",
       " <td>1</td>\n",
       " <td>2021-08-25 19:39:37.663000+00:00</td>\n",
       " <td>2021-08-25 19:39:37.663000+00:00</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>4</th>\n",
       " <td>10004</td>\n",
       " <td>19830125_8297</td>\n",
       " <td>82223</td>\n",
       " <td>24</td>\n",
       " <td>54400</td>\n",
       " <td>RENT</td>\n",
       " <td>8.0</td>\n",
       " <td>MEDICAL</td>\n",
       " <td>35000</td>\n",
       " <td>14.27</td>\n",
       " <td>1</td>\n",
       " <td>2021-08-25 19:21:16.430000+00:00</td>\n",
       " <td>2021-08-25 19:21:16.430000+00:00</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>...</th>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>28633</th>\n",
       " <td>38633</td>\n",
       " <td>19491126_1487</td>\n",
       " <td>43205</td>\n",
       " <td>57</td>\n",
       " <td>53000</td>\n",
       " <td>MORTGAGE</td>\n",
       " <td>1.0</td>\n",
       " <td>PERSONAL</td>\n",
       " <td>5800</td>\n",
       " <td>13.16</td>\n",
       " <td>0</td>\n",
       " <td>2020-08-25 21:48:06.292000+00:00</td>\n",
       " <td>2020-08-25 21:48:06.292000+00:00</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>28634</th>\n",
       " <td>38634</td>\n",
       " <td>19681208_6537</td>\n",
       " <td>24872</td>\n",
       " <td>54</td>\n",
       " <td>120000</td>\n",
       " <td>MORTGAGE</td>\n",
       " <td>4.0</td>\n",
       " <td>PERSONAL</td>\n",
       " <td>17625</td>\n",
       " <td>7.49</td>\n",
       " <td>0</td>\n",
       " <td>2020-08-25 21:29:45.059000+00:00</td>\n",
       " <td>2020-08-25 21:29:45.059000+00:00</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>28635</th>\n",
       " <td>38635</td>\n",
       " <td>19880422_2592</td>\n",
       " <td>68826</td>\n",
       " <td>65</td>\n",
       " <td>76000</td>\n",
       " <td>RENT</td>\n",
       " <td>3.0</td>\n",
       " <td>HOMEIMPROVEMENT</td>\n",
       " <td>35000</td>\n",
       " <td>10.99</td>\n",
       " <td>1</td>\n",
       " <td>2020-08-25 21:11:23.826000+00:00</td>\n",
       " <td>2020-08-25 21:11:23.826000+00:00</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>28636</th>\n",
       " <td>38636</td>\n",
       " <td>19901017_6108</td>\n",
       " <td>92014</td>\n",
       " <td>56</td>\n",
       " <td>150000</td>\n",
       " <td>MORTGAGE</td>\n",
       " <td>5.0</td>\n",
       " <td>PERSONAL</td>\n",
       " <td>15000</td>\n",
       " <td>11.48</td>\n",
       " <td>0</td>\n",
       " <td>2020-08-25 20:53:02.594000+00:00</td>\n",
       " <td>2020-08-25 20:53:02.594000+00:00</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>28637</th>\n",
       " <td>38637</td>\n",
       " <td>19960703_3449</td>\n",
       " <td>69033</td>\n",
       " <td>66</td>\n",
       " <td>42000</td>\n",
       " <td>RENT</td>\n",
       " <td>2.0</td>\n",
       " <td>MEDICAL</td>\n",
       " <td>6475</td>\n",
       " <td>9.99</td>\n",
       " <td>0</td>\n",
       " <td>2020-08-25 20:34:41.361000+00:00</td>\n",
       " <td>2020-08-25 20:34:41.361000+00:00</td>\n",
       " </tr>\n",
       " </tbody>\n",
       "</table>\n",
       "<p>28638 rows × 13 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       " loan_id dob_ssn zipcode person_age person_income \\\n",
       "0 10000 19530219_5179 76104 22 59000 \n",
       "1 10001 19520816_8737 70380 21 9600 \n",
       "2 10002 19860413_2537 97039 25 9600 \n",
       "3 10003 19760701_8090 63785 23 65500 \n",
       "4 10004 19830125_8297 82223 24 54400 \n",
       "... ... ... ... ... ... \n",
       "28633 38633 19491126_1487 43205 57 53000 \n",
       "28634 38634 19681208_6537 24872 54 120000 \n",
       "28635 38635 19880422_2592 68826 65 76000 \n",
       "28636 38636 19901017_6108 92014 56 150000 \n",
       "28637 38637 19960703_3449 69033 66 42000 \n",
       "\n",
       " person_home_ownership person_emp_length loan_intent loan_amnt \\\n",
       "0 RENT 123.0 PERSONAL 35000 \n",
       "1 OWN 5.0 EDUCATION 1000 \n",
       "2 MORTGAGE 1.0 MEDICAL 5500 \n",
       "3 RENT 4.0 MEDICAL 35000 \n",
       "4 RENT 8.0 MEDICAL 35000 \n",
       "... ... ... ... ... \n",
       "28633 MORTGAGE 1.0 PERSONAL 5800 \n",
       "28634 MORTGAGE 4.0 PERSONAL 17625 \n",
       "28635 RENT 3.0 HOMEIMPROVEMENT 35000 \n",
       "28636 MORTGAGE 5.0 PERSONAL 15000 \n",
       "28637 RENT 2.0 MEDICAL 6475 \n",
       "\n",
       " loan_int_rate loan_status event_timestamp \\\n",
       "0 16.02 1 2021-08-25 20:34:41.361000+00:00 \n",
       "1 11.14 0 2021-08-25 20:16:20.128000+00:00 \n",
       "2 12.87 1 2021-08-25 19:57:58.896000+00:00 \n",
       "3 15.23 1 2021-08-25 19:39:37.663000+00:00 \n",
       "4 14.27 1 2021-08-25 19:21:16.430000+00:00 \n",
       "... ... ... ... \n",
       "28633 13.16 0 2020-08-25 21:48:06.292000+00:00 \n",
       "28634 7.49 0 2020-08-25 21:29:45.059000+00:00 \n",
       "28635 10.99 1 2020-08-25 21:11:23.826000+00:00 \n",
       "28636 11.48 0 2020-08-25 20:53:02.594000+00:00 \n",
       "28637 9.99 0 2020-08-25 20:34:41.361000+00:00 \n",
       "\n",
       " created_timestamp \n",
       "0 2021-08-25 20:34:41.361000+00:00 \n",
       "1 2021-08-25 20:16:20.128000+00:00 \n",
       "2 2021-08-25 19:57:58.896000+00:00 \n",
       "3 2021-08-25 19:39:37.663000+00:00 \n",
       "4 2021-08-25 19:21:16.430000+00:00 \n",
       "... ... \n",
       "28633 2020-08-25 21:48:06.292000+00:00 \n",
       "28634 2020-08-25 21:29:45.059000+00:00 \n",
       "28635 2020-08-25 21:11:23.826000+00:00 \n",
       "28636 2020-08-25 20:53:02.594000+00:00 \n",
       "28637 2020-08-25 20:34:41.361000+00:00 \n",
       "\n",
       "[28638 rows x 13 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "loan_df = pd.read_parquet(\"data/loan_table.parquet\")\n",
    "display(loan_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "id": "hAHAb2nS6ClR"
   },
   "outputs": [],
   "source": [
    "feast_features = [\n",
    "    \"zipcode_features:city\",\n",
    "    \"zipcode_features:state\",\n",
    "    \"zipcode_features:location_type\",\n",
    "    \"zipcode_features:tax_returns_filed\",\n",
    "    \"zipcode_features:population\",\n",
    "    \"zipcode_features:total_wages\",\n",
    "    \"credit_history:credit_card_due\",\n",
    "    \"credit_history:mortgage_due\",\n",
    "    \"credit_history:student_loan_due\",\n",
    "    \"credit_history:vehicle_loan_due\",\n",
    "    \"credit_history:hard_pulls\",\n",
    "    \"credit_history:missed_payments_2y\",\n",
    "    \"credit_history:missed_payments_1y\",\n",
    "    \"credit_history:missed_payments_6m\",\n",
    "    \"credit_history:bankruptcies\",\n",
    "]\n",
    "\n",
    "loan_w_offline_feature = fs.get_historical_features(\n",
    "    entity_df=loan_df, features=feast_features\n",
    ").to_df()\n",
    "\n",
    "# Drop some unnecessary columns for simplicity\n",
    "loan_w_offline_feature = loan_w_offline_feature.drop([\"event_timestamp\", \"created_timestamp__\", \"loan_id\", \"zipcode\", \"dob_ssn\"], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TxDFT1776XgP"
   },
   "source": [
    "Now let's take a look at the training data as it is augmented by Feast."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 540
    },
    "id": "s-w2696D6h78",
    "outputId": "f98c1dbe-1916-41bc-d543-590bac08caf5"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       " .dataframe tbody tr th:only-of-type {\n",
       " vertical-align: middle;\n",
       " }\n",
       "\n",
       " .dataframe tbody tr th {\n",
       " vertical-align: top;\n",
       " }\n",
       "\n",
       " .dataframe thead th {\n",
       " text-align: right;\n",
       " }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       " <thead>\n",
       " <tr style=\"text-align: right;\">\n",
       " <th></th>\n",
       " <th>person_age</th>\n",
       " <th>person_income</th>\n",
       " <th>person_home_ownership</th>\n",
       " <th>person_emp_length</th>\n",
       " <th>loan_intent</th>\n",
       " <th>loan_amnt</th>\n",
       " <th>loan_int_rate</th>\n",
       " <th>loan_status</th>\n",
       " <th>city</th>\n",
       " <th>state</th>\n",
       " <th>...</th>\n",
       " <th>total_wages</th>\n",
       " <th>credit_card_due</th>\n",
       " <th>mortgage_due</th>\n",
       " <th>student_loan_due</th>\n",
       " <th>vehicle_loan_due</th>\n",
       " <th>hard_pulls</th>\n",
       " <th>missed_payments_2y</th>\n",
       " <th>missed_payments_1y</th>\n",
       " <th>missed_payments_6m</th>\n",
       " <th>bankruptcies</th>\n",
       " </tr>\n",
       " </thead>\n",
       " <tbody>\n",
       " <tr>\n",
       " <th>1358886</th>\n",
       " <td>55</td>\n",
       " <td>24543</td>\n",
       " <td>RENT</td>\n",
       " <td>3.0</td>\n",
       " <td>VENTURE</td>\n",
       " <td>4000</td>\n",
       " <td>13.92</td>\n",
       " <td>0</td>\n",
       " <td>SLIDELL</td>\n",
       " <td>LA</td>\n",
       " <td>...</td>\n",
       " <td>315061217</td>\n",
       " <td>1777</td>\n",
       " <td>690650</td>\n",
       " <td>46372</td>\n",
       " <td>10439</td>\n",
       " <td>5</td>\n",
       " <td>1</td>\n",
       " <td>2</td>\n",
       " <td>1</td>\n",
       " <td>0</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>1358815</th>\n",
       " <td>58</td>\n",
       " <td>20000</td>\n",
       " <td>RENT</td>\n",
       " <td>0.0</td>\n",
       " <td>EDUCATION</td>\n",
       " <td>4000</td>\n",
       " <td>9.99</td>\n",
       " <td>0</td>\n",
       " <td>CHOUTEAU</td>\n",
       " <td>OK</td>\n",
       " <td>...</td>\n",
       " <td>59412230</td>\n",
       " <td>1791</td>\n",
       " <td>462670</td>\n",
       " <td>19421</td>\n",
       " <td>3583</td>\n",
       " <td>8</td>\n",
       " <td>7</td>\n",
       " <td>1</td>\n",
       " <td>0</td>\n",
       " <td>2</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>1353348</th>\n",
       " <td>64</td>\n",
       " <td>24000</td>\n",
       " <td>RENT</td>\n",
       " <td>1.0</td>\n",
       " <td>MEDICAL</td>\n",
       " <td>3000</td>\n",
       " <td>6.99</td>\n",
       " <td>0</td>\n",
       " <td>BISMARCK</td>\n",
       " <td>ND</td>\n",
       " <td>...</td>\n",
       " <td>469621263</td>\n",
       " <td>5917</td>\n",
       " <td>1780959</td>\n",
       " <td>11835</td>\n",
       " <td>27910</td>\n",
       " <td>8</td>\n",
       " <td>3</td>\n",
       " <td>2</td>\n",
       " <td>1</td>\n",
       " <td>0</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>1354200</th>\n",
       " <td>55</td>\n",
       " <td>34000</td>\n",
       " <td>RENT</td>\n",
       " <td>0.0</td>\n",
       " <td>DEBTCONSOLIDATION</td>\n",
       " <td>12000</td>\n",
       " <td>6.92</td>\n",
       " <td>1</td>\n",
       " <td>SANTA BARBARA</td>\n",
       " <td>CA</td>\n",
       " <td>...</td>\n",
       " <td>24537583</td>\n",
       " <td>8091</td>\n",
       " <td>364271</td>\n",
       " <td>30248</td>\n",
       " <td>22640</td>\n",
       " <td>2</td>\n",
       " <td>7</td>\n",
       " <td>3</td>\n",
       " <td>0</td>\n",
       " <td>0</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>1354271</th>\n",
       " <td>51</td>\n",
       " <td>74628</td>\n",
       " <td>MORTGAGE</td>\n",
       " <td>3.0</td>\n",
       " <td>PERSONAL</td>\n",
       " <td>3000</td>\n",
       " <td>13.49</td>\n",
       " <td>0</td>\n",
       " <td>HUNTINGTON BEACH</td>\n",
       " <td>CA</td>\n",
       " <td>...</td>\n",
       " <td>19749601</td>\n",
       " <td>3679</td>\n",
       " <td>1659968</td>\n",
       " <td>37582</td>\n",
       " <td>20284</td>\n",
       " <td>0</td>\n",
       " <td>1</td>\n",
       " <td>0</td>\n",
       " <td>0</td>\n",
       " <td>0</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>...</th>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " <td>...</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>674285</th>\n",
       " <td>23</td>\n",
       " <td>74000</td>\n",
       " <td>RENT</td>\n",
       " <td>3.0</td>\n",
       " <td>MEDICAL</td>\n",
       " <td>25000</td>\n",
       " <td>10.36</td>\n",
       " <td>1</td>\n",
       " <td>MANSFIELD</td>\n",
       " <td>MO</td>\n",
       " <td>...</td>\n",
       " <td>33180988</td>\n",
       " <td>5176</td>\n",
       " <td>1089963</td>\n",
       " <td>44642</td>\n",
       " <td>2877</td>\n",
       " <td>1</td>\n",
       " <td>6</td>\n",
       " <td>1</td>\n",
       " <td>0</td>\n",
       " <td>0</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>668250</th>\n",
       " <td>21</td>\n",
       " <td>200000</td>\n",
       " <td>MORTGAGE</td>\n",
       " <td>2.0</td>\n",
       " <td>DEBTCONSOLIDATION</td>\n",
       " <td>25000</td>\n",
       " <td>13.99</td>\n",
       " <td>0</td>\n",
       " <td>SALISBURY</td>\n",
       " <td>MD</td>\n",
       " <td>...</td>\n",
       " <td>470634058</td>\n",
       " <td>5297</td>\n",
       " <td>1288915</td>\n",
       " <td>22471</td>\n",
       " <td>22630</td>\n",
       " <td>0</td>\n",
       " <td>5</td>\n",
       " <td>2</td>\n",
       " <td>1</td>\n",
       " <td>0</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>668321</th>\n",
       " <td>24</td>\n",
       " <td>200000</td>\n",
       " <td>MORTGAGE</td>\n",
       " <td>3.0</td>\n",
       " <td>VENTURE</td>\n",
       " <td>24000</td>\n",
       " <td>7.49</td>\n",
       " <td>0</td>\n",
       " <td>STRUNK</td>\n",
       " <td>KY</td>\n",
       " <td>...</td>\n",
       " <td>10067358</td>\n",
       " <td>6549</td>\n",
       " <td>22399</td>\n",
       " <td>11806</td>\n",
       " <td>13005</td>\n",
       " <td>0</td>\n",
       " <td>1</td>\n",
       " <td>0</td>\n",
       " <td>0</td>\n",
       " <td>0</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>670025</th>\n",
       " <td>23</td>\n",
       " <td>215000</td>\n",
       " <td>MORTGAGE</td>\n",
       " <td>7.0</td>\n",
       " <td>MEDICAL</td>\n",
       " <td>35000</td>\n",
       " <td>14.79</td>\n",
       " <td>0</td>\n",
       " <td>HAWTHORN</td>\n",
       " <td>PA</td>\n",
       " <td>...</td>\n",
       " <td>5956835</td>\n",
       " <td>9079</td>\n",
       " <td>876038</td>\n",
       " <td>4556</td>\n",
       " <td>21588</td>\n",
       " <td>0</td>\n",
       " <td>1</td>\n",
       " <td>0</td>\n",
       " <td>0</td>\n",
       " <td>0</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>2034006</th>\n",
       " <td>22</td>\n",
       " <td>59000</td>\n",
       " <td>RENT</td>\n",
       " <td>123.0</td>\n",
       " <td>PERSONAL</td>\n",
       " <td>35000</td>\n",
       " <td>16.02</td>\n",
       " <td>1</td>\n",
       " <td>FORT WORTH</td>\n",
       " <td>TX</td>\n",
       " <td>...</td>\n",
       " <td>142325465</td>\n",
       " <td>8419</td>\n",
       " <td>91803</td>\n",
       " <td>22328</td>\n",
       " <td>15078</td>\n",
       " <td>0</td>\n",
       " <td>1</td>\n",
       " <td>0</td>\n",
       " <td>0</td>\n",
       " <td>0</td>\n",
       " </tr>\n",
       " </tbody>\n",
       "</table>\n",
       "<p>28638 rows × 23 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       " person_age person_income person_home_ownership person_emp_length \\\n",
       "1358886 55 24543 RENT 3.0 \n",
       "1358815 58 20000 RENT 0.0 \n",
       "1353348 64 24000 RENT 1.0 \n",
       "1354200 55 34000 RENT 0.0 \n",
       "1354271 51 74628 MORTGAGE 3.0 \n",
       "... ... ... ... ... \n",
       "674285 23 74000 RENT 3.0 \n",
       "668250 21 200000 MORTGAGE 2.0 \n",
       "668321 24 200000 MORTGAGE 3.0 \n",
       "670025 23 215000 MORTGAGE 7.0 \n",
       "2034006 22 59000 RENT 123.0 \n",
       "\n",
       " loan_intent loan_amnt loan_int_rate loan_status \\\n",
       "1358886 VENTURE 4000 13.92 0 \n",
       "1358815 EDUCATION 4000 9.99 0 \n",
       "1353348 MEDICAL 3000 6.99 0 \n",
       "1354200 DEBTCONSOLIDATION 12000 6.92 1 \n",
       "1354271 PERSONAL 3000 13.49 0 \n",
       "... ... ... ... ... \n",
       "674285 MEDICAL 25000 10.36 1 \n",
       "668250 DEBTCONSOLIDATION 25000 13.99 0 \n",
       "668321 VENTURE 24000 7.49 0 \n",
       "670025 MEDICAL 35000 14.79 0 \n",
       "2034006 PERSONAL 35000 16.02 1 \n",
       "\n",
       " city state ... total_wages credit_card_due \\\n",
       "1358886 SLIDELL LA ... 315061217 1777 \n",
       "1358815 CHOUTEAU OK ... 59412230 1791 \n",
       "1353348 BISMARCK ND ... 469621263 5917 \n",
       "1354200 SANTA BARBARA CA ... 24537583 8091 \n",
       "1354271 HUNTINGTON BEACH CA ... 19749601 3679 \n",
       "... ... ... ... ... ... \n",
       "674285 MANSFIELD MO ... 33180988 5176 \n",
       "668250 SALISBURY MD ... 470634058 5297 \n",
       "668321 STRUNK KY ... 10067358 6549 \n",
       "670025 HAWTHORN PA ... 5956835 9079 \n",
       "2034006 FORT WORTH TX ... 142325465 8419 \n",
       "\n",
       " mortgage_due student_loan_due vehicle_loan_due hard_pulls \\\n",
       "1358886 690650 46372 10439 5 \n",
       "1358815 462670 19421 3583 8 \n",
       "1353348 1780959 11835 27910 8 \n",
       "1354200 364271 30248 22640 2 \n",
       "1354271 1659968 37582 20284 0 \n",
       "... ... ... ... ... \n",
       "674285 1089963 44642 2877 1 \n",
       "668250 1288915 22471 22630 0 \n",
       "668321 22399 11806 13005 0 \n",
       "670025 876038 4556 21588 0 \n",
       "2034006 91803 22328 15078 0 \n",
       "\n",
       " missed_payments_2y missed_payments_1y missed_payments_6m \\\n",
       "1358886 1 2 1 \n",
       "1358815 7 1 0 \n",
       "1353348 3 2 1 \n",
       "1354200 7 3 0 \n",
       "1354271 1 0 0 \n",
       "... ... ... ... \n",
       "674285 6 1 0 \n",
       "668250 5 2 1 \n",
       "668321 1 0 0 \n",
       "670025 1 0 0 \n",
       "2034006 1 0 0 \n",
       "\n",
       " bankruptcies \n",
       "1358886 0 \n",
       "1358815 2 \n",
       "1353348 0 \n",
       "1354200 0 \n",
       "1354271 0 \n",
       "... ... \n",
       "674285 0 \n",
       "668250 0 \n",
       "668321 0 \n",
       "670025 0 \n",
       "2034006 0 \n",
       "\n",
       "[28638 rows x 23 columns]"
      ]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"display(loan_w_offline_feature)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "KpiOm00789Rd"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Convert into Train and Validation datasets.\n",
|
||
"import ray\n",
|
||
"\n",
|
||
"loan_ds = ray.data.from_pandas(loan_w_offline_feature)\n",
|
||
"train_ds, validation_ds = loan_ds.split_proportionately([0.8])\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "uFNp9sUCEGc2"
|
||
},
|
||
"source": [
|
||
"## Define Preprocessors\n",
|
||
"\n",
|
||
"A [Preprocessor](https://docs.ray.io/en/latest/ray-air/getting-started.html#preprocessors) performs last-mile processing on Ray Datasets before they are fed into the training model."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 22,
|
||
"metadata": {
|
||
"id": "qravnrt9EBuW"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"categorical_features = [\n",
|
||
" \"person_home_ownership\",\n",
|
||
" \"loan_intent\",\n",
|
||
" \"city\",\n",
|
||
" \"state\",\n",
|
||
" \"location_type\",\n",
|
||
"]\n",
|
||
"\n",
|
||
"from ray.ml.preprocessors import Chain, OrdinalEncoder, SimpleImputer\n",
|
||
"\n",
|
||
"imputer = SimpleImputer(categorical_features, strategy=\"most_frequent\")\n",
|
||
"encoder = OrdinalEncoder(columns=categorical_features)\n",
|
||
"chained_preprocessor = Chain(imputer, encoder)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "SqPGGd6bEn0x"
|
||
},
|
||
"source": [
|
||
"## Train XGBoost model using Ray AIR Trainer\n",
|
||
"Ray AIR provides a variety of [Trainers](https://docs.ray.io/en/latest/ray-air/getting-started.html#trainer) that are integrated with popular machine learning frameworks. You can train a distributed model at scale with Ray by calling the `trainer.fit()` API. The output is a Ray AIR [Checkpoint](https://docs.ray.io/en/latest/ray-air/getting-started.html#module-ray.ml.checkpoint), which lets you move seamlessly from training to prediction. Let's take a look!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 23,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 670
|
||
},
|
||
"id": "995W14MdFmxl",
|
||
"outputId": "417c1188-edf6-4310-dba8-366b71d77806"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"/home/ray/anaconda3/lib/python3.8/site-packages/xgboost_ray/main.py:131: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n",
|
||
" XGBOOST_LOOSE_VERSION = LooseVersion(xgboost_version)\n",
|
||
"E0602 14:26:17.861773834 4511 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers\n",
|
||
"/home/ray/anaconda3/lib/python3.8/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
|
||
" from .autonotebook import tqdm as notebook_tqdm\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"== Status ==<br>Current time: 2022-06-02 14:26:33 (running for 00:00:14.95)<br>Memory usage on this node: 3.5/30.6 GiB<br>Using FIFO scheduling algorithm.<br>Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/18.04 GiB heap, 0.0/9.02 GiB objects<br>Result logdir: /home/ray/ray_results/XGBoostTrainer_2022-06-02_14-26-17<br>Number of trials: 1/1 (1 TERMINATED)<br><table>\n",
|
||
"<thead>\n",
|
||
"<tr><th>Trial name </th><th>status </th><th>loc </th><th style=\"text-align: right;\"> iter</th><th style=\"text-align: right;\"> total time (s)</th><th style=\"text-align: right;\"> train-logloss</th><th style=\"text-align: right;\"> train-error</th><th style=\"text-align: right;\"> validation-logloss</th></tr>\n",
|
||
"</thead>\n",
|
||
"<tbody>\n",
|
||
"<tr><td>XGBoostTrainer_a3a2c_00000</td><td>TERMINATED</td><td>172.31.71.98:12634</td><td style=\"text-align: right;\"> 100</td><td style=\"text-align: right;\"> 11.9561</td><td style=\"text-align: right;\"> 0.0578837</td><td style=\"text-align: right;\"> 0.0127019</td><td style=\"text-align: right;\"> 0.225994</td></tr>\n",
|
||
"</tbody>\n",
|
||
"</table><br><br>"
|
||
],
|
||
"text/plain": [
|
||
"<IPython.core.display.HTML object>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"(GBDTTrainable pid=12634) 2022-06-02 14:26:23,018\tINFO main.py:980 -- [RayXGBoost] Created 1 new actors (1 total actors). Waiting until actors are ready for training.\n",
|
||
"(GBDTTrainable pid=12634) 2022-06-02 14:26:25,230\tINFO main.py:1025 -- [RayXGBoost] Starting XGBoost training.\n",
|
||
"(GBDTTrainable pid=12634) E0602 14:26:25.231635524 12691 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers\n",
|
||
"(_RemoteRayXGBoostActor pid=12769) [14:26:25] task [xgboost.ray]:139712002042896 got new rank 0\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Result for XGBoostTrainer_a3a2c_00000:\n",
|
||
" date: 2022-06-02_14-26-26\n",
|
||
" done: false\n",
|
||
" experiment_id: 14b63d641b8e4583b3551b1af113ec6d\n",
|
||
" hostname: ip-172-31-71-98\n",
|
||
" iterations_since_restore: 1\n",
|
||
" node_ip: 172.31.71.98\n",
|
||
" pid: 12634\n",
|
||
" should_checkpoint: true\n",
|
||
" time_since_restore: 5.432286262512207\n",
|
||
" time_this_iter_s: 5.432286262512207\n",
|
||
" time_total_s: 5.432286262512207\n",
|
||
" timestamp: 1654205186\n",
|
||
" timesteps_since_restore: 0\n",
|
||
" train-error: 0.09502400698384984\n",
|
||
" train-logloss: 0.5147884634112437\n",
|
||
" training_iteration: 1\n",
|
||
" trial_id: a3a2c_00000\n",
|
||
" validation-error: 0.1627094972067039\n",
|
||
" validation-logloss: 0.5557328870197414\n",
|
||
" warmup_time: 0.004194974899291992\n",
|
||
" \n",
|
||
"Result for XGBoostTrainer_a3a2c_00000:\n",
|
||
" date: 2022-06-02_14-26-31\n",
|
||
" done: false\n",
|
||
" experiment_id: 14b63d641b8e4583b3551b1af113ec6d\n",
|
||
" hostname: ip-172-31-71-98\n",
|
||
" iterations_since_restore: 81\n",
|
||
" node_ip: 172.31.71.98\n",
|
||
" pid: 12634\n",
|
||
" should_checkpoint: true\n",
|
||
" time_since_restore: 10.460175275802612\n",
|
||
" time_this_iter_s: 0.03925156593322754\n",
|
||
" time_total_s: 10.460175275802612\n",
|
||
" timestamp: 1654205191\n",
|
||
" timesteps_since_restore: 0\n",
|
||
" train-error: 0.01802706241815801\n",
|
||
" train-logloss: 0.07058723671627426\n",
|
||
" training_iteration: 81\n",
|
||
" trial_id: a3a2c_00000\n",
|
||
" validation-error: 0.0824022346368715\n",
|
||
" validation-logloss: 0.22556984905196217\n",
|
||
" warmup_time: 0.004194974899291992\n",
|
||
" \n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"(GBDTTrainable pid=12634) 2022-06-02 14:26:32,801\tINFO main.py:1516 -- [RayXGBoost] Finished XGBoost training on training data with total N=22,910 in 9.81 seconds (7.57 pure XGBoost training time).\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Result for XGBoostTrainer_a3a2c_00000:\n",
|
||
" date: 2022-06-02_14-26-32\n",
|
||
" done: true\n",
|
||
" experiment_id: 14b63d641b8e4583b3551b1af113ec6d\n",
|
||
" experiment_tag: '0'\n",
|
||
" hostname: ip-172-31-71-98\n",
|
||
" iterations_since_restore: 100\n",
|
||
" node_ip: 172.31.71.98\n",
|
||
" pid: 12634\n",
|
||
" should_checkpoint: true\n",
|
||
" time_since_restore: 11.956074953079224\n",
|
||
" time_this_iter_s: 0.022900104522705078\n",
|
||
" time_total_s: 11.956074953079224\n",
|
||
" timestamp: 1654205192\n",
|
||
" timesteps_since_restore: 0\n",
|
||
" train-error: 0.01270187690964644\n",
|
||
" train-logloss: 0.05788368908741939\n",
|
||
" training_iteration: 100\n",
|
||
" trial_id: a3a2c_00000\n",
|
||
" validation-error: 0.0825768156424581\n",
|
||
" validation-logloss: 0.22599449588356768\n",
|
||
" warmup_time: 0.004194974899291992\n",
|
||
" \n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"2022-06-02 14:26:33,481\tINFO tune.py:752 -- Total run time: 15.62 seconds (14.94 seconds for the tuning loop).\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"'checkpoint'"
|
||
]
|
||
},
|
||
"execution_count": 23,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"LABEL = \"loan_status\"\n",
|
||
"CHECKPOINT_PATH = \"checkpoint\"\n",
|
||
"NUM_WORKERS = 1 # Change this based on the resources in the cluster.\n",
|
||
"\n",
|
||
"\n",
|
||
"from ray.ml.train.integrations.xgboost import XGBoostTrainer\n",
|
||
"from ray.air.config import ScalingConfig\n",
|
||
"params = {\n",
|
||
" \"tree_method\": \"approx\",\n",
|
||
" \"objective\": \"binary:logistic\",\n",
|
||
" \"eval_metric\": [\"logloss\", \"error\"],\n",
|
||
"}\n",
|
||
"\n",
|
||
"trainer = XGBoostTrainer(\n",
|
||
" scaling_config=ScalingConfig(\n",
|
||
" num_workers=NUM_WORKERS,\n",
|
||
" use_gpu=False,\n",
|
||
" ),\n",
|
||
" label_column=LABEL,\n",
|
||
" params=params,\n",
|
||
" datasets={\"train\": train_ds, \"validation\": validation_ds},\n",
|
||
" preprocessor=chained_preprocessor,\n",
|
||
" num_boost_round=100,\n",
|
||
")\n",
|
||
"checkpoint = trainer.fit().checkpoint\n",
|
||
"# This saves the checkpoint onto disk\n",
|
||
"checkpoint.to_directory(CHECKPOINT_PATH)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "QJr73gOvKBza"
|
||
},
|
||
"source": [
|
||
"## Inference\n",
|
||
"Using the Checkpoint object obtained in the previous section, we can now construct a Ray AIR [Predictor](https://docs.ray.io/en/latest/ray-air/getting-started.html#predictors) that encapsulates everything needed for inference.\n",
|
||
"\n",
|
||
"The Predictor API is just as simple: call `Predictor.predict()`."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 24,
|
||
"metadata": {
|
||
"id": "wE8ielQlKYSi"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"from ray.ml.checkpoint import Checkpoint\n",
|
||
"from ray.ml.predictors.integrations.xgboost import XGBoostPredictor\n",
|
||
"predictor = XGBoostPredictor.from_checkpoint(Checkpoint.from_directory(CHECKPOINT_PATH))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 25,
|
||
"metadata": {
|
||
"id": "K9Y9UiD3KqSW"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"/tmp/ipykernel_4511/2153939661.py:23: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.\n",
|
||
"Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n",
|
||
" loan_request_df = pd.DataFrame.from_dict(loan_request_dict, dtype=np.float)\n",
|
||
"/tmp/ipykernel_4511/2153939661.py:23: FutureWarning: Could not cast to float64, falling back to object. This behavior is deprecated. In a future version, when a dtype is passed to 'DataFrame', either all columns will be cast to that dtype, or a TypeError will be raised.\n",
|
||
" loan_request_df = pd.DataFrame.from_dict(loan_request_dict, dtype=np.float)\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>person_age</th>\n",
|
||
" <th>person_income</th>\n",
|
||
" <th>person_home_ownership</th>\n",
|
||
" <th>person_emp_length</th>\n",
|
||
" <th>loan_intent</th>\n",
|
||
" <th>loan_amnt</th>\n",
|
||
" <th>loan_int_rate</th>\n",
|
||
" <th>state</th>\n",
|
||
" <th>population</th>\n",
|
||
" <th>location_type</th>\n",
|
||
" <th>...</th>\n",
|
||
" <th>tax_returns_filed</th>\n",
|
||
" <th>student_loan_due</th>\n",
|
||
" <th>missed_payments_1y</th>\n",
|
||
" <th>hard_pulls</th>\n",
|
||
" <th>mortgage_due</th>\n",
|
||
" <th>bankruptcies</th>\n",
|
||
" <th>credit_card_due</th>\n",
|
||
" <th>missed_payments_2y</th>\n",
|
||
" <th>missed_payments_6m</th>\n",
|
||
" <th>vehicle_loan_due</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>133.0</td>\n",
|
||
" <td>59000.0</td>\n",
|
||
" <td>RENT</td>\n",
|
||
" <td>123.0</td>\n",
|
||
" <td>PERSONAL</td>\n",
|
||
" <td>35000.0</td>\n",
|
||
" <td>16.02</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>...</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"<p>1 rows × 22 columns</p>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" person_age person_income person_home_ownership person_emp_length \\\n",
|
||
"0 133.0 59000.0 RENT 123.0 \n",
|
||
"\n",
|
||
" loan_intent loan_amnt loan_int_rate state population location_type \\\n",
|
||
"0 PERSONAL 35000.0 16.02 NaN NaN NaN \n",
|
||
"\n",
|
||
" ... tax_returns_filed student_loan_due missed_payments_1y hard_pulls \\\n",
|
||
"0 ... NaN NaN NaN NaN \n",
|
||
"\n",
|
||
" mortgage_due bankruptcies credit_card_due missed_payments_2y \\\n",
|
||
"0 NaN NaN NaN NaN \n",
|
||
"\n",
|
||
" missed_payments_6m vehicle_loan_due \n",
|
||
"0 NaN NaN \n",
|
||
"\n",
|
||
"[1 rows x 22 columns]"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"import numpy as np\n",
|
||
"# Now let's run a prediction.\n",
|
||
"loan_request_dict = {\n",
|
||
" \"zipcode\": [76104],\n",
|
||
" \"dob_ssn\": [\"19630621_4278\"],\n",
|
||
" \"person_age\": [133],\n",
|
||
" \"person_income\": [59000],\n",
|
||
" \"person_home_ownership\": [\"RENT\"],\n",
|
||
" \"person_emp_length\": [123.0],\n",
|
||
" \"loan_intent\": [\"PERSONAL\"],\n",
|
||
" \"loan_amnt\": [35000],\n",
|
||
" \"loan_int_rate\": [16.02],\n",
|
||
"}\n",
|
||
"\n",
|
||
"# Now augment the request with online features.\n",
|
||
"zipcode = loan_request_dict[\"zipcode\"][0]\n",
|
||
"dob_ssn = loan_request_dict[\"dob_ssn\"][0]\n",
|
||
"online_features = fs.get_online_features(\n",
|
||
" entity_rows=[{\"zipcode\": zipcode, \"dob_ssn\": dob_ssn}],\n",
|
||
" features=feast_features,\n",
|
||
").to_dict()\n",
|
||
"loan_request_dict.update(online_features)\n",
|
||
"loan_request_df = pd.DataFrame.from_dict(loan_request_dict, dtype=float)\n",
|
||
"loan_request_df = loan_request_df.drop([\"zipcode\", \"dob_ssn\"], axis=1)\n",
|
||
"display(loan_request_df)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 27,
|
||
"metadata": {
|
||
"id": "eS7_n1GPLL1e"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Loan rejected!\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Run the request through our predictor using the `Predictor.predict()` API.\n",
|
||
"loan_result = np.round(predictor.predict(loan_request_df)[\"predictions\"][0])\n",
|
||
"\n",
|
||
"if loan_result == 0:\n",
|
||
" print(\"Loan approved!\")\n",
|
||
"elif loan_result == 1:\n",
|
||
" print(\"Loan rejected!\")"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"colab": {
|
||
"collapsed_sections": [],
|
||
"name": "air + feast",
|
||
"provenance": []
|
||
},
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.8.5"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 4
|
||
}
|