Deploying an XGBoost Model

In this article, you will learn how to deploy an XGBoost model on the Platform.

This example uses the Lending Club dataset from Kaggle to illustrate the Platform’s deployed API functionality. The purpose of the model is to identify the loans that are going to default, a classification problem for which XGBoost is well-suited.

In a nutshell, XGBoost is a distributed gradient boosting library that provides a fast, parallel tree boosting algorithm. For more details, see the XGBoost documentation.

The Business Use Case

Lending Club is a peer-to-peer online lending platform where individuals can get approved for loans. These loans are broken down into $25 notes that can be purchased by investors on the Lending Club platform.

Before purchasing notes, investors have access to a variety of information about the loan (such as loan purpose, amount, interest rate, etc.) and the credit history of the borrower (income, delinquencies, home ownership, number of credit lines opened, etc.). The purpose of the model in this example is to predict the probability that a given loan will default before reaching maturity. Investors can then avoid these bad notes and focus on the ones less likely to default.

Loading the Data and Training the Model

In a Jupyter notebook (Python 2 session) on the Platform, start by loading the data and training the model.

First, load the data from the public S3 bucket using the s3_pull_file() function defined on the Connect to Data Sources page. Supply the AWS keys stored in your project’s environment variables. You will also need these libraries:

# Platform Kernels: Python 2,3
# Snippet Libraries: xgboost==0.6, boto3==1.4.4, pandas==0.20.3

import xgboost as xgb
import sys
import boto3
import os
import pickle as pkl
import pandas as pd

Next, load the data in this cell:

# Pull the pickled demo dataset from S3 using the credentials stored in
# your project's environment variables, then load it from disk.
s3_creds = {'access_key': os.environ['YOUR_AWS_ACCESS_KEY'],
            'secret_key': os.environ['YOUR_AWS_SECRET_KEY']}
s3_pull_file('ds-site-static-assets', 'ds-examples/loan-risk/data/demo_data.p', './demo_data.p', s3_creds)
loan_data = pkl.load(open('demo_data.p', 'rb'))
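
For reference, here is a minimal sketch of what s3_pull_file() could look like, assuming it simply downloads an object with boto3 and the credentials dictionary shown above; the canonical definition is on the Connect to Data Sources page:

# Minimal sketch of s3_pull_file(), assuming a plain boto3 download; the
# authoritative version is defined on the Connect to Data Sources page.
def s3_pull_file(bucket, key, local_path, creds):
    client = boto3.client('s3',
                          aws_access_key_id=creds['access_key'],
                          aws_secret_access_key=creds['secret_key'])
    client.download_file(bucket, key, local_path)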

We’ve already manipulated the data and performed one-hot encoding of the categorical features (a sketch of that encoding step appears after the next cell). You can explore the training features and response variables by executing the following cell:

print(loan_data['X_train'].head())
print(loan_data['y_train'])
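
The one-hot encoding itself was performed upstream. A sketch of how it might have been produced with pandas, where raw_loans and its categorical column names are hypothetical stand-ins for the actual preprocessing:

# Hypothetical sketch of the upstream one-hot encoding; `raw_loans` and the
# column names are illustrative, not part of this example's code.
encoded = pd.get_dummies(raw_loans,
                         columns=['term', 'purpose', 'addr_state', 'home_ownership'])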

For simplicity, assume you have already found the best hyperparameters for your model. Then train the XGBoost model with that set of hyperparameters:

# Wrap the training features and labels in XGBoost's DMatrix format.
train_data = xgb.DMatrix(loan_data['X_train'], label=loan_data['y_train'])
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
num_round = 2
bst = xgb.train(param, train_data, num_round)
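
In practice you would search for these values first. A minimal sketch using XGBoost’s built-in cross-validation, where the fold count and AUC metric are illustrative choices:

# Hypothetical hyperparameter check: 5-fold cross-validation with AUC.
cv_results = xgb.cv(param, train_data, num_boost_round=num_round,
                    nfold=5, metrics='auc', seed=42)
print(cv_results)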

Now, test the predict function with the first ten entries of the training set:

tmp_test_data = xgb.DMatrix(loan_data['X_train'].head(10))
preds = bst.predict(tmp_test_data)
preds

The results you should see are as follows:

array([ 0.25122303,  0.4519583 ,  0.25122303,  0.15333922,  0.1026786 ,
        0.15333922,  0.1026786 ,  0.25122303,  0.25122303,  0.25122303],
      dtype=float32)

You’ve now trained an XGBoost model and can use its predict() function on a loan dataset. This mimics what happens in production: you receive information about a loan and predict its default probability with a pre-trained model.
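
If you need hard accept/reject flags rather than probabilities, you can threshold the scores; the 0.5 cutoff below is purely illustrative:

# Illustrative thresholding: flag loans whose predicted default
# probability exceeds 0.5.
risky = preds > 0.5
print(risky)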

Next, serialize the model and write it to disk using the pickle library. The model is converted into a byte stream that can later be read back with pickle’s load() method.

# let's write to disk a serialized version of our xgboost model:
pkl.dump(bst, open('xgb_model.pkl','wb'))

The model is saved to disk.
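
Optionally, verify that the serialized model round-trips correctly before deploying it:

# Sanity check: reload the pickled model and confirm it reproduces the
# predictions computed earlier.
bst_reloaded = pkl.load(open('xgb_model.pkl', 'rb'))
assert (bst_reloaded.predict(tmp_test_data) == preds).all()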

Deploying the Model

In a script file named xgboost_model_api.py, write the function api_predict(data) that you want to deploy (shown below). Remember that the data passed as an argument to the function arrives in JSON format, which is why the function converts it from JSON to a DataFrame.

# Content of the file 'xgboost_model_api.py'

import xgboost as xgb
import pandas as pd
import os
import sys
import pickle as pkl

# Let's load the pickled xgboost model:
xgb_model_api = pkl.load(open('xgb_model.pkl','rb'))

def api_predict(data):
    """This function takes the loan data in a JSON format
    and returns the probability of default based on an XGBoost
    model.

    Parameters
    ----------

    data: JSON data structure containing the loan data.

    Returns
    -------

    A dictionary with the default probabilities. The keys correspond to the
    loan IDs.
    """

    # Convert the JSON payload to a DataFrame:
    json_2_df = pd.DataFrame.from_dict(data, orient='index')

    # Re-order the columns to match the column order of the training dataset.
    # This is important: internally, XGBoost transforms the dataset into the
    # sparse libsvm format, which pairs each value with a feature index rather
    # than a feature name. Consequently, preserving feature order is essential.
    json_2_df = json_2_df[[u'loan_amnt', u'int_rate', u'dti', u'annual_inc', u'delinq_2yrs',
       u'open_acc', u'revol_util', u'term_ 36 months', u'term_ 60 months',
       u'purpose_car', u'purpose_credit_card', u'purpose_debt_consolidation',
       u'purpose_educational', u'purpose_home_improvement', u'purpose_house',
       u'purpose_major_purchase', u'purpose_medical', u'purpose_moving',
       u'purpose_other', u'purpose_renewable_energy',
       u'purpose_small_business', u'purpose_vacation', u'purpose_wedding',
       u'addr_state_AK', u'addr_state_AL', u'addr_state_AR', u'addr_state_AZ',
       u'addr_state_CA', u'addr_state_CO', u'addr_state_CT', u'addr_state_DC',
       u'addr_state_DE', u'addr_state_FL', u'addr_state_GA', u'addr_state_HI',
       u'addr_state_IA', u'addr_state_ID', u'addr_state_IL', u'addr_state_IN',
       u'addr_state_KS', u'addr_state_KY', u'addr_state_LA', u'addr_state_MA',
       u'addr_state_MD', u'addr_state_ME', u'addr_state_MI', u'addr_state_MN',
       u'addr_state_MO', u'addr_state_MS', u'addr_state_MT', u'addr_state_NC',
       u'addr_state_ND', u'addr_state_NE', u'addr_state_NH', u'addr_state_NJ',
       u'addr_state_NM', u'addr_state_NV', u'addr_state_NY', u'addr_state_OH',
       u'addr_state_OK', u'addr_state_OR', u'addr_state_PA', u'addr_state_RI',
       u'addr_state_SC', u'addr_state_SD', u'addr_state_TN', u'addr_state_TX',
       u'addr_state_UT', u'addr_state_VA', u'addr_state_VT', u'addr_state_WA',
       u'addr_state_WI', u'addr_state_WV', u'addr_state_WY',
       u'home_ownership_ANY', u'home_ownership_MORTGAGE',
       u'home_ownership_NONE', u'home_ownership_OTHER', u'home_ownership_OWN',
       u'home_ownership_RENT']]

    loan_data_dmatrix = xgb.DMatrix(json_2_df)
    # Use the model's predict() method to score the loans passed to
    # api_predict() for default risk.
    res = xgb_model_api.predict(loan_data_dmatrix)

    # Map each loan ID (the DataFrame index) to its predicted probability.
    return {str(loan_id): pred
            for loan_id, pred in zip(json_2_df.index.tolist(), res.tolist())}

The function api_predict() takes data in JSON format. In the case above, we expect the data to have this particular format:

 {"data":
{"107265":{"loan_amnt":14000.0,"int_rate":17.57,"dti":21.6,"annual_inc":82000.0,"delinq_2yrs":1.0,"open_acc":24.0,"revol_util":43.8,"term_
36 months":1,"term_ 60
months":0,"purpose_car":0,"purpose_credit_card":0,"purpose_debt_consolidation":1,"purpose_educational":0,"purpose_home_improvement":0,"purpose_house":0,
"purpose_major_purchase":0,"purpose_medical":0,"purpose_moving":0,"purpose_other":0,"purpose_renewable_energy":0,"purpose_small_business":0,
"purpose_vacation":0,"purpose_wedding":0,"addr_state_AK":0,"addr_state_AL":0,"addr_state_AR":0,"addr_state_AZ":1,"addr_state_CA":0,
"addr_state_CO":0,"addr_state_CT":0,"addr_state_DC":0,"addr_state_DE":0,"addr_state_FL":0,"addr_state_GA":0,"addr_state_HI":0,
"addr_state_IA":0,"addr_state_ID":0,"addr_state_IL":0,"addr_state_IN":0,"addr_state_KS":0,"addr_state_KY":0,"addr_state_LA":0,
"addr_state_MA":0,"addr_state_MD":0,"addr_state_ME":0,"addr_state_MI":0,"addr_state_MN":0,"addr_state_MO":0,"addr_state_MS":0,
"addr_state_MT":0,"addr_state_NC":0,"addr_state_ND":0,"addr_state_NE":0,"addr_state_NH":0,"addr_state_NJ":0,"addr_state_NM":0,
"addr_state_NV":0,"addr_state_NY":0,"addr_state_OH":0,"addr_state_OK":0,"addr_state_OR":0,"addr_state_PA":0,"addr_state_RI":0,
"addr_state_SC":0,"addr_state_SD":0,"addr_state_TN":0,"addr_state_TX":0,"addr_state_UT":0,"addr_state_VA":0,"addr_state_VT":0,
"addr_state_WA":0,"addr_state_WI":0,"addr_state_WV":0,"addr_state_WY":0,"home_ownership_ANY":0,"home_ownership_MORTGAGE":1,
"home_ownership_NONE":0,"home_ownership_OTHER":0,"home_ownership_OWN":0,"home_ownership_RENT":0}}}

The value of the “data” key is the data structure our function expects. Because of orient='index', the keys of the “data” dictionary (the loan IDs) become the index of the DataFrame.
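
To see what orient='index' does, here is a toy illustration with two hypothetical loans and two features. Note that the resulting column order may vary, which is exactly why the deployed function reorders the columns explicitly:

# Toy illustration of orient='index': dictionary keys become the row index.
toy = {'107265': {'loan_amnt': 14000.0, 'int_rate': 17.57},
       '107266': {'loan_amnt': 5000.0, 'int_rate': 9.99}}
print(pd.DataFrame.from_dict(toy, orient='index'))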

If any dependencies are not part of the environment, create a pip dependency file to capture the libraries your deployed model needs. Name this file api_requirements.txt and put it in the top-level folder of your project. In this case, XGBoost and pandas are needed beyond the standard dependency collection, so the file contains the following dependencies:

xgboost>=0.6
pandas>=0.20.3

These packages need to be installed on the REST API Docker container for your model to work. Once that step is done, make sure you Sync your work with the remote GitHub repository. This generates a new commit ID containing the latest changes.

You’re now ready to deploy the model as a REST API! The first step in this process is to select Deploy an API from the Quick Actions button.

Below is a screenshot of the filled-out Deploy form. Specify the Python script file you used (xgboost_model_api.py) as well as the name of the function you want to deploy (api_predict()). You can select the compute resource size and the environment needed for your script. You can also provide an example dataset to pass to the API; in this case, you may use the JSON example above.

The last step is to specify the dependencies, if any, of your deployed function. This can be done by clicking the Add Requirements option. Include the name of your pip or apt requirements file. We recommend putting that file in the same folder as the Python script containing your deployed function.

[Screenshot: the filled-out Deploy form (deploy-lc-xgboost-1.png)]

Click Deploy, and your function is deployed as a REST API.

Next, take a look at the Versions tab of your API. An example snapshot is below.

[Screenshot: the Versions tab of the deployed API (deploy-lc-xgboost-2.png)]

Conclusion

You now have a deployed function that predicts the default probability of a loan available on the Lending Club platform. You can use this API endpoint in a variety of applications; it can be called using cURL, Python, or Node (a Python sketch follows below).

A colleague can call your model from within their Python code, or a front-end web developer can call the API using Node to build a web-based app that lets institutional investors select loans to purchase.
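
For example, here is a minimal sketch of the Python call using the requests library; the URL and token below are placeholders you would replace with the values shown on your deployed API’s overview page:

import requests

API_URL = 'https://your-platform-host/deploy/your-api-id/'  # placeholder
API_TOKEN = 'your-api-token'                                # placeholder

# `loan_features` must contain the full one-hot encoded feature set shown
# in the JSON example above; it is truncated here for brevity.
loan_features = {'loan_amnt': 14000.0, 'int_rate': 17.57, 'dti': 21.6}

response = requests.post(API_URL,
                         headers={'Authorization': 'Bearer ' + API_TOKEN},
                         json={'data': {'107265': loan_features}})
print(response.json())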