April 15th, 2019
Building AI/ML models on Lending Club Data, with H2O.ai — Part 2
Category: AutoML, Data Journalism, Data Science, H2O Driverless AI
By: Vinod Iyengar, VP of Products and Karthik Guruswamy
In Part 1 of this series, we looked at how to download data from Lending Club using Jupyter/Python and create training and test data sets, after dropping some target-leakage columns. The data preparation code to create the data sets for classification is available on GitHub at: https://git.io/fjTqb
In this blog post, we are going to use H2O-3 AutoML to build a model on the training set and score the test data set, using the Python client library “h2o”. We will extract the features that are important in predicting whether a loan will be “Fully Paid” or “Charged Off”.
First, you need to download the latest Python client library (and also a single instance of H2O-3) from this page:
http://h2o-release.s3.amazonaws.com/h2o/rel-yates/1/index.html
Click “Install in Python” and follow the instructions.
H2O-3 AutoML can run multiple algorithms, perform hyperparameter tuning and cross-validation, build Stacked Ensembles on top of the winning algorithms, and create a self-contained scoring package that can be deployed in production.
Algorithms tried by H2O-3 AutoML (as of version 3.24.0.1):
- DRF — Distributed Random Forest
- GLM — Generalized Linear Model
- XGBoost — XGBoost GBM
- GBM — H2O GBM
- Deep Learning
- Stacked Ensembles of the above
AutoML can be kicked off in open source H2O-3 from either the R or Python interface, or from H2O Flow, a browser-based UI for interactive model building.
For the Lending Club use case, the Python notebook below explains how to connect to an H2O-3 cluster (in the cloud or a local instance), upload the training/test data, and kick off AutoML with some basic parameters. It also explains how to view the composition of the AutoML leader (which is usually a Stacked Ensemble), run variable importance for multiple algorithms in the AutoML leaderboard, and analyze the results. Finally, there is code to predict the outcome, loan_status, for the test data set and analyze the model performance on it.
The Jupyter Python notebook in this blog post is available from GitHub: https://git.io/fjke6
Run Automatic Machine Learning in a few steps:
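As an illustration, a minimal sketch of those steps with the “h2o” Python client might look like the following. The file paths and AutoML settings here are assumptions for illustration; the notebook linked above has the exact code.

```python
import h2o
from h2o.automl import H2OAutoML

# Connect to a local or remote H2O-3 cluster
# (assumes the h2o package is installed per the "Install in Python" instructions above)
h2o.init()

# Load the prepared training and test sets from Part 1 (paths are placeholders)
train = h2o.import_file("lending_club_train.csv")
test = h2o.import_file("lending_club_test.csv")

# loan_status ("Fully Paid" vs "Charged Off") is the binary target
y = "loan_status"
x = [c for c in train.columns if c != y]
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

# Kick off AutoML with a model-count / runtime budget
aml = H2OAutoML(max_models=20, max_runtime_secs=3600, seed=1)
aml.train(x=x, y=y, training_frame=train)
```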
AutoML Performance:
One of the things to observe below is how H2O-3 AutoML ran multiple algorithms such as XGBoost, GLM, Deep Learning, and GBM. Also, the top two models with the highest AUC are Stacked Ensembles built on the rest of the models in the leaderboard.
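For reference, viewing the leaderboard from the Python client takes just a couple of lines (a sketch; `aml` is the H2OAutoML object from the training step above):

```python
# The leaderboard ranks models by AUC for this binomial problem
lb = aml.leaderboard
print(lb.head(rows=lb.nrows))

# The leader (typically a Stacked Ensemble) is available directly
print(aml.leader.model_id)
```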
How to Gain Insights into the Model?
The standardized coefficient magnitudes of the GLM model in the leaderboard give us a sense of what separates a loan getting paid in full from a loan getting charged off/defaulted. The features/attributes in blue are the positive reasons (bar length indicates the order of importance) why a loan gets paid in full, while those in orange can be attributed to the loan defaulting. In summary:
Top 7 factors correlated with a loan getting Fully Paid, in order of importance (looking only at the blue bars):
- 36_months – If the loan term is shorter, like 3 years
- A – If the loan grade is “A”
- total_bc_limit – If the total bank card credit limit is high
- mo_sin_old_rev_tl_op – If many months have passed since the oldest revolving account was opened
- MORTGAGE – Whether the customer had an open home mortgage loan
- total_il_high_credit_limit – Total installment high credit/credit limit (roughly, payments relative to the total credit limit)
- earliest_cr_line – When the first credit line was opened
Top 7 factors correlated with a loan getting Charged Off, in order of importance (looking only at the orange bars):
- int_rate – The interest rate on the loan was high
- 60_months – If the loan term is longer, like 5 years
- <ABC> –
- acc_open_past_24_mnts – A high number of accounts opened in the past 24 months
- dti – The debt-to-income ratio is high
- issue_d – The month/year in which the loan was issued
- RENT – Whether the customer was a renter instead of a homeowner
The model automatically discovers these income/credit/debt characteristics of customers from the data. However, correlation should not be mistaken for causation (causal analysis is beyond the scope of this blog).
In addition to the GLM model, you can also walk through each model in the leaderboard and look at its variable importance; a minimal sketch follows below, and the full code is in the original notebook: https://git.io/fjke6
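This sketch assumes the auto-generated model ids contain the algorithm name; adjust the filtering to whatever your leaderboard actually shows.

```python
# List the model ids on the leaderboard
model_ids = list(aml.leaderboard["model_id"].as_data_frame()["model_id"])

# Standardized coefficient magnitudes for the GLM in the leaderboard
glm = h2o.get_model([m for m in model_ids if "GLM" in m][0])
print(glm.coef_norm())   # standardized coefficients as a dict
glm.std_coef_plot()      # bar chart of standardized coefficient magnitudes discussed above

# Variable importance for a tree-based model (e.g., GBM, XGBoost, DRF)
gbm = h2o.get_model([m for m in model_ids if m.startswith("GBM")][0])
gbm.varimp_plot()
```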
How to Learn More about H2O-3 AutoML?
To learn more about H2O-3 open source AutoML, see Erin LeDell’s YouTube video below:
Summary of Results
As shown above, the final AUC on the test set was 0.729.
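For completeness, a sketch of how the test-set performance and a reusable scoring artifact can be obtained (the output directory is a placeholder):

```python
# Predict loan_status for the test set with the AutoML leader
preds = aml.leader.predict(test)

# Model performance on the held-out test set
perf = aml.leader.model_performance(test)
print(perf.auc())

# Persist the leader as a binary model for later scoring
# (MOJO export is another option, where the model type supports it)
model_path = h2o.save_model(model=aml.leader, path="./lc_automl_leader", force=True)
print(model_path)
```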
The data was a snapshot in time of loans that were still running (some early stage and some late), not necessarily cohorts. In the data preparation phase, we also dropped a lot of columns that were giving away the outcome. The models built are still very useful for understanding the drivers behind the outcome. Using additional H2O-3 APIs, you can download scoring artifacts to productionize the model, as sketched above. So, how do you improve accuracy and get full machine learning interpretability of the final model?
Next Steps
H2O-3 AutoML can help you build models really quickly and understand variable importance with very little effort. Recall that we didn’t do any feature engineering (like one-hot encoding) on the input data at all! In the next blog posts, we will explore how to do the following, in addition to Automatic Machine Learning:
- Automatic Feature Engineering
- Machine Learning Interpretability
with H2O’s commercial product Driverless AI.