April 15th, 2019

Building AI/ML models on Lending Club Data, with H2O.ai — Part 2

Category: AutoML, Data Journalism, Data Science, H2O Driverless AI

In Part 1 of this series, we looked at how to download data from Lending Club using Jupyter/Python and create training and test data sets after dropping some columns that cause target leakage. The data preparation code to create the data sets for classification is available on GitHub at: https://git.io/fjTqb

In this blog post, we are going to use H2O-3 AutoML to build a model on the training set and score it on the test data set, using the Python client library “h2o”. We will also extract the features that are important in predicting whether a loan will be “Fully Paid” or “Charged Off”.

First, you need to download the latest Python client library (and a single instance of H2O-3) from this page:


Click “Install in Python” and follow instructions.

H2O-3 AutoML can run multiple algorithms, tune hyperparameters, perform cross-validation, build Stacked Ensembles on the winning algorithms, and create a self-contained scoring package that can be deployed in production.

Algorithms tried by H2O-3 AutoML (as of version

  • DRF — Distributed Random Forest
  • GLM — Generalized Linear Model
  • XGBoost — XGBoost GBM
  • GBM — H2O GBM
  • Deep Learning
  • Stacked Ensemble of above

AutoML can be kicked off in open source H2O-3 through either the R or the Python interface, or through H2O Flow, a browser UI for interactive model building.

For the Lending Club use case, the Python notebook below explains how to connect to an H2O-3 cluster in the cloud or to a local instance, upload the training and test data, and kick off AutoML with some basic parameters. It also explains how to view the composition of the AutoML leader (which is usually a Stacked Ensemble), run variable importance for multiple algorithms in the AutoML leaderboard, and analyze the results. Finally, there is code to predict the outcome, loan_status, for the test data set and analyze the model’s performance on it.
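The connect, upload, and train steps described above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the function name `run_automl`, the file paths, the 600-second runtime budget, and the seed are all assumptions chosen for illustration. The H2O calls themselves (`h2o.init`, `h2o.connect`, `h2o.import_file`, `asfactor`, `H2OAutoML`) are part of the H2O-3 Python client.

```python
def run_automl(train_path, test_path, target="loan_status",
               max_runtime_secs=600, seed=1, cluster_url=None):
    """Train H2O-3 AutoML on the training file; return (aml, test_frame)."""
    # Imported lazily so that defining this function does not require
    # a running H2O cluster.
    import h2o
    from h2o.automl import H2OAutoML

    # Attach to a remote cluster if a URL is given (hypothetical parameter),
    # otherwise start or attach to a local instance.
    if cluster_url:
        h2o.connect(url=cluster_url)
    else:
        h2o.init()

    train = h2o.import_file(train_path)
    test = h2o.import_file(test_path)

    # Mark the target as categorical so AutoML treats this as classification.
    train[target] = train[target].asfactor()
    test[target] = test[target].asfactor()

    predictors = [c for c in train.columns if c != target]

    # The runtime budget and seed are illustrative defaults, not tuned values.
    aml = H2OAutoML(max_runtime_secs=max_runtime_secs, seed=seed)
    aml.train(x=predictors, y=target, training_frame=train)
    return aml, test
```

A call such as `aml, test = run_automl("train.csv", "test.csv")` (file names hypothetical) would then leave the ranked models in `aml.leaderboard` and the best one in `aml.leader`.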

The Jupyter Python notebook in this blog post is available from GitHub: https://git.io/fjke6

Run Automatic Machine Learning in a few steps:

AutoML Performance:

One thing to observe below is how H2O-3 AutoML ran multiple algorithms such as XGBoost, GLM, Deep Learning, and GBM. Also, the top 2 models with the highest AUC are Stacked Ensembles built on the rest of the models in the leaderboard.
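To see which algorithm families made it onto the leaderboard and how the best model of each compares, one option is to group leaderboard rows by the prefix of the model ID. The sketch below assumes that AutoML model IDs begin with the algorithm name followed by an underscore (for example `XGBoost_...` or `StackedEnsemble_...`); the helper names are hypothetical, and `aml` stands for a trained `H2OAutoML` object.

```python
def algorithm_of(model_id):
    """Return the algorithm family, taken as the first token of the model ID."""
    return model_id.split("_", 1)[0]


def best_auc_per_algorithm(rows):
    """Map each algorithm family to its best (model_id, auc) pair."""
    best = {}
    for model_id, auc in rows:
        algo = algorithm_of(model_id)
        if algo not in best or auc > best[algo][1]:
            best[algo] = (model_id, auc)
    return best


def leaderboard_rows(aml):
    """Pull (model_id, auc) pairs out of a trained H2OAutoML object."""
    # as_data_frame() converts the leaderboard H2OFrame to a pandas DataFrame.
    lb = aml.leaderboard.as_data_frame()
    return list(zip(lb["model_id"], lb["auc"]))
```

With a trained run, `best_auc_per_algorithm(leaderboard_rows(aml))` gives one entry per family, which makes it easy to check how far the Stacked Ensembles are ahead of the best single model.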

How to Gain Insights into the model?

The standardized coefficient magnitudes of the GLM model in the leaderboard give us a sense of what differs between a loan getting paid in full and a loan getting charged off/defaulted. The features in blue push toward the loan being paid in full, while those in orange push toward the loan defaulting; the length of each bar indicates its importance. In summary:

Top 7 factors correlated with a loan getting Fully Paid, in order of importance (looking only at the blue bars):

  • 36_months – The loan term is shorter (3 years)
  • A – The loan grade is “A”
  • total_bc_limit – The total bank card credit limit is high
  • mo_sin_old_rev_tl_op – Many months have passed since the oldest revolving account was opened
  • MORTGAGE – Whether the customer had a home mortgage loan open
  • total_il_high_credit_limit – Total installment high credit/credit limit (kind of % payments to total credit limit)
  • earliest_cr_line – When the first credit line was opened

Top 7 factors correlated with a loan getting Charged Off, in order of importance (looking only at the orange bars):

  • int_rate – The interest rate on the loan was high
  • 60_months – The loan term is longer (5 years)
  • <ABC> –
  • acc_open_past_24mths – A high number of accounts opened in the past 24 months
  • dti – The debt-to-income ratio is high
  • issue_d – The month/year in which the loan was issued
  • RENT – Whether the customer was a renter rather than a homeowner
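The two lists above can also be produced numerically rather than read off the plot. In the H2O-3 Python client, a GLM model exposes `coef_norm()`, which returns the standardized coefficients as a dictionary, and `std_coef_plot()`, which draws the bar chart. The helper below is a hypothetical sketch that splits such a dictionary by sign; it assumes that positive coefficients correspond to “Fully Paid”, as in the plot described above.

```python
def split_by_sign(std_coefs, top_n=7):
    """Return (positive, negative) lists of (name, coef), largest magnitude first."""
    # The intercept is not a feature, so leave it out of the ranking.
    items = [(k, v) for k, v in std_coefs.items() if k != "Intercept"]
    positive = sorted((kv for kv in items if kv[1] > 0), key=lambda kv: -kv[1])
    negative = sorted((kv for kv in items if kv[1] < 0), key=lambda kv: kv[1])
    return positive[:top_n], negative[:top_n]


# Usage with a GLM taken from the leaderboard (glm_model is assumed to be
# an H2O GLM model object already fetched with h2o.get_model):
#   toward_paid, toward_charged_off = split_by_sign(glm_model.coef_norm())
#   glm_model.std_coef_plot()
```

Which sign maps to which outcome depends on how the target levels are ordered, so it is worth confirming the direction against the plot before interpreting the lists.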

The model discovers these income, credit, and debt characteristics of customers automatically from the data. However, correlation should not be mistaken for causation (which is beyond the scope of this blog).

Beyond the GLM model, you can also walk through each model in the leaderboard and look at its variable importance. See the code in the original notebook: https://git.io/fjke6
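Walking the leaderboard for variable importance might look roughly like the sketch below; the notebook linked above has the authors' own version. The helper names are hypothetical. The sketch relies on `h2o.get_model` and on `model.varimp()`, which returns rows of (variable, relative_importance, scaled_importance, percentage), and it skips Stacked Ensembles on the assumption that they do not report variable importance of their own.

```python
def top_variables(varimp_rows, n=10):
    """Return the n most important variable names from H2O-style varimp rows.

    Each row is assumed to be
    (variable, relative_importance, scaled_importance, percentage).
    """
    ranked = sorted(varimp_rows, key=lambda row: row[1], reverse=True)
    return [row[0] for row in ranked[:n]]


def varimp_for_leaderboard(aml, n=10):
    """Collect the top-n variables for every leaderboard model that reports them."""
    import h2o  # lazy import; needs the cluster that trained `aml`

    result = {}
    model_ids = aml.leaderboard.as_data_frame()["model_id"]
    for model_id in model_ids:
        # Stacked Ensembles combine other models, so they are skipped here.
        if model_id.startswith("StackedEnsemble"):
            continue
        rows = h2o.get_model(model_id).varimp()
        if rows:  # some models may return nothing
            result[model_id] = top_variables(rows, n)
    return result
```

Comparing the lists across algorithms shows whether the same features keep surfacing, which gives more confidence than relying on a single model's ranking.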

How to Learn More about H2O-3 AutoML?

To learn more about H2O-3 open source AutoML, see Erin LeDell’s YouTube video below:


Summary of Results

The final AUC on the test set, shown above, was 0.729.
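For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python for small samples, and shows how the same number could be obtained from H2O with `model_performance(test_data=...).auc()`. The function names are hypothetical, and `aml` and `test` are assumed to be a trained `H2OAutoML` object and a test `H2OFrame`.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class, 0 for the negative class.
    A tie between a positive and a negative score counts as half a win.
    This pairwise version is quadratic, so it is meant for small samples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def leader_test_auc(aml, test):
    """AUC of the AutoML leader on a held-out H2OFrame."""
    return aml.leader.model_performance(test_data=test).auc()
```

Per-row predictions for the test set are available separately through `aml.leader.predict(test)`.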

The data was a snapshot in time in which loans were still running (some early stage and some late), not necessarily “cohorts”. In the data preparation phase, we also dropped a lot of columns that were giving away the outcome. The models built are still very useful for understanding the drivers behind the outcome. By using additional H2O-3 APIs, you can download scoring artifacts to productionize the model. So, how can we improve the accuracy and see full Machine Learning Interpretability of the final model?
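As one example of the scoring-artifact download mentioned above, the H2O-3 Python client offers `download_mojo` on a model. The sketch below is an assumption-laden illustration: the function name and output directory are hypothetical, and it assumes the leader's algorithm supports MOJO export in your H2O-3 version.

```python
def export_leader(aml, out_dir="./scoring_artifacts"):
    """Write the leader's MOJO (plus h2o-genmodel.jar) to out_dir; return its path."""
    import os

    # Create the target directory if it does not exist yet.
    os.makedirs(out_dir, exist_ok=True)
    # get_genmodel_jar=True also saves the Java scoring library next to the MOJO.
    return aml.leader.download_mojo(path=out_dir, get_genmodel_jar=True)
```

The resulting file can then be scored outside the H2O cluster, which is what makes it suitable for production deployment.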

Next Steps

H2O-3 AutoML can help you build models really quickly and understand variable importance with very little effort. Recall that we didn’t do any feature engineering (such as one-hot encoding) at all on the input data! In the next blog posts, we will explore how to do the following, in addition to Automatic Machine Learning:

  • Automatic Feature Engineering
  • Machine Learning Interpretability

with H2O’s commercial product Driverless AI.

About the Authors

Vinod Iyengar, VP of Products

Vinod is VP of Products at H2O.ai. He leads all product marketing efforts, new product development and integrations with partners. Vinod comes with over 10 years of Marketing & Data Science experience in multiple startups. He was the founding employee for his previous startup, Activehours (Earnin), where he helped build the product and bootstrap the user acquisition with growth hacking. He has worked to grow the user base for his companies from almost nothing to millions of customers. He’s built models to score leads, reduce churn, increase conversion, prevent fraud and many more use cases. He brings a strong analytical side and a metrics driven approach to marketing. When he is not busy hacking, Vinod loves painting and reading. He is a huge foodie and will eat anything that doesn’t crawl, swim or move.

Karthik Guruswamy

Karthik is a Principal Pre-sales Solutions Architect with H2O. In his role, Karthik works with customers to define, architect and deploy H2O’s AI solutions in production to bring AI/ML initiatives to fruition.

Karthik is a “business first” data scientist. His expertise and passion have always been around building game-changing solutions by using an eclectic combination of algorithms drawn from different domains. He has published 50+ blogs on “all things data science” on the LinkedIn, Forbes and Medium publishing platforms over the years for the business audience, and speaks at vendor data science conferences. He also holds multiple patents around Desktop Virtualization and Ad networks, and was a co-founding member of two startups in Silicon Valley.
