Lending Club publishes a basic version of its loan database to the public and a full version to its customers, anonymized of course. You can find the download page from this link (screenshot below):
The publicly downloadable loan data has roughly 150 columns covering categorical, numeric, text and date fields. It also has a ‘loan_status’ text column that indicates whether the loan was Fully paid or Charged off. This makes the data well suited to a binary classification problem in Machine Learning.
In this blog post series, we are going to explore how to do Automatic Machine Learning with the H2O.ai ML product suite. H2O.ai has two Auto ML solutions: H2O-3 Open Source and Driverless AI.
The Open Source version has been around for several years, is used by thousands of users, and is a scale-out enterprise product. It has basic Auto ML (Automatic Machine Learning) support.
Driverless AI, however, was announced just last year by H2O.ai for commercial use. Today the product runs on a single server instance, optionally with GPUs. Besides Automatic Machine Learning, it has a rich set of features such as Automatic Feature Engineering (with more than 30 feature transformers, including NLP!), Auto-Viz, Auto-Doc and Machine Learning Explainability.
The goal of this blog post series is to show you how to use Automatic Machine Learning and other features from a Jupyter notebook interface. We will use H2O.ai’s Python client libraries to connect to both H2O-3 Open Source and Driverless AI, build AI/ML models and take full advantage of the capabilities provided.
Data Prep
You can run the Python 3 code below in a Jupyter notebook to create two CSV files — train_lc.csv and test_lc.csv from Lending Club Data.
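As part of data cleansing and preparation, we drop target leakage columns, that is, fields that are only populated after a loan has been issued (payment and recovery details, for example) and would therefore reveal the outcome to the model. Removing them gives us a model whose accuracy holds up in production use.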
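For readers who want a feel for these steps, here is a minimal pandas sketch. It is not the author's notebook. The status strings, the column names in LEAKAGE_COLS, the 80/20 split, and the helper names prepare, split and write_splits are all assumptions made for illustration. A tiny synthetic frame stands in for the real Lending Club download.

```python
from pathlib import Path

import pandas as pd

TARGET = "loan_status"
# Assumed spellings of the two final outcomes; check them against your file.
FINAL_STATUSES = ["Fully Paid", "Charged Off"]
# Assumed examples of fields filled in after a loan is issued.
# This is an illustrative list, not the author's actual list.
LEAKAGE_COLS = ["total_pymnt", "recoveries", "last_pymnt_d", "last_pymnt_amnt"]


def prepare(df, leakage_cols=LEAKAGE_COLS):
    """Keep loans with a final outcome and drop leakage columns."""
    done = df[df[TARGET].isin(FINAL_STATUSES)]
    # errors="ignore" skips any listed column that is absent from this file.
    done = done.drop(columns=list(leakage_cols), errors="ignore")
    return done.reset_index(drop=True)


def split(df, test_frac=0.2, seed=42):
    """Randomly hold out test_frac of the rows as a test set."""
    test = df.sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)
    return train, test


def write_splits(train, test, out_dir="."):
    """Write the two CSV files named in the post."""
    train.to_csv(Path(out_dir) / "train_lc.csv", index=False)
    test.to_csv(Path(out_dir) / "test_lc.csv", index=False)


# Synthetic stand-in for the real data: 12 loans, 2 of them still in progress.
demo = pd.DataFrame({
    "loan_amnt": [1000 * (i + 1) for i in range(12)],
    "loan_status": ["Fully Paid"] * 6 + ["Charged Off"] * 4 + ["Current"] * 2,
    "recoveries": [0.0] * 12,
})

clean = prepare(demo)
train, test = split(clean)
print(train.shape, test.shape)
```

With the real download you would replace demo with a pd.read_csv call on your local file and then call write_splits(train, test) to produce train_lc.csv and test_lc.csv.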
The full notebook is available at https://git.io/fjTqb
In the next blog in this series, we will explore how to kick off Automatic Machine Learning with both H2O-3 Open Source and Driverless AI on the training data set, score the test data set, and compare the features and results across both products! You can read the second part here.