H2O Flow demo - Lending Club
Avni Wadhwa walks through predicting loan approvals at Lending Club using H2O Flow and a gradient boosting machine (GBM). Contribute to H2O open source machine learning software: https://github.com/h2oai
Speakers:
Avni Wadhwa, Analyst, H2O.ai
Avni Wadhwa:
In this video, we are going to be building a model that predicts whether or not a loan will default or be charged off. In other words, we are predicting whether or not the loan will be bad. The data used in this tutorial is taken from the Lending Club website.
Launch Standalone Instance
For our example, the data set we are using joins the rejected and accepted loan applications to predict whether a loan will be rejected or accepted by the Lending Club. Let's get started. First, I'm going to launch a standalone instance of H2O. Open your terminal, cd into the Downloads folder, then cd into your H2O directory and launch your instance. You can now access Flow, H2O's web UI, at localhost:54321.
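As a rough sketch of the launch step, the snippet below builds the command and the Flow address in Python. It assumes the H2O download contains a file named h2o.jar and that Java is installed. The directory name in the example is hypothetical, and launch_command, flow_url, and launch are illustrative helper names, not part of H2O.

```python
import subprocess
from pathlib import Path

DEFAULT_PORT = 54321  # port mentioned in the video for Flow


def launch_command(h2o_dir):
    """Command that starts a standalone H2O node from the given directory.

    Assumes the directory contains h2o.jar (an assumption about the download layout).
    """
    return ["java", "-jar", str(Path(h2o_dir) / "h2o.jar")]


def flow_url(host="localhost", port=DEFAULT_PORT):
    """Address where the Flow web UI should be reachable once H2O is up."""
    return f"http://{host}:{port}"


def launch(h2o_dir):
    """Start H2O in the background (requires Java on the PATH)."""
    return subprocess.Popen(launch_command(h2o_dir))


# Hypothetical directory name; substitute wherever you unpacked H2O.
print(" ".join(launch_command("Downloads/h2o")))
print(flow_url())
```

Calling launch(...) with your own directory starts the instance; the two print lines only show what would be run and where to point the browser.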
Import File
To get started, we are going to import our file. Click on the import files button from the menu and type the file path into the search bar. Select the appropriate file and go ahead and import it. Next, click on the parse button. You are now at the parse setup phase.
Variable Types
You can change the variable types here before parsing your file. You can see the different types of files that H2O supports in the parser menu. Since our file is a CSV, comma is the selected separator. We see that "first row contains column names" is selected, and the first column in the table below is drawn from the first row of our CSV file. In our tutorial, we want to detect whether or not a loan is going to be bad, which means that we have a binary classification problem. This means that the variable we are predicting needs to be an enum rather than numeric, so we are going to find the bad loan variable and change its type to enum. Then we are going to go ahead and parse our file.
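For readers who prefer scripting to clicking, the same parse choices can be sketched with H2O's Python client. This is a minimal sketch, not the method used in the video: the column name bad_loan is an assumption (the video only says "bad loan"), and parse_options and import_loans are hypothetical helper names.

```python
def parse_options(response="bad_loan"):
    """Parse settings mirroring the choices made in Flow's parse setup."""
    return {
        "sep": ",",                       # comma-separated file
        "header": 1,                      # first row contains column names
        "col_types": {response: "enum"},  # treat the response as categorical
    }


def import_loans(path, response="bad_loan"):
    """Import and parse the CSV with H2O's Python client."""
    import h2o  # imported here so parse_options works without H2O installed

    h2o.init()  # connects to a running local instance or starts one
    return h2o.import_file(path=path, **parse_options(response))


print(parse_options())
```

Setting the response to enum at parse time is what makes the later model a classifier; a numeric response would be treated as a regression target.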
Parse Files
We get a preview of the frame that we have created. We have about 164,000 rows and 59 columns with a compressed file size of about 36 megabytes. You can click on individual variables to get an overview of how each is represented in the data set. Now we are going to create a train/test split. We do so by clicking on the split button. One frame is going to contain 70% of our original data set, while the other is going to contain the remaining 30%. I'm going to name the training set train and the test set test, and then we are going to create our split.
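The split can be illustrated with a small standard-library sketch. H2O's split_frame documentation describes the split as probabilistic rather than exact, so the sketch assigns each row at random with probability 0.7. Whether Flow's split button behaves identically is an assumption here, and probabilistic_split and split_h2o_frame are hypothetical helper names.

```python
import random


def probabilistic_split(n_rows, train_ratio=0.7, seed=42):
    """Assign each row index to train with probability train_ratio, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for i in range(n_rows):
        (train if rng.random() < train_ratio else test).append(i)
    return train, test


def split_h2o_frame(frame, seed=42):
    """Corresponding call on an H2OFrame; the frame names match the video."""
    return frame.split_frame(
        ratios=[0.7], destination_frames=["train", "test"], seed=seed
    )


train_idx, test_idx = probabilistic_split(164_000)
print(len(train_idx) / 164_000)  # close to 0.7, but usually not exactly 0.7
```

Fixing the seed makes the split reproducible, which helps when comparing models trained at different times.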
The train set is going to be used to train the model while the test set is going to be used to validate the model results. Now we are ready to build our model. Click on the model button in the menu running across the top of the screen and select the gradient boosting machine button. Select train from the training frame menu and test from the validation frame menu. In the response column menu, select bad loan since that is what we are predicting for.
Build Model
Scroll down and select the all button in order to ignore all the columns. Then go and deselect the following variable names. To be clear, these variables are going to be used to predict whether or not a loan is classified as bad. The variables are as follows: loan amount, term, length of employment, home ownership, annual income, verification status, purpose, address state, debt to income ratio, delinquency in the borrower's credit file for the past two years, both the FICO range low and high, inquiries by creditors in the last six months, open credit lines, number of derogatory public records, total credit revolving balance, revolving line utilization rate, total number of credit lines, number of collections in 12 months (excluding medical collections), bad loan, and credit length in years. Select score each iteration so that you get statistics on each iteration, and leave all the other parameters at their defaults. Then build the model.
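The column selection and model build can be sketched in Python as follows. The column names are assumptions based on Lending Club's usual field names, since the video gives only descriptions, so check them against your own frame. ignored_columns and train_gbm are hypothetical helpers; H2OGradientBoostingEstimator and its score_each_iteration option come from H2O's Python client.

```python
# Assumed column names for the 20 predictors described in the video.
PREDICTORS = [
    "loan_amnt", "term", "emp_length", "home_ownership", "annual_inc",
    "verification_status", "purpose", "addr_state", "dti", "delinq_2yrs",
    "fico_range_low", "fico_range_high", "inq_last_6mths", "open_acc",
    "pub_rec", "revol_bal", "revol_util", "total_acc",
    "collections_12_mths_ex_med", "credit_length_in_years",
]
RESPONSE = "bad_loan"  # assumed name of the response column


def ignored_columns(all_columns, predictors=PREDICTORS, response=RESPONSE):
    """Columns left ticked as ignored after deselecting predictors and response."""
    keep = set(predictors) | {response}
    return [c for c in all_columns if c not in keep]


def train_gbm(train, test):
    """Train a GBM with default parameters, scoring after every tree."""
    from h2o.estimators import H2OGradientBoostingEstimator  # needs the h2o package

    model = H2OGradientBoostingEstimator(score_each_iteration=True)
    model.train(x=PREDICTORS, y=RESPONSE, training_frame=train, validation_frame=test)
    return model


# "id" and "int_rate" stand in for columns that stay ignored.
print(ignored_columns(["id", "loan_amnt", "int_rate", "bad_loan"]))
```

Passing an explicit predictor list to train has the same effect as ignoring every other column in Flow.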
Results
You can view your results once the model is done building. The scoring history shows the relationship between the number of trees and the decreasing mean squared error for both the validation and training data sets. The blue curve represents the training data, while the orange one represents the validation data. You can click on an individual tree to get information like mean squared error or area under the curve at a given tree. You then get the ROC curves for both the training and validation metrics, which show the relationship between the true and false positive rates. You can get information about a spectrum of threshold values by selecting an area along the curve, or you can select a specific criterion from the menu here. You can view your variable importance graph to see the relative effect each variable had on your model. Lastly, you can preview your POJO, which you would use to put this model into production.
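To make the ROC curve concrete, here is a small standard-library sketch of how one point on the curve and the area under it can be computed from labels and predicted scores. The four data points are made up for illustration, and roc_point, auc, and model_summary are hypothetical helper names. The model methods called in model_summary exist in H2O's Python client, but this is not how Flow itself renders the plots.

```python
def roc_point(labels, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


def auc(labels, scores):
    """Area under the ROC curve as the share of (positive, negative) pairs
    in which the positive example gets the higher score (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def model_summary(model):
    """Validation metrics and variable importances from a trained H2O model."""
    return {
        "auc": model.auc(valid=True),
        "mse": model.mse(valid=True),
        "varimp": model.varimp(),
    }


# Made-up example: 1 marks a bad loan, scores are predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_point(labels, scores, 0.5))  # (0.0, 0.5)
print(auc(labels, scores))             # 0.75
```

Lowering the threshold moves the point up and to the right along the curve: more bad loans are caught, at the cost of more good loans flagged.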