How an Economics Nobel Prize could revolutionize insurance and lending
In Part 1, we proposed using machine learning (ML) to improve revenue and manage regulatory requirements. We made the first part of the argument by showing how gradient boosting machines (GBMs), a type of ML, can match exactly, then exceed, both the technical merits and the business value of popular generalized linear models (GLMs) using a straightforward insurance example.
Part 2 of this blog uses a more realistic and detailed credit card default scenario to show how monotonicity constraints, Shapley values and other post-hoc explanations, and discrimination testing can enable practitioners to create direct comparisons between GLM and GBM models. Such comparisons can then enable practitioners to build from GLM to more complex GBM models in a step-by-step manner, while retaining model transparency and the ability to test for discrimination. In our credit use case, we show that a GBM can lead to better accuracy and more revenue, and that it is also likely to fulfill model documentation, adverse action notice, and discrimination testing requirements.
Some bankers have recently voiced skepticism about artificial intelligence (AI) in lending – and rightly so.1 To be clear, we’re not advocating for head-over-heels AI hype. We hope to present a judicious, testable and step-by-step method to transition from GLM to GBM in Part 2 of this post. Perhaps obviously, we feel that GBMs can model the real world and credit risk better than GLMs. We also think ML can be done while preserving extremely high levels of transparency and while keeping algorithmic discrimination at bay.
Part 1 of this article has already shown that GBMs might not only be more accurate predictors, but when combined with Shapley values, they can also be more accurate for explanations and attributions of causal impact to predictors. To build off Part 1, we now want to showcase the potential transparency and business benefits presented by a transition from GLM to GBM in lending. For full transparency (and hopefully replicability) we use the UCI credit card data, which is freely available from the UCI machine learning dataset repository, and open source h2o-3 code. The credit card dataset contains information about 30,000 credit card customers regarding demographic characteristics, and payment and billing information. The dependent variable to predict is payment delinquency. Our use case will compare a GLM to several different GBMs from a technical and business perspective.
In order to show the business value of GBMs compared to GLMs, we ran a simulation with different algorithms (GLM, Monotonic GBM (MGBM), GBM, and a Hybrid GBM). Here’s a short summary of the models we tried:
As you’ll see below, there’s a large difference in projected revenue across these models in our simulations. Why? The models judge risk differently. A GLM is not able to reflect non-linear relationships and interactions between variables without manual adjustments, whereas GBMs inherently model both. In the past, GLMs were considered highly transparent and, as such, were preferred for business purposes, while GBMs were considered black boxes. But monotonicity constraints can give GBMs much of the same inherent interpretability as GLMs without sacrificing accuracy. Table 1 provides an overview of the different simulation models and their capabilities.
Table 1: Overview of models and their capabilities. “+” indicates the presence of a capability. “-” indicates the absence of a capability. Capabilities are measured relative to the GLM.
All models were selected by grid search and evaluated using validation AUC. The outcome can be seen in Table 2. The GLM model has the lowest AUC score with 0.73, while the best GBMs reach an AUC of 0.79. Great … but what does that mean for a business?
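Validation AUC, the selection metric used above, can be read as the probability that a randomly chosen delinquent customer receives a higher score than a randomly chosen non-delinquent one. The sketch below illustrates that pairwise definition only; the function name `auc` and the labels and scores are made up for illustration and are not taken from the h2o-3 workflow.

```python
def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(n_pos * n_neg), which is
    fine for a small demonstration but slow for large validation sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical validation labels (1 = delinquent) and model scores.
y_valid = [0, 0, 1, 0, 1, 1, 0, 1]
p_valid = [0.10, 0.35, 0.40, 0.20, 0.80, 0.30, 0.50, 0.90]
print(auc(y_valid, p_valid))  # 13 of 16 pairs ranked correctly -> 0.8125
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is one way to place the 0.73 and 0.79 figures in context.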
|Business Impact ↑
Table 2: Model simulation outcomes. Arrows indicate the direction of improvement for each measurement. Currency is $NT.
To assess the business impact of each model, we make a few basic assumptions, summarized in Table 3. A credit card customer whom we accurately classify as delinquent is cost-neutral. An accurately classified customer who is not delinquent brings $NT 20,000 of lifetime value; incorrectly classifying a customer as delinquent costs $NT 20,000 in lifetime value; and incorrectly extending credit to a delinquent customer leads to write-offs of $NT 100,000.
|Outcome|Value|
|True Positive|$0|
|False Positive|-$20,000|
|False Negative|-$100,000|
|True Negative|$20,000|
Table 3: Assumptions used to estimate business impact in Table 2. Currency is $NT.
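The arithmetic behind a business impact estimate of this kind can be sketched in a few lines. The payoffs below come from Table 3; the probability cutoff, the function names, and the labels and scores are assumptions made for illustration, since the cutoff used in the simulation is not stated here.

```python
# Payoff per outcome in $NT, as listed in Table 3.
PAYOFF = {"TP": 0, "FP": -20_000, "FN": -100_000, "TN": 20_000}


def confusion_counts(labels, scores, cutoff):
    """Count outcomes when scores >= cutoff are classified as delinquent."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for y, s in zip(labels, scores):
        predicted = 1 if s >= cutoff else 0
        if predicted == 1 and y == 1:
            counts["TP"] += 1
        elif predicted == 1 and y == 0:
            counts["FP"] += 1
        elif predicted == 0 and y == 1:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


def business_value(counts, payoff=PAYOFF):
    """Total value: number of customers in each outcome times its payoff."""
    return sum(counts[k] * payoff[k] for k in payoff)


# Hypothetical labels (1 = delinquent), scores, and cutoff.
y = [0, 0, 1, 0, 1, 1, 0, 1]
p = [0.10, 0.35, 0.40, 0.20, 0.80, 0.30, 0.50, 0.90]
counts = confusion_counts(y, p, cutoff=0.35)
print(counts)                  # {'TP': 3, 'FP': 2, 'FN': 1, 'TN': 2}
print(business_value(counts))  # -100000
```

In this toy example a single false negative cancels out the value of five good customers, which is why a model that avoids false negatives can look so much better under these assumptions.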
Based on those assumptions, the GLM shows the lowest revenue, $NT 7.9M. The model with the highest impact is the Hybrid GBM, with a business value of $NT 19.22M, almost 2.5 times that of the GLM!
How is this possible? The Hybrid GBM model is much better at avoiding those costly false negative predictions. This simple example shows that the transition from GLM to GBM models can significantly increase business value and reduce risk. Now, let’s have a look at how to compare GLMs to GBMs from a technical perspective, so you can decide how and when to transition from GLMs to GBMs.
It is true that GLMs have exceptional interpretability. Believe it or not, GBMs can also have extremely high interpretability. GLMs are so interpretable because of their additive, monotonic form and low degree of interaction between variables. Through the careful application of monotonicity and interaction constraints, GBM users can now bring domain knowledge to bear on a modeling task, increase interpretability, avoid overfitting to development data, and make informed decisions about when to use what kind of model.2
Users can often specify the same monotonicity of variables found in GLMs in a GBM. In Figure 1 (right), the GLM-modeled behavior of the feature PAY_0, or a customer’s most recent repayment status, is monotonic increasing. As PAY_0 values become larger, probability of default also becomes larger. In Figure 1 (right), the MGBM models the same behavior using a positive monotonic constraint – just like the GLM! As PAY_0 increases under the MGBM, probability of default also increases.
What’s different is the functional form of the GLM versus the functional form of the GBM. The GLM is restricted to a logistic curve in this case, whereas the GBM can take on an arbitrarily complex stair-step form that also obeys the user-supplied monotonicity constraint. Additionally, Figure 1 shows how the GBM and Hybrid models behave for PAY_0 without monotonicity constraints. Which looks more realistic to you? The histogram and mean behavior of PAY_0 in Figure 1 (left) can help you determine which model fits the data best. The goal is to match the red mean target line as weighted by the histogram. The good news is, no matter which functional form you prefer, you can now use it in an interpretable way that reflects business domain knowledge.
Figure 1: Histogram of PAY_0 with the mean value of the target, DEFAULT_NEXT_MONTH, by PAY_0 in red (left). Partial dependence and ICE estimated behavior of each model (right). The left figure helps gauge the fit of the models on the right.
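To make the idea of a monotonic stair-step form concrete, the sketch below fits a nondecreasing step function to per-level mean targets, weighted by how many rows fall at each level. This is weighted isotonic regression (the pool-adjacent-violators algorithm), a much simpler stand-in and not how a constrained GBM is trained; the function name `monotone_step_fit` and all numbers are hypothetical.

```python
def monotone_step_fit(values, weights):
    """Weighted least-squares nondecreasing fit (pool-adjacent-violators).

    values[i]  : mean target at the i-th feature level, in increasing
                 feature order.
    weights[i] : number of rows at that level (must be positive).
    Returns one fitted value per level, forming a nondecreasing staircase.
    """
    # Each block is [weighted mean, total weight, number of levels pooled].
    blocks = []
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Pool neighbouring blocks while they violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    fitted = []
    for mean, _, n in blocks:
        fitted.extend([mean] * n)
    return fitted


# Hypothetical mean default rate and row count at five feature levels.
mean_target = [0.10, 0.30, 0.20, 0.60, 0.50]
row_counts = [100, 50, 50, 10, 30]
print(monotone_step_fit(mean_target, row_counts))
```

Levels that break the ordering are pooled into a shared step at their weighted mean, so sparsely populated levels have little influence on the fitted shape, in the same spirit as reading the mean target line against the histogram.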
How are we able to see the way that the GBM modeled PAY_0? Through what’s known as partial dependence and individual conditional expectation (ICE) plots. In Figure 1 (right), partial dependence (red) displays the estimated average behavior of the model. The other ICE curves in the plot show how certain individuals behave under the model. By combining the two approaches, we get a chart that displays the overall and individual behavior of our model. Partial dependence and ICE are just one example of post-hoc explanation techniques, or processes we can run on a model after it’s trained to get a better understanding of how it works.
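Both quantities are simple to compute once a model is available: an ICE curve re-scores one customer while a single feature is swept over a grid, and partial dependence averages those curves. The sketch below uses a hypothetical hand-written scoring function (`toy_model`) in place of a trained GLM or GBM; its coefficients, the `LIMIT_BAL` column, and the sample rows are assumptions for illustration.

```python
import math


def toy_model(row):
    """Hypothetical scoring function standing in for a trained model."""
    z = -2.0 + 0.8 * row["PAY_0"] - 0.00001 * row["LIMIT_BAL"]
    return 1.0 / (1.0 + math.exp(-z))


def ice_curve(model, row, feature, grid):
    """ICE: one row's predictions as `feature` is swept over `grid`."""
    return [model({**row, feature: g}) for g in grid]


def partial_dependence(model, rows, feature, grid):
    """Partial dependence: the average of all ICE curves at each grid point."""
    curves = [ice_curve(model, r, feature, grid) for r in rows]
    return [sum(c[i] for c in curves) / len(curves) for i in range(len(grid))]


# Hypothetical customers and a grid of PAY_0 values to sweep.
rows = [
    {"PAY_0": 0, "LIMIT_BAL": 50_000},
    {"PAY_0": 2, "LIMIT_BAL": 200_000},
    {"PAY_0": -1, "LIMIT_BAL": 120_000},
]
grid = [-1, 0, 1, 2, 3]

for r in rows:
    print([round(v, 3) for v in ice_curve(toy_model, r, "PAY_0", grid)])
print([round(v, 3) for v in partial_dependence(toy_model, rows, "PAY_0", grid)])
```

Plotting the ICE curves alongside the averaged curve gives the kind of chart described for Figure 1 (right); ICE curves that diverge from the average can hint at interactions.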
Shapley values, a technique from game theory developed by a Nobel laureate, provide additional and crucial insights for the GBM models. When applied to tree-based models like GBM, Shapley values are a highly accurate measurement of how variables contribute to a model’s predictions, both overall and for any individual customer. This is incredibly helpful for interpretability as it enables:
Figure 2: Absolute Pearson correlation to DEFAULT_NEXT_MONTH and average feature contributions for each model. GLM and GBM contributions can be interpreted in the log odds space. This figure allows for a meaningful overall comparison of simple models to more complex models.
Just like Figure 1 shows how the treatment of PAY_0 changes from a GLM to different GBMs, a direct comparison between the overall (Figure 2) and per-customer (Figure 3) variable contributions for GLM and GBM is now possible. In Figure 2, we can see how each model treats variables from an overall perspective, and compare simple models (e.g., Pearson correlation, GLM) to the more complex models. In Figure 3, we can see how each model arrived at its prediction for three individual customers. All this enables a direct comparison of GLM and GBM treatment of variables, so you can both adequately document GBMs and make decisions about the transition to GBM with confidence! Moreover, the per-customer information displayed in Figure 3 could also provide the raw data needed for adverse action notices, a serious regulatory consideration in credit lending.
Figure 3: Variable contributions to predictions for individuals at selected percentiles of GLM model predictions. This type of analysis allows for meaningful comparisons of how simpler and more complex models treat individual customers, and potentially for the generation of adverse action notices.
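For intuition about the per-customer contributions discussed above, Shapley values can be computed exactly for a tiny model by enumerating every coalition of features. The sketch below does this for a hypothetical two-feature log-odds model with an interaction term; features outside a coalition are set to a baseline value, which is a simplification and not the tree-specific algorithm a GBM library would use. The model, the `PAY_2` feature, and the baseline are illustrative assumptions.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one row by enumerating feature coalitions.

    Features outside a coalition take their baseline value. The cost is
    exponential in the number of features, so this only suits tiny examples.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        row = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return model(row)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


def toy_log_odds(row):
    """Hypothetical model output in log-odds space, with an interaction."""
    return (-1.5 + 0.8 * row["PAY_0"] + 0.3 * row["PAY_2"]
            + 0.2 * row["PAY_0"] * row["PAY_2"])


customer = {"PAY_0": 2, "PAY_2": 1}
baseline = {"PAY_0": 0, "PAY_2": 0}
phi = shapley_values(toy_log_odds, customer, baseline)
print(phi)
# Contributions sum to the gap between this prediction and the baseline's.
print(sum(phi.values()), toy_log_odds(customer) - toy_log_odds(baseline))
```

The contributions always add up to the difference between the customer's prediction and the baseline prediction, and the interaction effect is split between the two features involved. That additivity is what makes per-customer contributions comparable across simple and complex models.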
Another crucial aspect of model diagnostics that must be conducted under several federal and local regulations in the US is discrimination testing. Of course, if certain types of discrimination are found, then they must be fixed too. The transparency of GLMs is often helpful in this context. However, the constraints and post-hoc explanation steps outlined above make finding and fixing discrimination in GBM models much easier than it used to be. Moreover, the concept of the multiplicity of good models in ML – where a single dataset can generate many accurate candidate models – presents a number of options for fixing discrimination that were often not available for GLMs.3
In our credit card example, the GBM is tested for discrimination using measures with long-standing legal and regulatory precedent: adverse impact ratio (AIR), marginal effect (ME), and standardized mean difference (SMD). Those results are available in Table 4. Luckily, between men and women, there is little evidence of discrimination for any of our models. However, if discrimination were found, GBMs may actually present more options for remediation than GLMs.
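As a rough sketch of how these three measures are commonly computed from model scores, the code below uses their usual definitions: AIR as the ratio of acceptance rates, ME as the difference in acceptance rates in percentage points, and SMD as the difference in mean scores divided by the standard deviation of all scores. The exact variants used for Table 4 are not stated here, so the cutoff, the choice of population standard deviation, the group labels, and the scores are all assumptions.

```python
from statistics import mean, pstdev


def acceptance_rate(scores, cutoff):
    """Share of applicants whose predicted default probability is below cutoff."""
    return sum(1 for s in scores if s < cutoff) / len(scores)


def adverse_impact_ratio(protected, control, cutoff):
    """Protected-group acceptance rate divided by control-group acceptance rate."""
    return acceptance_rate(protected, cutoff) / acceptance_rate(control, cutoff)


def marginal_effect(protected, control, cutoff):
    """Control minus protected acceptance rate, in percentage points."""
    return 100.0 * (acceptance_rate(control, cutoff)
                    - acceptance_rate(protected, cutoff))


def standardized_mean_difference(protected, control):
    """Difference in mean scores divided by the std. dev. of all scores."""
    return (mean(protected) - mean(control)) / pstdev(protected + control)


# Hypothetical predicted default probabilities for two groups.
protected_scores = [0.10, 0.20, 0.60, 0.55, 0.30]
control_scores = [0.15, 0.25, 0.20, 0.70, 0.40]
cutoff = 0.5  # assumed decision threshold: accept if score < cutoff

print(round(adverse_impact_ratio(protected_scores, control_scores, cutoff), 3))
print(round(marginal_effect(protected_scores, control_scores, cutoff), 1))
print(round(standardized_mean_difference(protected_scores, control_scores), 3))
```

An AIR near 1, an ME near 0, and an SMD near 0 all point toward similar treatment of the two groups; which thresholds count as acceptable is a legal and compliance question rather than a coding one.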
In addition to variable and hyperparameter selection, researchers have put forward potentially compliant adversarial approaches for training non-discriminatory ML models, and GBMs now offer users monotonicity and interaction constraints that can help fix discriminatory model outcomes. Basically, GBMs just have more knobs to turn than GLMs, leaving more wiggle room to find an accurate and non-discriminatory model. Likewise, the post-hoc explanation techniques described above can also be used to understand drivers of algorithmic discrimination and to validate their removal from GBM models.
Table 4. Discrimination measures for the tested models.
Many government agencies have telegraphed likely future ML regulation, or outside of the US, started to implement such regulations. It’s important to note that US government watchdogs are not saying ML is forbidden. Generally speaking, they are saying make sure your ML is documented, explainable, managed, monitored, and minimally discriminatory. Arguably, the steps outlined in Part 2 provide a blueprint for explainability and discrimination testing with GBM, which should in turn help with aspects of model documentation. Moreover, most large financial institutions already have model governance and monitoring processes in place for their traditional predictive models. These could potentially be adapted to ML models.
Of course, it’s really not the place of a software vendor to opine on what is, and what is not, compliant with regulations. So, have a look for yourself to see what some US government agencies are thinking:
Outside of government, some financial services organizations are already claiming to use machine learning in regulated dealings and researchers are publishing on GBM and Shapley values for credit lending applications. For instance, in 2018 Equifax announced their Neurodecision system, “a patent-pending machine learning technology for regulatory-compliant, advanced neural network modeling in credit scoring.” Since 2018, Wells Fargo has also introduced several machine learning techniques for model validation, including LIME-SUP, explainable neural networks, and a number of additional model debugging methods.
In 2019, Bracke et al. at The Bank of England published an explainable AI use case for credit risk featuring GBM and Shapley values. Later the same year, Bussman et al. published a similar piece, introducing a GBM and Shapley value example in the journal Credit Risk Management. In March of 2020, Gill et al. published a mortgage lending workflow based on monotonically constrained GBM, explainable neural networks, and Shapley values, that gave careful consideration to US adverse action notice and anti-discrimination requirements.
It now appears possible to take cautious steps away from trusted GLMs to more sophisticated GBMs. The use of constraints, post-hoc explanation, and discrimination testing enables you to compare GBMs to GLMs. These techniques may very well enable compliance with adverse action notice, discrimination, and documentation requirements too. And with a little luck, GBMs could lead to better financial outcomes for consumers, insurers, and lenders. As all this momentum and hype mounts for machine learning in regulated financial services, we hope that Parts 1 and 2 of this post will be helpful for those looking to responsibly transition from GLM to GBM.