
Building Explainable Machine Learning Systems: The Good, the Bad, and the Ugly

The good news is that building fair, accountable, and transparent machine learning systems is possible. The bad news is that it’s harder than many blogs and software package docs would have you believe. The truth is that nearly all interpretable machine learning techniques generate approximate explanations, that the fields of eXplainable AI (XAI) and Fairness, Accountability, and Transparency in Machine Learning (FAT/ML) are very new, and that few best practices have been widely agreed upon. This combination can lead to some ugly outcomes!

This talk aims to make your interpretable machine learning project a success by describing fundamental technical challenges you will face in building an interpretable machine learning system, defining the real-world value proposition of approximate explanations for exact models, and then outlining the following viable techniques for debugging, explaining, and testing machine learning models:

* Model visualizations including decision tree surrogate models, individual conditional expectation (ICE) plots, partial dependence plots, and residual analysis.

* Reason code generation techniques like LIME, Shapley explanations, and Treeinterpreter.

* Sensitivity analysis.

Plenty of guidance on when, and when not, to use these techniques will also be shared, and the talk will conclude by providing guidelines for testing generated explanations themselves for accuracy and stability.

Open source examples (with lots of comments and helpful hints) for building interpretable machine learning systems are available to accompany the talk at:


Patrick Hall, Senior Director for Data Science Products,

Navdeep Gill, Software Engineer & Data Scientist,

Read the Full Transcript

Patrick Hall:

Okay. Hi, good evening everyone, my name is Patrick Hall. I try to lead the interpretability efforts at H2O. So the talk tonight is going to be about interpretability for machine learning models, and to be clear from the very beginning, we're not really going to touch on deep learning. I think a lot of the academic work in this area touches on deep learning; we are much more concerned about, in particular, tree-based models on IID data regarding human behavior: is someone going to pay their bill? Is someone going to default on their loan? Is someone going to, you know, churn from one phone company to another phone company, those kinds of problems. So I will say we are very interested in moving into interpretability for deep learning. And if that's something that you are interested in, maybe we would be interested in hiring you.

So if you have, if you have kind of a substantial resume around deep learning academic publications, we would be interested in talking to you about that, if that's something you're interested in, or lots of experience in the industry. Okay. So the title of this talk is Machine Learning Interpretability: the good, the bad and the ugly. And it's titled that way for a reason. I think there's a lot of good things going on when it comes to interpreting this certain kind of complex predictive model. And I think there's also a lot of bad and ugly things going on when it comes to interpreting this certain type of predictive model. So I'm going to start, you know, from a very broad non-technical perspective and slowly move into details. And I would say if you have kind of a clarification question as we're going along, just raise your hand, we'll try to clear it up.

If you have a kind of longer discussion that you'd like to have, I'll be around after the talk and I'll be at the H2O booth tomorrow. Okay. So this is not a one-man show, lots, lots, and lots of help from lots of very brilliant people. So I'm very lucky to work with a very talented team of scientists and engineers at H2O, and we're also very lucky to work with some of the biggest names in statistical learning as our academic advisors. And we've worked hard enough or we've been lucky enough, probably some combination of both, that our work around interpretability has been to a certain extent embraced by the community. And we've been doing lots of talks and tutorials at various conferences this year and last year. And so that's been great.

So I need to ask this actually: does everyone here know what machine learning is, before we talk about machine learning interpretability? I know no one's going to raise their hand and say they don't. Now I should have asked it the other way, I should have said, who knows what machine, let me ask. Let's just pretend I didn't say that. Who knows what machine learning is? Okay. All right. All right. What is machine learning interpretability? So assuming you know what machine learning is, machine learning is this sub-discipline of AI that learns these very complex functions from data that's typically trying to predict or classify or discriminate between different types of phenomena. Right? So the problem is that these functions become very, very complex and difficult to interpret, difficult to explain to your boss how they work.

And when I say difficult, I mean borderline impossible. Difficult to explain to your boss how they work, difficult to explain to your coworkers how they work, difficult to explain to your customers how they work. So they tend to be very accurate predictors, but very difficult to explain. And there's lots of definitions of interpretability actually, and this is just the simplest one. And I'm a simple man, so this is my favorite definition of machine learning interpretability. So we're looking for a way to present or explain to a human being what the complex mathematical function that's been learned from data is doing. And if you have not read "Towards a Rigorous Science of Interpretable Machine Learning", it's not even a long paper, I'd highly recommend it. It's very approachable; even, say, the business analyst community would probably be able to get something out of it.

Very, very crucially of course, H2O are not the only people working on machine learning interpretability. There's a large group of academics that work under this acronym FAT/ML: fairness, accountability, and transparency in machine learning. And they have a great webpage that goes into a lot of detail about what interpretability is exactly and why it's important exactly, so I would urge you to check that out. And then another group of researchers, I live in Washington DC, so maybe I just run across these people more than you do. Another very prominent group of researchers in this field is from DARPA, the people who brought you the internet. So, it's the military basically. And you can imagine why the military would be very interested in this; they're especially interested in deep learning interpretability, they're especially interested in training sophisticated pattern recognition technologies, and then understanding how they work and when they might fail.

And you can, you can imagine why they might be interested in this. But they tend to call their program XAI or Explainable AI, and they have a nice website that talks about some of their unclassified goals as well. Okay. So why should you care about this? This is one of my favorite quotes; it's maybe a little non sequitur, but I think it's really important. This is not just about helping banks use more accurate predictive modeling algorithms, even though that's basically what the commercial drive is about. In general, artificial intelligence promises us more convenience, organization, and automation in our day-to-day lives. And as these technologies become more and more important in our lives, we'll probably have more questions about how they work, and especially we'll have questions about when they go wrong, or if they send us to jail or don't let our kids into college.

So there is a very, very important social aspect to this problem. Now I work for a commercial company, mostly I'm interested in helping regulated industries that have previously not been able to adopt machine learning technologies due to heavy regulatory burdens, help them use these very complex predictors. That's my, you know, professional goal. I think it's very, very important to think about the ways that these artificial intelligence systems will be impacting our day-to-day lives. And we will certainly want to know how they're making decisions as they become more and more prevalent in our lives. And I think another aspect of this that's very important, we've all seen something about this recently, most likely, as security, cyber security, IT security issues become more and more prominent in our lives. And as machine learning plays more and more central roles in these complex IT systems, it will be very important to be able to debug these systems to understand if they've somehow been tampered with. And it's very, very difficult to do that without having deep insight into the internal mechanisms of the system, whether the systems have been tampered with, whether the inputs of the system are being tampered with, or the outputs of the system are being tampered with. So, I think there's both very important social and commercial motivations for this problem.

What Are Reason Codes

Okay. So why haven't we just been doing this? Well, when was machine learning invented? It was invented in 2006 by Google and Facebook, right? No. Who wants to give an answer? Who wants to give an answer? Who's got an answer? What's that? Who's got the answer? Yeah. I would say 1940s, 1950s personally. Okay. But you may have other opinions. So it's an older technology; why haven't we just been doing interpretable machine learning since the beginning? Well, because it's difficult. I'm going to talk about two reasons why I feel it's difficult that I've run into in my undertakings in this, in this direction. And you, you may run into other problems, cause I think it is a fairly fraught and difficult problem to solve. So machine learning algorithms create functions that intrinsically combine and recombine variables until they're interacting in very, very sophisticated ways.

Okay. So the one goal of machine learning is, I'm sorry, one goal of machine learning interpretability is to disaggregate a prediction, to form something called reason codes. So if you've ever been turned down for a credit card or even looked at your credit report, the credit rating agencies or the credit lender has to, if they turn you down for a credit card, they have to give you, I think, in the US, it's five reasons why they turned you down. And those are called reason codes. And those are related to your inputs, the variables that went into their credit scoring model that decided you probably wouldn't pay your bills. Okay. So if you're turned down for a credit card, you might see something like your length of credit history isn't long enough, your savings account balance isn't high enough, things like this. Okay. It doesn't say the sophisticated interaction between your savings account balance, your length of credit history, your debt to income ratio, and five other variables are why we turned you down. Right? We have to break it down in simple terms, and those are called reason codes. And that's sort of fundamentally at odds with the way machine learning algorithms work. Machine learning algorithms intrinsically consider high-degree interactions between input variables. And so disaggregating a prediction into single feature contributions is a difficult thing to do. Okay. And potentially even questionable. So, one thing to remember is that a lot of these approaches that we'll talk about later are approximate. Some are very approximate.

Difficulty of Machine Learning Interpretability

Another reason that I find machine learning interpretability to be difficult is because, when we move, so I'm just going to talk loud. So what I'm trying to illustrate here, it's a little, a little advanced, but I think you'll get it. So when I train a good old-fashioned linear model, I'm a big fan of linear models. I'm a big proponent of linear models. When I train a good old-fashioned linear model, that's called a convex optimization problem. And what that means is given my input data and some numeric value for like, for a parameter, for interest rate, let's say for example, and for a model parameter for income, some combination of those model parameters in your input data will lead to the lowest possible error state. And that lowest possible error state if I was to draw a straight line here and over here straight to this axis, you know, these would be the values for my income and interest rate parameters in my model.

Okay. And you can see there's only one best model. It's not quite as simple as this in the real world, but it's not far off actually. Now, when I go into the machine learning world, I can actually have many very good models for the same data set. And this is a well-known phenomenon and it's sometimes referred to as the multiplicity of good models. So for even a well-understood data set, there's many, potentially an infinite number, of machine learning models that can give you good predictions. So when I go to explain one of these models, it's very crucial to remember I'm just explaining the one that we happen to be using, the one model out of many, many, many good models for this problem, I'm just explaining one of those. Okay. So those are the two reasons why I think it's difficult. So why, given these fairly fundamental problems, why would we even try to do something like this? I do think it's valuable, and this is why. So good old-fashioned linear models, and actually some aren't even old-fashioned anymore, linear models tell us about average behavior.

And that's great because it's very easy to explain averages. So I like to think of linear models as approximate models, but models with very exact explanations. And if you ever took a college statistics class, you'll probably remember how to interpret a linear model.

Interpreting Linear Models

So here's my linear model, this straight line, okay, and it's trying to model the number of purchases, which seems to go up as someone gets a little older and then eventually start to go back down. Okay. The linear model is only able to model a sort of global phenomenon, right? So it can really only make this straight line, the straight line does a good job, it tells us about the average, basically, as someone's age goes up, they buy more things. But it misses this kind of jump, let's say this is when someone first gets a job and starts making money. And then it misses this decrease, let's say as someone retires. Okay.

But it has a very consistent stable explanation. And if we know the functional form or the exact parameterization of the linear model, we can even say things like for a one unit increase in age, the number of purchases increases by 0.08 on average. Okay. You might remember that from high school or college. Now with machine learning, we are, if you know what you're doing, you might be able to build a more accurate model that estimates the actual unknown function for how someone's purchasing behavior changes with age a little bit better. Okay. The problem is these models get so complex. They become difficult to interpret. I can't really, how could I explain this line to you? Like, well, it goes up, and then it kind of plateaus and it goes up again. Thanks.
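That textbook interpretation, "for a one unit increase in age, the number of purchases increases by 0.08 on average," is just the fitted slope of the line, and it can be recovered in a couple of lines. This is a minimal sketch with made-up data (the 0.08 is baked in for illustration), not the data behind the slide:

```python
import numpy as np

# Hypothetical data: age vs. number of purchases (illustration only)
age = np.array([20, 25, 30, 35, 40, 45, 50, 55, 60], dtype=float)
purchases = 0.08 * age + 1.0  # exactly linear here, to keep the example clean

# Fit a one-variable linear model: purchases = slope * age + intercept
slope, intercept = np.polyfit(age, purchases, deg=1)

# The exact, stable explanation a linear model gives:
print(f"For a one unit increase in age, purchases change by {slope:.2f} on average")
```

The explanation is exact precisely because the functional form is known: the slope is the same everywhere on the line.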

Thank you. So it's a little bit hard to describe this black line and, what I can do though, and one technique in machine learning interpretability is to say I'm not going to try to explain this whole line to you, I'm just going to try to explain little parts that are important and the way I do that is with a linear model. So, you can see I might actually get good insight from these approximate explanations of a more exact model. Right. I can now see, oh the slope is increasing very quickly here market to these people. Oh the slope is decreasing very quickly here, maybe cut back marketing to these people. Right. So I think there is some intrinsic value here to the explanations, even if they're difficult to make and or approximate. To be clear, I think both of these types of modeling, exercises, and adventures in explanation are useful.
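The "explain little parts of the line with a linear model" idea can be sketched by fitting a straight line to a small neighborhood of a complex response curve. The S-shaped function below is a stand-in I made up for the purchase curve in the slide, not anything from the talk:

```python
import numpy as np

# A nonlinear "response" function standing in for a complex model (assumed shape)
def model(age):
    return 1.0 / (1.0 + np.exp(-(age - 40) / 5.0))  # S-shaped purchase probability

# Explain the model near age 40 by fitting a line to a small local neighborhood
local_age = np.linspace(38, 42, 50)
local_slope, local_intercept = np.polyfit(local_age, model(local_age), deg=1)

# And near age 60, where the curve has flattened out
flat_age = np.linspace(58, 62, 50)
flat_slope, _ = np.polyfit(flat_age, model(flat_age), deg=1)

print(f"local slope near 40: {local_slope:.3f}")  # steep: market to these people
print(f"local slope near 60: {flat_slope:.3f}")   # shallow: the curve has plateaued
```

Each local slope is an approximate explanation of the exact model in that region, which is the same trade the talk describes.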

Real-Life Uses of Linear Models

Okay. So how can we do this in real life? Well, I think one of the most important ways to start this is by understanding your data, and there's lots of ways to understand data. One particularly convenient way is through accurate data visualization. So there's lots of ways to visualize and understand your data set, these are just two of my favorites. So I very much like the 2D projections that you're seeing on the left, because I projected a 784-dimensional, that's 28 times 28, data set down into two dimensions. Okay. So that gives me a way to actually see and maybe understand things in my data such as hierarchy, sparsity, clusters, outliers, and all these things are things that would be impacting my model and things that I would expect my model to learn if it did a good job.

Okay. So basically we're starting from the very beginning and we're saying I need to get some understanding of my data so I can check that my model understands my data at least as well as I do. Now the graph on the right, it has a lot of different names, I call it a correlation graph. That helps us understand the relationships in the data set better. So the graph on the left is about understanding structure that we would hope a model would learn. The graph on the right is about understanding relationships that a model would learn. And again, I particularly like both of these graphs cause they put high-dimensional information into just two dimensions so that we can look at it, which for sighted people is very convenient. And another thing I like particularly about the correlation graphs is that I can see high-dimensional relationships, right? Just in two dimensions. I can see several different groups of variables that are highly related to each other and related to other groups of variables. And these are the exact types of relationships that we would hope a well-trained machine learning model would pick up on.
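The backbone of a correlation graph like the one described here is easy to sketch: compute pairwise correlations and keep only the strong ones as graph edges. The data and variable names below are hypothetical, invented so that two tight groups of variables emerge:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Hypothetical data: two correlated "groups" of variables plus an unrelated one
base_a = rng.normal(size=n)
base_b = rng.normal(size=n)
data = np.column_stack([
    base_a, base_a + 0.1 * rng.normal(size=n),   # group 1: near-duplicates
    base_b, base_b + 0.1 * rng.normal(size=n),   # group 2: near-duplicates
    rng.normal(size=n),                          # unrelated variable
])
names = ["bill_1", "bill_2", "pay_1", "pay_2", "age"]

# Edges of a correlation graph: variable pairs with |r| above a threshold
corr = np.corrcoef(data, rowvar=False)
edges = [(names[i], names[j])
         for i in range(len(names)) for j in range(i + 1, len(names))
         if abs(corr[i, j]) > 0.9]
print(edges)
```

Plotting those edges with any graph layout gives the two-dimensional picture of high-dimensional relationships the talk is pointing at.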

Training Interpretable Models

Okay. Another way that machine learning interpretability can be practiced is simply by training interpretable models. And I see people taking pictures, please do, but a link to the slides is on the meetup page. So you're welcome to take pictures, but the slides are available as a PDF also. So I'm not going to go through all the links on this page, but every link on this page is open-source software that's capable of training accurate and interpretable models. So no excuses, there's lots of different things to try. In broad classes, these are decision trees and monotonic gradient boosting machines, which are a personal favorite of mine. We'll talk about that a little bit more later; I'll take kind of a sidebar here and talk about monotonicity for a second. Monotonicity means that as one input goes up, the output of the model can only go up, or as one input goes up, the output of the model can only go down.

So this is actually, I think this is a huge part of interpretability. I can say to my boss something along the lines of, as age goes up, the number of items that the customer purchases also goes up. I don't have to say it's a very complex relationship, I can kind of leave those details out, but just that one little sentence seems to break down a lot of doors. There are also lots of new kinds of linear models. I just did a training at a big bank last week, and, you can probably tell, I'm a very sarcastic person, right? So I always show a picture of Gauss in the 1800s and a picture of Trevor Hastie in the 2000s and say, you know, a lot of linear modeling techniques that you guys are using are from the 1800s, while there are options from the 2000s that were invented to address all the shortcomings of the techniques you're using from the 1800s.

So, I just like to point that out. So also rule-based models: very powerful, non-linear, but interpretable, if you can keep them simple enough. RuleFit by Jerome Friedman is something I like to point out to people that's kind of a hidden gem. And another hidden gem I think is super sparse linear integer models. And I don't know of any open-source software for this, but these are very, very powerful models. These are the kinds of models, well, the famous one is for newborns, right? And it's about the color of their skin, their respiration, are they crying, and kind of each of these categories gets a number and then you just add the numbers up, and if it's above some number, it's okay. If it's not above some number, you need to rush them to the NICU. Right. And so super sparse linear integer models are very, very powerful kinds of models that I think people should take more advantage of, because they're meant for people to be able to do the calculations in their heads in the field. Okay. So, I think they're fairly difficult to train and that's maybe why you don't see as many open-source packages around them, but a very powerful technique.
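A scoring system in the spirit of those super sparse linear integer models can be sketched in a few lines; the questions, point values, and threshold below are invented for illustration, not from any real scorecard:

```python
# A toy add-the-points scoring system (hypothetical point values, for illustration)
def risk_score(answers):
    points = {
        "late_payment": 3,      # most recent payment was late
        "high_utilization": 2,  # credit utilization is very high
        "short_history": 1,     # credit history is short
    }
    # Small integer weights mean a person can total the score in their head
    return sum(points[k] for k, yes in answers.items() if yes)

applicant = {"late_payment": True, "high_utilization": False, "short_history": True}
score = risk_score(applicant)
decision = "refer for review" if score >= 3 else "approve"
print(score, decision)
```

The hard part, which the code above skips entirely, is the training: finding small integer weights that are also accurate is a difficult optimization problem, which is the point made in the talk.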

Okay. Back to visualizations. So some of you may know one of the advisors that we work with, Leland Wilkinson, is the original author of The Grammar of Graphics. So I have direct access to one of the best data visualization minds in the world, and I try to take advantage of it. We have two plots here, and again, I've chosen them because there's lots of ways to do this. I've chosen these two because I think that they are particularly useful. So, the plot on the left is what I call a decision tree surrogate model. Some people call them explanatory models. Some people call them shadow models. The basic idea is that they are a simple model of a complex model. Okay. So let's say I trained some big fancy gradient boosting machine with 4,000 decision trees and it took all night to train.

And now I have to go about explaining this. Well, one thing I might do personally, and again, this is an approximate technique. One thing I might do personally is train a single decision tree on the inputs to the complex model and the predictions of the complex model. Not on the actual ones and zeros did they pay their bill or not, you could do that as well. That's totally fine actually. But the thing I like to do is to train a simple model on the inputs to a complex system and the predictions of a complex system. And that way I'm making a simple model of a complex model. And then it's very helpful if that simple model has some kind of convenient graphical form that it can be displayed in. Right.
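That surrogate recipe, train a single decision tree on the inputs to the complex model and the predictions of the complex model, might look like this in scikit-learn (synthetic data; the "fidelity" check is my own addition, not something from the talk):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# The "complex" model we want to explain
gbm = GradientBoostingClassifier(random_state=0).fit(X, y)

# Surrogate: a single shallow tree trained on the GBM's predictions,
# not on the original labels
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, gbm.predict(X))

# How faithfully does the simple model mimic the complex one?
fidelity = (surrogate.predict(X) == gbm.predict(X)).mean()
print(f"surrogate fidelity: {fidelity:.2f}")
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(5)]))
```

The printed tree is the approximate flow chart of the complex model; the fidelity number is one simple way to check how approximate.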

This decision tree I'm just trying to talk about, I think that mic will come on in a second. So what this decision tree tells me is that pay zero in this case, in this model, someone's most recent payment status, is probably the most important variable in the model. And I can see basically if somebody pays their bills, they go down this side of the flow chart, according to the model, and they end up with a low probability of default. But if someone is more than two months late on their first bill, then they go down this high probability of default side of the surrogate decision tree. And again, if variables are below or above each other in the decision tree, then I can see that they're interacting. Right? So that's a big part of this, being able to detect multidimensional interactions. So I can see that there's some three-way interaction between someone's most recent payment, someone's second most recent payment, and someone's sixth most recent payment.

I can't really tell you if that's right or not. The point is this is a tool that would show this to you so that you, as the domain expert, could decide if that was right or not. I do really like, and again I'll emphasize, this combination together. So the thick red line in this plot is something called partial dependence. It basically tells me the average prediction of the model for a certain variable. So when the age is 40, the average prediction of this model is about, you know, 0.25, something like that. Okay. The other lines that you see on this plot are called ICE, individual conditional expectation. And they essentially show how one row, one person in this model behaves. So this would be a very low probability of default person. And if I take that person and I sort of simulate them having different ages, they stay very low probability of default.

What's more, this is all good to see, but the really important thing about ICE is partial dependence can be misleading in the presence of high-degree interactions. And so if I see my ICE lines sort of crisscrossing with my partial dependence line, that tells me, hey, this average behavior might be misleading. And then I can go back and check in my surrogate decision tree model to see if there's interactions going on. And of course you can train it deeper than just three levels, right? I just picked three levels so it fits in the slide. You know, five levels might be good to see these kinds of interactions, something like that. So the key here is using these two visualizations side-by-side to get an approximate overall flow chart of how the complex model behaves, to see how variables behave on average, and then to see how individuals are behaving in the model. And if the behavior for an individual diverges from the average behavior, that's a strong indication that interactions are at play. Okay. So all these visualizations I've shown are ways that I think are good to get peeks into this weird high-dimensional world of machine learning models.
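The ICE and partial dependence computation described here is simple enough to sketch by hand: simulate every row at each grid value of the variable of interest, and average the per-row (ICE) curves to get partial dependence. The toy model below has an interaction I invented precisely so that the average is misleading:

```python
import numpy as np

# Stand-in for a trained model with an interaction: the effect of age flips
# sign depending on a second feature (illustration only)
def model(X):
    age, flag = X[:, 0], X[:, 1]
    return np.where(flag > 0, 0.01 * age, -0.01 * age)

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(20, 70, 100), rng.choice([-1.0, 1.0], 100)])

grid = np.linspace(20, 70, 5)
ice = np.empty((len(X), len(grid)))
for j, value in enumerate(grid):
    X_mod = X.copy()
    X_mod[:, 0] = value          # simulate every row having this age
    ice[:, j] = model(X_mod)     # one ICE curve per row

partial_dependence = ice.mean(axis=0)  # PD is just the average of the ICE curves

# The average is roughly flat even though individual curves rise or fall
# steeply -- exactly the case where partial dependence alone misleads
print(partial_dependence)
```

Half the ICE curves slope up and half slope down, so they crisscross the nearly flat partial dependence line, the warning sign described in the talk.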



Audience Member:



So, first, second, and sixth, that's an interesting structure to build your decision tree chart. Why for second and sixth? Why for second and sixth? That's all.

Patrick Hall:

So I can't, it's a very good question. The point of this, of all these tools, right? I make tools.

Why Is There A Relationship Between Payments?

Yes, sure. The question was why this relationship between someone's most recent payment, someone's second most recent payment, and someone's sixth most recent payment. So the point of this really is to show you these tools so that you can use these tools on your models, and then you will see these kinds of interactions. And then you, as the domain expert, decide: I need to take this variable out, I need to down-weight this variable, something like that. Is that a fair answer?


Sure, sure. Okay. Sure.

Patrick Hall:

All right. Now we're back to this reason code idea.

Reason codes, and I just don't know people's backgrounds, reason codes are typically, roughly related to the beta times x contributions of logistic regression models. Okay. So in the linear model credit scoring world, which, for the record, I think credit scoring at large institutions, not FinTech, but credit scoring at large institutions in the US, is one of the most regulated modeling exercises in the world. And I think that's a good thing. And I think that the credit lenders in this case actually deserve a lot of credit because, for years, they've worked with the different government agencies to make sure, at least to a certain degree, much more so than other people, that their models are fair, non-discriminatory, and explainable. Okay. They're still trying to make money, but they have to jump through a lot of hoops.
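In the linear world, the recipe just mentioned, rank the beta times x contributions and report the top ones pushing toward the adverse decision, fits in a few lines. The coefficients, feature names, and applicant below are entirely made up, and real scoring systems use more elaborate reason-code logic than this sketch:

```python
import numpy as np

# Hypothetical logistic regression credit model (coefficients invented)
feature_names = ["utilization", "late_payments", "history_length", "income"]
beta = np.array([1.2, 0.9, -0.6, -0.4])   # fitted coefficients (assumed)
x = np.array([0.95, 2.0, 0.3, 0.5])       # one applicant, standardized inputs

# In the linear model world, reason codes come from the beta * x contributions
contributions = beta * x
order = np.argsort(contributions)[::-1]   # largest push toward "default" first

top_reasons = [feature_names[i] for i in order if contributions[i] > 0][:5]
print(top_reasons)
```

Each reason is a single-variable contribution, which is exactly the kind of disaggregated explanation regulators expect and machine learning models make hard to produce.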

Open-source Packages for Generating Reason Codes

So, now the question is, if I can take a machine learning model that's, you know, two to five percent more accurate for some large lending portfolio, that's a lot of money. I can make a lot of money or I can save a lot of money, but to do that, I have to be able to generate reason codes for why the model made the decision it made. And for you guys, there's three decent open-source packages for doing this. One is called Treeinterpreter, one is called SHAP, and one is called LIME. And there's a lot going on here, but the reason codes work by sort of giving the input variables attributions for each decision that's made. Right?

If we look at this, this is output from LIME. And I think it's about whether a mushroom is edible or poisonous. And so you can see it's sort of ranked the variables. It's given them a numeric contribution and then used that numeric contribution to rank them in importance in any one given decision. And so we would say, the model says that this mushroom is poisonous, and the reasons why are because it has a foul odor, something about stalk surface, something about spore print, something about stalk surface, you know. So these would be the reason codes, and anyone that's actually in banking is like, give me a break. These would be the reason codes for this machine learning model decision. Okay. So we're trying to, for each prediction, we're trying to give an attribution to the input variables. We're trying to disaggregate this very complex function that interacts many variables. We're trying to disaggregate those interactions into single variable contributions for every decision and then rank them so that I can say, this is the most, this is why we can't give you a loan. Okay. This is why we turned your credit card application down. This is why we're not letting your kid go to this college. This is why we're sending you to jail. You see?

Shapley Explanations

Okay. So, these reason codes are important and there's no silver bullet here. The closest thing I've seen to a silver bullet is SHAP, Shapley explanations. And I can talk about that a little bit more, maybe I should talk about that a little bit more. So, LIME, local interpretable model-agnostic explanations, LIME can be used on any model. Okay. And it's that picture that I showed in the very beginning where I'm fitting a linear model to a very complex function and then using that linear model to build explanations about that local region of the complex function. Now, Shapley is very different and I may say better. Okay. Shapley is really best for decision tree models. Okay. So LIME is model agnostic. It just needs inputs and an output, and it can generate an explanation. That's very convenient.

To get, in my experience, to get the best results out of Shapley, you really need to be using a decision tree model, and Shapley basically takes a row of data and follows it through a decision tree, or 4,000 decision trees, and keeps track of which variables are contributing as the row moves down the decision trees. Okay. Shapley values go back to game theory; Shapley values are the contribution of one individual to the outcome of a game. And so this really smart kid, Scott Lundberg at the University of Washington, was able to sort of tie all of these ideas about attributing local importance to variables in a model prediction back to this strong theoretical framework from game theory. And in our testing, these have been the most accurate reason codes. Okay. It's still a crazy thing to do, to take a model that just combines and recombines variables literally millions of times, and disaggregate that into individual variable values, how they played into each prediction. But if you're going to do that, and if I'm going to do that, I'm probably going to use Shapley. Okay.
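The game-theory definition is concrete enough to compute exactly for a tiny example: a player's Shapley value is their marginal contribution averaged over all orderings of the players. The three-player payout table below is invented, and real SHAP implementations use much faster tree-specific algorithms than this brute force:

```python
import itertools

# Exact Shapley values for a tiny 3-feature "game" (illustration only).
# value(S) is the payout when only the players in S are "present".
def value(S):
    payouts = {frozenset(): 0.0,
               frozenset("a"): 1.0, frozenset("b"): 2.0, frozenset("c"): 0.0,
               frozenset("ab"): 4.0, frozenset("ac"): 1.0, frozenset("bc"): 2.0,
               frozenset("abc"): 5.0}
    return payouts[frozenset(S)]

players = "abc"

def shapley(player):
    # Average the player's marginal contribution over all orderings
    total = 0.0
    for order in itertools.permutations(players):
        before = set(order[:order.index(player)])
        total += value(before | {player}) - value(before)
    return total / len(list(itertools.permutations(players)))

phi = {p: shapley(p) for p in players}
print(phi)
```

One property worth checking by hand here: the attributions sum exactly to the full payout, value("abc"). That efficiency property is a big part of why Shapley values are attractive as reason codes.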

And another thing, so Shapley is in XGBoost. Who uses XGBoost? So go digging around in XGBoost and you'll find Shapley. I don't know if it's documented yet or not, but that's how we do it at H2O. We just went digging through the deep C++ source code of XGBoost and figured out how to get Shapley values in there. They're in there, they're in there, and there's also a Python package. There's also Python packages. All three of these are Python packages. I think there's an R package for LIME also. If you are outside of regulated industry, I think any of these are great, and I think you should try them. If you're in regulated industry, my strong recommendation to you is to check out Shapley.

How Does Shapley Relate to Gini Importance?


How does that relate to the Gini importance we have in random forests? How does that relate?

Patrick Hall:

Sure. So the question is: in random forests we have Gini importance, and how does this relate to Gini importance? Do you mean Gini importance for the whole model? 'Cause that's what I'm going to talk about next. Okay.

Variable Importance

I'm going to try to remember to answer your question. Okay. So there's this older technique that's also very important. Very good. You should use it. We're just trying to decide how important a variable is in a model overall. Okay. And typically this is called variable importance or feature importance. If you're feeling cool, you can say feature, not variable. If you're at a statistics conference, you say variable. If you're at O'Reilly AI, you say feature. So there's kind of a shorthand to how this feature importance works: the higher a variable appears in a decision tree, and the more times it appears, the more numerically important it will be in the model overall. And the lower the variable is in the trees, and the less often it appears, the less important it will be in the model. Okay. So if we go back to the difference between this picture and the previous slide: in this slide, we want to build this type of plot for every single row. Do you want a more technical answer than that?

'Cause I can. So I'll bring it. So in variable importance, we keep track of how many times each variable is used to split and the impact that that split makes on the overall prediction, whether it's Gini or information gain, whatever it is, and we sum that across all the trees, and that's the variable importance. Okay. The basic idea here is that I take a single row, and then I track that one row through all the trees in the forest or the GBM, and I keep track of which variables were used for that row. Does that help? Okay. So basically we want to make this chart for every single row, and the chart can be different for every single row. Okay.
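That per-row tracking can be sketched with a toy tree. This is a hand-rolled illustration in the spirit of Treeinterpreter, not the package itself, and the tree and its values are invented: each node stores the mean prediction of the rows that reached it, and as a row walks down the tree, the feature at each split is credited with the change in that mean.

```python
# A toy tree: leaves are {"value": v}; internal nodes add "feature",
# "threshold", "left", "right". "value" is the node's mean prediction.
tree = {
    "feature": "income", "threshold": 50.0, "value": 0.5,
    "left": {"value": 0.2},
    "right": {
        "feature": "age", "threshold": 30.0, "value": 0.8,
        "left": {"value": 0.7},
        "right": {"value": 0.9},
    },
}

def path_contributions(tree, row):
    """Walk one row down the tree; credit each split feature with the
    change in node mean it causes. prediction = root value + contributions."""
    contrib = {}
    node = tree
    bias = tree["value"]
    while "feature" in node:
        child = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
        f = node["feature"]
        contrib[f] = contrib.get(f, 0.0) + child["value"] - node["value"]
        node = child
    return node["value"], bias, contrib

pred, bias, contrib = path_contributions(tree, {"income": 60.0, "age": 40.0})
print(pred, bias, contrib)  # the 0.9 prediction splits into a 0.5 bias plus per-feature credit
```

The global variable importance described above is the aggregate view across all rows and trees; this per-row decomposition is what lets the same kind of chart be different for every single row.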

Sensitivity Analysis

This is one of my favorite things to harp on. So sensitivity analysis, that just means putting new data in your model and seeing what it predicts. Okay. It's very, very important. And I think this is a great picture. This comes from a textbook from the mid-2000s that says machine learning is not suitable for extrapolation.

So just remember, at all the talks you go to tomorrow that are talking about using machine learning for prediction: you're not supposed to do that. I'm just kidding. I'm just kidding. But it is true that machine learning models can make very strange predictions, especially outside of the range of the variables that they were trained on. Okay. So I highly, highly, highly recommend, if you are using machine learning and don't want to lose an incredible amount of money, that you test, say, incomes 10 or 20% below what you saw during training, and incomes 10 or 20% above what you saw in training, or negative ages, or ages of 120. Unless you test it, there's very little way to know, especially in regression problems where you've used a machine learning algorithm. Like on this slide, the functions try to match the training data, and they will bend themselves into all kinds of crazy shapes to do so.

Okay. They don't know what happens outside the training data. And if you don't test it explicitly, you might have a very unpleasant surprise. And I'll say this: another thing that I see in practice is a customer of ours going through a 30,000 by 30,000 correlation matrix in an Excel spreadsheet, trying to find hidden correlations, before they use a gradient boosting machine. That is exactly backwards. Okay. The reason that you would use a highly regularized machine learning model is so that you wouldn't have to worry about tiny hidden correlations in your data set. So if you are transitioning from the linear model world, where you kind of torture your data set to find these hidden correlations that might make your parameters in production unstable, if you're transitioning to machine learning, you can ease up on that and instead do this kind of testing, okay? Because once it's trained, once you've picked that one model out of a million that's good, that you're going to use, machine learning models tend to be highly stable, especially if they've been regularized, which most decent machine learning libraries allow you to do. You don't have to worry as much about parameter stability. You have to worry more about prediction stability with machine learning models, is my opinion.

Stop going through 30,000 by 30,000 correlation matrices in Excel spreadsheets. And use Excel to do sensitivity analysis for your GBM models.
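Despite the Excel jab, the probing pattern itself is simple enough to sketch in a few lines. The scoring function below is a made-up stand-in for a trained GBM; the point is only the pattern: hold a row fixed, sweep one variable well outside the training range, and read off the predictions.

```python
def sensitivity_probe(predict, base_row, feature, values):
    """Hold one row fixed, sweep a single feature over chosen values,
    and collect the model's predictions for human review."""
    results = []
    for v in values:
        row = dict(base_row)
        row[feature] = v
        results.append((v, predict(row)))
    return results

# Stand-in for a trained model: a clipped linear score on income and age.
predict = lambda r: min(max(0.1 + 0.000002 * r["income"] - 0.001 * r["age"], 0.0), 1.0)
base = {"income": 50000, "age": 35}

# Probe incomes far below and above anything plausible in training.
probes = sensitivity_probe(predict, base, "income", [-10000, 40000, 60000, 1000000])
for v, p in probes:
    print(f"income={v}: predicted {p:.3f}")
```

With a real GBM, the thing to look for is exactly what the slide shows: predictions that bend into crazy shapes, or shoot up or down, once you leave the training range.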


Can you reiterate traditional learning?

Patrick Hall:



Yeah, please. I know nothing.

Traditional Learning

Patrick Hall:

So, the type of models that most people still use are highly susceptible to making bad predictions when they're used on new data out there in the world, right? The model gets trained on my laptop, but then I try to go use it to decide if I should give people loans or not. Right? One thing that can go wrong with the type of models that people mostly use today is if two variables are too related.

Okay. But when we switch over to machine learning models, they don't care as much about how much the variables are related, cause they're going to try relating them all in all these different ways anyway. Yeah. But they can make very crazy predictions. And so you have to test them explicitly to make sure they're not going to make crazy predictions. Let's say the highest income they saw in a training data set was $400,000. Then they come across somebody whose income is a million dollars.

Yes. So, you know, I'm expecting that the line just kind of keeps going like this, but instead it shoots down here, or shoots up there. It's very unpredictable.

It's something you have to watch out for.

Importance of Testing

Another important thing here is that I've talked about how all these explanatory techniques are somewhat approximate. It's very important to test them. So some of the original papers, including that "Towards a Rigorous Science of Interpretable Machine Learning" paper, suggested, and I think this is great if you can afford it, if you have time, basically human trials, right?

You train your model, you show a human the explanations, and then you show the human a new row of data. And you say, tell me what the model's going to do. And if they can tell you what the model's going to do, then we say that's a good explanation. Okay. Obviously, these kinds of human studies are very expensive and time-consuming. So there are more automated ways, and something we've pursued at H2O to test our own interpretability software is simulated data. It's very easy: we'll manufacture data sets where I know that, say, X1, X9, X4, and X8 are important.

And I know the functional form of the model, so I can see, for any given row, what should be more important. And so we can test the explanations against this simulated data. I highly recommend that; it's been somewhat successful for us. Another thing I would recommend is to test that these explanations don't change too much. Right? So if I perturb my data just a little bit, add just a little bit of random noise, and I see my explanations jumping all over the place, that's not a good sign, right? So we need to test for stability. The first one kind of helps us test for accuracy of explanations. The second one helps us test that those explanations are stable. Then the bottom one is just an idea I had, honestly, that you could try if it sounds interesting to you. If I'm some credit rating agency, I basically know what my reason codes should be, and have known for 20 years, for everyone in the US. Right? Then I can kind of start to see when my new machine learning model's reason codes start veering away from that. So if you are working in a business where you have reason codes or local explanations that you trust, because you've learned that they're okay over a period of years, then you can use those as a benchmark, and kind of slowly move away from those, and see how accurate a model has to get before you change away from those existing explanations.
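The first two tests, accuracy on simulated data and stability under perturbation, can be sketched like this. The "explainer" here is a simple finite-difference attribution standing in for LIME, Shapley, or Treeinterpreter, and the data is manufactured so we know only x1 and x2 matter.

```python
import random

# Simulated data with a known functional form: x1 matters most, x2 a
# little, x3 not at all, so we know what a correct explanation looks like.
def true_model(row):
    return 3.0 * row["x1"] + 1.0 * row["x2"] + 0.0 * row["x3"]

def local_attribution(model, row, eps=1e-4):
    """Stand-in explainer: the local sensitivity of the prediction to each
    feature. A real test would call LIME / Shapley / Treeinterpreter here."""
    base = model(row)
    attr = {}
    for f in row:
        bumped = dict(row)
        bumped[f] += eps
        attr[f] = (model(bumped) - base) / eps
    return attr

random.seed(0)
row = {"x1": random.random(), "x2": random.random(), "x3": random.random()}
attr = local_attribution(true_model, row)

# Accuracy check: the ranking should match the known coefficients.
ranked = sorted(attr, key=attr.get, reverse=True)
print("ranking:", ranked)

# Stability check: perturb the row slightly; explanations shouldn't jump.
noisy = {f: v + random.gauss(0, 0.01) for f, v in row.items()}
drift = max(abs(attr[f] - local_attribution(true_model, noisy)[f]) for f in attr)
print("max attribution drift:", drift)
```

The same two checks, a known-answer ranking test and a bounded-drift test, apply unchanged when a real explainer is swapped in for the finite-difference stand-in.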

Oh, and I should say, it's in the book. Get a book if you didn't get one on the way in.

Recommendations for Real-Life Practice

Okay. So here are my recommendations if you're going to do this in real life. So just in general, consider deployment. All those nice little open-source packages that I showed work really well on your laptop. There's no sense of, okay, I've moved this model off my laptop onto my big secure server, and I'm now scoring hundreds of credit card transactions a second, and I need reason codes for those. That's a hard problem. We solved it internally, just kind of by hook or by crook, and we'll sell you software that solves that problem. So come see us later. You can figure out a way to do this also. And this is very typical of open-source machine learning software outside of H2O: I can get the answers I need on my laptop, but when I move it into a production system, I'm kind of left high and dry.

So consider that; look into deployment options before you get too deep into this. Okay. If I was going to do this today, and I didn't work at a wonderful company like H2O that has their own machine learning library and a proprietary package that just does all this automatically, I would go to XGBoost and I would train a monotonic GBM, a monotonically-constrained GBM, and I would use Shapley explanations. That shouldn't be too hard. That's actually a pretty direct way to accomplish most of the things we've talked about in the slides. Okay. It's very important to use both these local, so local basically means the reason codes, okay, so it's very important to use those reason codes and more global explanatory tools, because sometimes the model's behavior in local regions is just too weird.
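For reference, here is one way the monotonic piece of that recipe might look as an XGBoost parameter dict. This is a hedged sketch: the feature names are invented, and actually training with it requires the xgboost package, which isn't imported here. The `monotone_constraints` string takes one entry per feature, in column order: +1 forces the prediction to only rise with the feature, -1 to only fall, 0 leaves it unconstrained.

```python
# Hypothetical features, in the column order the model will see them.
feature_names = ["income", "debt_ratio", "account_age"]

# Sketch of XGBoost parameters for a monotonically-constrained GBM:
# predictions may only rise with income, only fall with debt_ratio,
# and move freely with account_age.
params = {
    "objective": "binary:logistic",
    "monotone_constraints": "(1,-1,0)",  # one entry per feature, in order
    "max_depth": 4,
    "eta": 0.1,
}
print(params["monotone_constraints"])
```

Constraints like these make the global behavior of the GBM far easier to explain and defend, which is why the combination of a monotonic GBM plus Shapley reason codes is the recommendation for regulated settings.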

Okay. Because the model will do things. It'll get one row wrong to get 10 rows right. So are the explanations for that row that it got bizarrely wrong going to be very useful? Maybe, maybe not. So it's very important to have these more trusted, older, more well-understood global methods as a backup to the newer, less well-understood, potentially more fraught local methods. So sensitivity analysis, we talked about sensitivity analysis. That's when you test your model. Okay. This is like production software. It needs to be tested. Something that we like to do at H2O, and I don't know if we made it up or if we read it somewhere, is called random data attacks. We just take our systems and, over the weekend, expose them constantly to random data.
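A random data attack can be as simple as the following sketch. The `predict` function here is a made-up stand-in that quietly assumes income is numeric, which is exactly the kind of assumption this sort of weekend-long hammering is meant to flush out.

```python
import random
import string

def predict(row):
    """Stand-in scorer; a real one would wrap a deployed model. It
    silently assumes 'income' can be coerced to a number."""
    return 1.0 if float(row["income"]) > 50000 else 0.0

def random_data_attack(predict, n=1000, seed=0):
    """Throw deliberately messy rows at the scorer; record every crash
    and every output that isn't a sane probability."""
    rng = random.Random(seed)
    failures = []
    for _ in range(n):
        income = rng.choice([
            rng.uniform(-1e9, 1e9),                           # extreme numbers
            "".join(rng.choices(string.ascii_letters, k=5)),  # characters
            None,                                             # missing value
        ])
        try:
            score = predict({"income": income})
            if not 0.0 <= score <= 1.0:
                failures.append((income, "bad score", score))
        except Exception as exc:
            failures.append((income, "crash", repr(exc)))
    return failures

failures = random_data_attack(predict)
print(f"{len(failures)} of 1000 random rows broke the scorer")
```

In a real system the second failure mode, a model that happily returns a score for garbage input that then drives a business decision, is the more dangerous of the two.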

And we find all kinds of interesting things. So I highly recommend that technique as well. That's not so much about model performance. That's more about, what happens when a variable that's supposed to be a number is a character, but the model still makes predictions, and then you're using that prediction to make some business decision? Right? So random data attacks are a good one that I would recommend. And then test, we talked about testing. By testing, I mean sensitivity analysis, and I mean this last slide. In our experience, we are starting to believe that model-specific local explanatory techniques are more accurate. So again, model-specific techniques are only able to be used on certain types of models. Okay. So Treeinterpreter is an example of a model-specific technique.

It can only be used on a certain type of model, but we find that those techniques, unlike LIME, which can be used on any type of model, we find that if you have the luxury of using a model-specific technique, they tend to be a bit more accurate. All those open-source packages that I showed, I don't think they're being used like the most popular R packages or something like that. They have some fairly noticeable flaws that will become evident to you very quickly. Just for example, when we were trying to do this testing exercise, I couldn't find a single set of open-source packages where I could test LIME, Treeinterpreter, and Shapley on XGBoost. We had to write our own LIME package that actually worked on XGBoost to be able to do that.

So it's very nice software, in some cases written by the leading researchers in the field. It's just, I get the impression that not that many people are using it, and if corporations are using it, they're probably doing their own fixes and keeping them internal or something like that.

Uninterpretable Features

And then, this is a good one. So you have to be aware of uninterpretable features. I go to all these pains to make an interpretable model, but then I have features that are from a, you know, 17-layer stacked denoising autoencoder, or features like distance from a cluster centroid in some kind of principal component space or something like this. So you have to also use interpretable features, or your model won't be interpretable. You know, you can say, oh, as the middle two layers of this 17-layer stacked denoising autoencoder, as that value increases, my model can only go up. Nobody cares. It's not interpretable.

Recommendations For LIME

Okay. So here are some specific recommendations and observations about LIME. To me, the greatest thing about LIME is it can give you an indication of its own trustworthiness. You can look at the fit statistics of the LIME model. And if the R squared is 0.3, and the LIME model is predicting 0.7, and the actual prediction is 10, that's a good indication that the reason codes the LIME model is generating are not particularly accurate, right? So LIME is nice because it can tell you, hey, this is when I'm trustworthy, this is when I'm not trustworthy, at least more so than other techniques. LIME can fail, and I have examples of this that I can show if people are interested. Our guess is it's particularly in the presence of extreme nonlinearity or high-degree interactions, because LIME typically does not consider interactions, and it's a linear model.

LIME is difficult to deploy. So this is, again, that thing where the open-source packages are nice on your laptop. They can be hard to use if I'm trying to tell somebody why I'm not giving them a credit card in real time. The procedure for LIME is to pick a point that I want to explain, draw a sample around that point, fit a linear model in that sample, and then use that linear model to make explanations. So that means when I'm scoring new data, when I run my credit card, I have 300 milliseconds to do all that stuff. Right. And that just doesn't happen. Okay. So we have a variant of it at H2O that is deployable, where instead of trying to form samples in real time, we just use predefined clusters in the training data. It's just something to keep in mind if you're going to use LIME.
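The whole LIME procedure, and the R-squared trustworthiness check from a moment ago, fit in a short sketch for a one-feature model. Everything here is illustrative: the model is a made-up quadratic, and the "explanation" is just the slope of the local line.

```python
import random

def lime_sketch(model, x0, n=200, width=0.3, seed=0):
    """Minimal LIME-style procedure for a single feature: sample around
    the point to explain, fit a line by least squares, and report R^2 as
    a trustworthiness signal for the resulting explanation."""
    rng = random.Random(seed)
    xs = [x0 + rng.gauss(0, width) for _ in range(n)]
    ys = [model(x) for x in xs]
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    ss_res = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# A gently curved model: locally near-linear, so the fit should be good
# and the slope is a reason code we can mostly trust.
model = lambda x: x ** 2
slope, intercept, r2 = lime_sketch(model, x0=2.0)
print(f"local slope={slope:.2f}, R^2={r2:.3f}")
```

Run the same sketch in a wildly nonlinear or high-interaction region and the R-squared collapses, which is exactly the signal for discounting the resulting reason codes.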

And I think LIME is great. I talk to the inventor, Marco, all the time. So another really important thing about LIME is that the reason codes are offsets from a local intercept. What that can mean is, let's say I'm in a local region in some data set because I'm rich, or, you know, in my case that's not true, but let's say I am. It's very possible that the intercept of the local model will just account for that most important local attribute, and then your reason codes will just be offsets off of that. How are we on time? Okay. Oh, I need to go. Okay. All right. So are we real bad?

Yeah, we're bad. Okay. All right. Sorry. Okay. So, yeah, the reason codes are offsets from a local linear model intercept, so just keep that in mind. Always try LIME on discretized features, or try LIME on interactions, right? If you know that an interaction is important, just try it in your LIME model. And then another thing we like to do is use cross-validation to construct standard deviations for our confidence intervals. Sorry, we're getting pretty deep; this is the part of the presentation for the people who knew all the stuff I talked about for the first 50 minutes.

Recommendations For Other Popular Packages

Okay. So, here's my recommendation on the other two popular packages. So the problem with them is they can't tell you when they might fail and that's kind of troubling, right?

We don't expect things to be perfect all the time. But we don't know when not to trust these types of explanations. We have seen in particular that Treeinterpreter does appear to fail with regularized XGBoost models. So if you're using L1 and L2 regularization with your XGBoost models, which is a good idea, your Treeinterpreter explanations can be numerically odd. And I filed a bug on the Treeinterpreter repo. They haven't gotten back to me.

Shapley does handle regularization. And I think if I was working at a bank or a credit rating agency or an insurance company or a hospital, Shapley's the one that I would go to, okay? It has strong theoretical support, and the implementation that we've looked at seems extremely robust. We have not seen it fail, and we've tried it in very, very sophisticated scenarios. And then, instead of being offsets from a local intercept, the reason codes are offsets from a global intercept, a global average value, which I think actually makes a lot more sense. So, okay. We're done. Go to this web page. There's more information than you ever wanted to know.


All right. We can take a few questions, cause we're going to get kicked off here. Any questions?

Mic? Thank you. All right.

Explainability for Blocking and Allowing Purchases in Fraud Detection


Hi, thank you for the presentation. So, I actually worked on a case for fraud detection where we wanted to give some explanations for the decision, and you're always trying to give the two sides of the equation. You typically have a score, and it's a gray area. You can have reasons for blocking someone from making a purchase and reasons for not blocking them. How do you deal with these two sides of the equation, where you're not just explaining why you're blocking, but you're also explaining why you might not want to block?

Patrick Hall:

So we, in our software, just always give both. Okay. But I think that the standard practice is to say, oh, the model says this person's really likely to default, so we give the reasons they're likely to default, right? This person's really likely to pay their bills; these are the reasons they're likely to pay their bills. But we just show them all in our software.


Do you deal with situations where you're giving explanations for the two sides that are contradicting in some sense?

Patrick Hall:

Yeah, we just don't. Come to the booth tomorrow and I'll show you what we do. We just show all the reason codes. That's all.


All right. Two more questions.

How Is Model Interpretability Being Practiced In AI?


So, apart from LIME or Shapley, how is model interpretability being practiced in AI nowadays?

Patrick Hall:

Well, like I said, a lot of the academic research is in deep learning, right? You can go drown yourself in interpretability-for-deep-learning papers on arXiv or in JMLR, NIPS, wherever you want. So if you're interested in deep learning, it's great, because there's a lot of academic papers. As I mentioned at the beginning, for tree-based methods, rule-based methods, things that tend to be used in commercial practice, that seems to be done more by software vendors, at least in New York City. And I think the booklet gives a pretty good summary of all the different techniques that people are using.


All right. Last question. We'll be around for, no. Okay. You'll be the last one. We're going to get kicked out, but we'll be around to answer questions at the end. Patrick and Navdeep and our team members are here.

Patrick Hall:

And we'll be at, I'll be at the H2O booth tomorrow morning.

When Should the Community Contribute to Human-Life-Critical Models?


So I have a question about human-life-critical models, and here we're talking about medical, but also certain scientific applications, let's say a launch to Mars, where there are human lives at stake. At what point does the community contribute to the discussion about what that looks like? What exactly that looks like, what the linear model versus the machine learning model looks like, and which one we trust?

Patrick Hall:

So, I think in these really crucial situations, I would just do both. I think that's the best, that's the safest thing, right? The safest thing is to have a very interpretable model and then a more accurate model, and try to get interpretations of both of them. That's what I would do in that situation.


All right. Thank you, Patrick.

Patrick Hall:

You're welcome.