Interpretable Machine Learning
While understanding and trusting models and their results is a hallmark of good (data) science, model interpretability is a serious legal mandate in the regulated verticals of banking, insurance, and other industries. Moreover, scientists, physicians, researchers, analysts, and humans in general have the right to understand and trust models and modeling results that affect their work and their lives. Today many organizations and individuals are embracing deep learning and machine learning algorithms, but what happens when people want to explain these impactful, complex technologies to one another, or when these technologies inevitably make mistakes?
Patrick Hall, Senior Director of Products, H2O.ai
Let's go ahead and get started. I'm Patrick Hall. I'm the speaker. I'm a co-organizer of this meetup, which is sponsored by the company I work for, which is H2O.ai. I am a Data Scientist at H2O, and I'm also adjunct faculty at George Washington. Honestly, this is the second time I've given this talk today, so if you guys want to jump in, the more interactive, the better. Please just ask questions as we make it through. I'm going to have to assume some knowledge of machine learning and predictive modeling, and statistics. But don't be shy about asking questions. And I'll say one more time: I'm really sorry; there's no drinks. If you want a drink, there's a cafe around the corner.
If you go out these doors and go around the corner, that's Dukas, and there's a little cafe where you can get a drink. The idea tonight is to discuss several different ideas about machine learning interpretability and what this deck is. My job at H2O, one of the things I'm tasked with is sort of being a product manager and a developer for software around machine learning interpretability. How are we going to make these machine learning models that we're developing? How are we going to make them interpretable to users? What you're seeing is the results of the literature search that I and my colleagues did before we got started making software. Before we got started making software, we spent about three months just trying to read up on everything that was out there.
This is not everything that's out there, but maybe it's 60% or 70% of what's out there. I should say the field is very hot right now. People email me papers almost every day. I think people are finally starting to catch on to the fact that if a model isn't interpretable, then in many cases, it's not very useful. I've been working with customers doing machine learning in two different jobs now and in many different countries. Machine learning fails in large organizations for two reasons, and it doesn't always fail, of course. But the reason I see it fail is that people underestimate how important it is to be able to explain the model that they built to their boss and to their colleagues, and to their regulators, and people underestimate how difficult it will be to deploy these complex machine learning models. We're not going to talk about the deployment part tonight, even though that's another thing I like to talk about. But we're going to talk about how to make these complex, non-linear machine learning models more interpretable. I should say this presentation is kind of a collection of tricks and some good tricks that we learned and that I've picked up in the past couple years. And then some really rigorous mathematical ideas also.
The talk is divided into three parts. Another thing I should mention is I'm only going to talk for an hour. I don't really want to torture people. We'll try to get through as much material as we can, but in case you've read the article and have questions, please ask me. A couple people have told me it is a really long article. We almost considered making it into a book. Everything in the talk should be in this article if you want to follow up on it. There are nice links and references in the article too.
The talk is divided up into three parts. You can read them, but the idea being that for me at least, and for most sighted people, it's very important to be able to see your data in two dimensions so that you can have an understanding of what your model should actually be doing. The idea is I see my data, and I understand it better, and then I can check and see if my model actually modeled my data correctly. The second part of the talk is about things I've seen people do or ideas we might have about how to mix machine learning and linear models, or how to improve linear models with machine learning, or how to make machine learning more interpretable so that it's suitable for use in regulated industry. There are people that just can't use some sort of complex, non-linear gradient boosting machine, random forest, neural network type models at their job. But they still have to build more accurate models than they did last year. What are they going to do? Eventually, they run out of stuff to try. Maybe here's some new stuff to try. Then the last part of the talk is about interpreting very complex, non-linear, non-monotonic machine learning models or machine learning response functions. Before we get started, can I not change the slides?
Framework for Interpretability
I think I'm a little older than I look, is what I get told very often. I learned scientific programming in Fortran, and things like that. When we first started talking about what interpretable means, that ended up being a question that we spent quite a long time trying to answer. It's not straightforward: what does interpretable mean? What makes a model interpretable? We came up with these different dimensions of what interpretability would be. One of the ideas we came up with that's really important was sort of the scale of interpretability. Linear and monotonic models are the easiest to explain, we think. This would typically be linear regression, logistic regression, maybe naive Bayes.
Then it turns out there's this sort of middle level of models that aren't so bad to explain at all. Those are non-linear but monotonic models. What does monotonic mean? What we mean by monotonic is that the change, the behavior of an input variable with respect to the target variable, either always increases or always decreases when I change the magnitude of the input variable in one direction. We actually ended up feeling like the monotonicity of the inputs with respect to the targets was probably more important than linearity when it came to explaining these complex functions. Now the hardest class of functions to explain are both non-linear and non-monotonic. If we train a neural network or a gradient boosting machine or a random forest, very often, we do end up with a non-linear, non-monotonic response function. How do we explain that? We'll get to that.
Another idea that we ran into a lot in the literature and then just made sense to us was this idea of global interpretability versus local interpretability. It's oftentimes not difficult at all for me to give you a sort of approximate explanation of how a very complex model behaves. I can tell you, in general, on the whole, on average, this is how this model behaves. But if I want to tell you exactly what it's doing, sometimes I have to sort of zoom in. I have to zoom into some sort of local place in the model. What does that mean? It can mean different things. Local could mean like the top five percentile of the predicted response. Local could mean like all the customers in a zip code that you're predicting on. It could be a cluster in the input data set. Local has a lot of different meanings. It's a simple idea. It goes back to real analysis, which I was terrible at. The Stone-Weierstrass theorem says that any continuous function can always be approximated by a polynomial, and a linear function is a polynomial. Essentially, if I zoom in close enough to any complex response function, I can explain it to you with a linear model. That's the idea of some popular work that's come out recently.
The application domain tends to be very important. This idea of model agnostic versus model specific. If I pick an application domain like credit scoring and then I pick a model type like neural networks, it might actually be easier for me to make a credit scoring neural network that's interpretable. That's one thing we'll talk about later. It's harder and potentially more useful, or maybe not more useful, but we think it's a little bit more useful and more difficult to make tools that help explain any kind of model across any kind of domain. There's just this idea of whether I want a model specific explanation or a domain specific explanation or I truly want to make some kind of tool or use some tool that explains all kinds of models and all kinds of predictions. That's what we mean by model agnostic versus model specific.
Then another thing we thought about and sort of came to believe is that trust is different than understanding, and both are important. I can trust a model that I don't understand. If I have a model that's been out in production, it's been making me money or saving me money for years. Never had any problems with it, but I don't fully understand it. I might end up trusting it. I can certainly understand a model and not trust it. I can understand the complex inner workings of the model, and then through that understanding, know that this model isn't going to behave well and then not trust it. Understanding to us means getting some insight into the complex inner workings of the model. Whereas trust means I think the model is dependable and going to behave the way I expect it to. Those are different things, and both are important.
I'm going to skip right to visualization. This is just an idea that we like for visualizing data. Of course, there are many ways to visualize data. The techniques I've picked to highlight here, I feel that they're really good at showing the entire data set in two dimensions. To me, this is more appropriate in the context of machine learning because one way machine learning models are different from linear models is that they tend to learn high-degree interactions implicitly. Decision trees, neural networks, random forests, gradient boosting machines, they're all out there learning interactions in the dataset. If I'm looking at every variable as an individual density or histogram, that's not perhaps the most useful thing in the context of machine learning. The graphics and charts that we've picked to highlight here are things that we think show the entire dataset in two dimensions, which is nice, to be able to look at everything in two dimensions.
Some people want to actually put on VR goggles and go exploring through the dataset. Some people like 3D; I don't. I just like two dimensions. You might feel differently, though. If you're looking at this and wondering what a glyph even is: glyphs are essentially just small pictures that can have different colors, different textures, and different orientations to represent different attributes in a data set. What we have here is just sample data from Kaggle, and we see that this combination of blue, light green, and teal indicates an older version of Windows and the newest version of Internet Explorer. These are sort of like incoming traffic to a web server. What I can see is, I think I can see clusters.
This helps me look for structure in the rows of the data set that my model should be picking up on, like clusters, outliers, hierarchy, sparsity. What I can see here, I think I see a pretty distinct cluster. These three columns would seem to me to make a pretty distinct cluster in the dataset. If I was to do clustering on a dataset, it seems to me that these three columns and whatever rows they represent would be grouped into one cluster. This gray with light green and sort of teal again, this is the Apple stack, iOS and Safari. Seems like that would be another potential cluster. If I'm interested in incoming traffic from bots, that's these red and yellow dots. If this is anomalous or something I'm worried about, these stand out very clearly.
Oftentimes in machine learning projects, we're trying to model rare events. This is an interesting, visual way to call attention to rare events. I think it's a lot easier to see a rare event here for me than it would be just scrolling through an Excel spreadsheet or looking at calling head or tail, my Unix terminal. But again, this is not some groundbreaking idea. It's just the idea that it's easier to understand data when we see it for most of us. And this is a nice way to look at it and summarize it, and potentially see complex structures like clusters, hierarchy, sparsity, and outliers in a high dimensional data set. But to see that in two dimensions.
This is something we're actually working on at H2O. I think it's really an important type of data visualization. We're calling this a correlation graph or a network graph. It's been around for a while. There's several names for it. The last graphic I showed was about understanding the structure of the rows of the data set. This graphic is about understanding the structure of the columns of the data set. What you're looking at: each node in this undirected graph is a variable in the data set. And this is something to do with loans. I think this is free data that Fannie Mae or Freddie Mac put out. Each node in this graph is a variable in the data set. Then the links between them are their absolute Pearson correlation above some threshold. If I see a really thick link, that means that these two are highly correlated. With my eyes, I can actually see the high degree interactions that my machine learning model should be learning from the data. I think that's why this is such a powerful graphic that's well suited for machine learning.
There's many different ways you can do this. I work with this very famous visualization guy, and I know he doesn't like the way I do this, so maybe don't do it this way. The size of a node here is its degree: how many other variables are correlated with it. Then the colors come from a graph community algorithm, which in this case, you could think of as variable clustering. These are clusters of correlated variables. The variables with the same color should have something to do with each other. We can see very central variables like maturity date, original loan term, original interest rate. Then we can also see sort of unimportant variables out here. These are variables that you would not expect to have high importance in your model. If your model had shown these with high importance, that would make me question the model. It would decrease my trust in the model. Then you can think of it as variable selection. If I'm doing some kind of standard like stepwise variable selection, I would probably look to get the biggest variable out of each of these different color groups.
Now, if I was doing something fancier like elastic net variable selection, I might expect to see big chunks of these different color groups coming into the model at the same time. If I'm doing a decision tree based model where I can calculate variable importance, again, I might expect to see some of these larger variables up in the top of the variable importance. If I'm doing a decision tree, I might expect to see some of these important interactions like original interest rate, original loan term, maturity date. I would expect to see some of these variables above and below each other in a decision tree. Again, the idea is to understand your data so that you can then model the data.
The size of the circle: in this case, what I've selected, you may select something different, but what I've selected is the degree. How many other variables is it correlated with? Interest rate has many connections (the graph isn't directed, so there's no incoming or outgoing), and that's why it's big. And Channel R over here only has one connection, and that's why it's small. Any other questions? I cannot figure out how to make the slides go forward. Is that what's going on? Isn't that what I'm doing? Is that what you mean?
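The construction described above can be sketched in a few lines. This is a minimal, self-contained stand-in, not the Fannie Mae data: the column names and the 0.5 threshold are made up for illustration, and the "graph" is kept as a plain adjacency table rather than a drawn network.

```python
import numpy as np
import pandas as pd

# Toy stand-in for loan data: three columns driven by a shared factor
# (so they correlate) and one independent column. Names are hypothetical.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
df = pd.DataFrame({
    "orig_interest_rate": base + rng.normal(scale=0.1, size=500),
    "orig_loan_term":     base + rng.normal(scale=0.1, size=500),
    "maturity_date":      base + rng.normal(scale=0.1, size=500),
    "channel_r":          rng.normal(size=500),
})

corr = df.corr().abs()                    # absolute Pearson correlation
threshold = 0.5                           # assumed cutoff for drawing a link
adj = (corr > threshold) & (corr < 1.0)   # adjacency, dropping self-loops

# Node "size" = degree: how many other variables each one is linked to
degree = adj.sum(axis=1)
edges = [(a, b) for a in corr.columns for b in corr.columns
         if a < b and adj.loc[a, b]]
print(degree.sort_values(ascending=False))
print(edges)
```

The three correlated columns end up with degree 2 each and form a tight clique, while the independent column floats off with degree 0, which is exactly the visual pattern the talk describes for central versus unimportant variables.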
I'm recording this, and I had to set some special settings, and I think that just messed it up. 2D projections. Again, this is about understanding structure in the rows of your data set that your model should pick up on. This is the famous MNIST data set. It's digits zero through nine, and it's originally 784 dimensions wide, which is fairly wide. We want to project it down to two dimensions so we can see it with our feeble human eyes. What can I tell from these projections? Well, I can tell that there's some clustered structure in my data set, and maybe I should try to take advantage of that when I'm building a model. I can see that these purple triangles, which represent zero, are far away from the blue circles, which represent one.
I can see in the principal components, the linear projection, the sort of fast and dirty linear projection in the middle, everything's sort of overlapping. In this more complex neural network projection, I can begin to see some of the clusters of digits separate out. Other things that are interesting to me here, I would be interested, and it turns out that they are in this sort of academic case, these outliers from this projection. Are they the same as the outliers from this projection? Then I would feel certain that these are outliers. If I'm seeing a lot of outliers, then I should know to be wary of using a squared loss function. Because when I have outliers, my outlying observations will cause my squared loss to be very high and give undue influence to these points. If you don't want them to have undue influence, that's something you should think about. Again, I can see some clustered structure. Maybe I would want to build a separate model for each cluster. Again, the idea is to try to get some understanding of the structure of the rows of your data set before you go to build a model so that when you build a model, you have something to expect, something to test the model, some expectation to test the model against. The glyphs, the correlation graphs, and the projections I would call data visualizations. The next two, I would call model visualizations.
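The fast-and-dirty linear projection mentioned above is just principal components. Here is a sketch with scikit-learn; random cluster data stands in for MNIST so the example stays self-contained, and the two clusters are an assumption built into the toy data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy "wide" data: two well-separated clusters in 50 dimensions,
# standing in for the 784-dimensional MNIST digits.
rng = np.random.default_rng(1)
cluster_a = rng.normal(loc=0.0, size=(100, 50))
cluster_b = rng.normal(loc=5.0, size=(100, 50))
X = np.vstack([cluster_a, cluster_b])

# Project down to two dimensions for our feeble human eyes
proj = PCA(n_components=2).fit_transform(X)
print(proj.shape)   # 200 rows, 2 coordinates each, ready to scatter-plot
```

Plotting the two columns of `proj` against each other (colored by cluster or digit label) is the picture the talk is describing; swapping PCA for a non-linear method like t-SNE gives the "more complex" projection where clusters separate further.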
Partial Dependence Plots
Partial dependence plots have been around for a long time. Some people know about them, and some people don't. We have them in H2O. They're really easy to do in H2O. I know this is small and hard to see. This is just a famous thing from a textbook, a very good textbook. Home value, we're trying to predict home value. What we're seeing is as median income goes up under the model, home value also goes up. When median income is four here, that's just some scaled value. The partial dependence is I take the validation data set, or the test data set, or even the training data set, and I take every row of median income, I erase what's there, and I put four there.
I run every row through the model, and I get a score for every row. Then I take the average of that. That's this point right here. You can see it will be very time-consuming to generate these plots, and that's the main drawback of them. I mean, another drawback is they're only telling you the average behavior, but usually, that's not such a bad thing. What this is telling me is that, on average, when median income equals four, the scaled response from my response function is zero, which is just some scaled value. Let's say when median income is $60,000, home value is $200,000, something like this. That's what this plot is telling you, and it tells you exactly how the function behaves, the function that you've trained, and that's the powerful thing about partial dependence plots.
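The recipe just described (erase the column, put a fixed value there, score every row, average) is easy to write by hand. This is a sketch on toy data; the three columns and the fact that column 0 plays the role of median income are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy data: column 0 ("median income") drives the target
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(500, 3))
y = 2 * X[:, 0] + rng.normal(size=500)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

def partial_dependence_point(model, X, col, value):
    X_mod = X.copy()
    X_mod[:, col] = value            # erase what's there, put `value` there
    return model.predict(X_mod).mean()   # score every row, take the average

# One model rescore per grid point: this is why the plots are slow to build
curve = [partial_dependence_point(model, X, col=0, value=v)
         for v in np.linspace(0, 10, 5)]
print(curve)
```

Since the toy target rises with column 0, the curve rises across the grid, which is the kind of sanity check against domain knowledge the talk recommends. In practice, `sklearn.inspection.partial_dependence` (or H2O's built-in plots) does the same computation.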
We can even make these two dimensional partial dependence plots. Here we have house age and average occupancy, and what we can see is that there's an important interaction, an important non-linear interaction between house age and average occupancy. Down here, when average occupancy is around five people in a house, the age of the house doesn't really seem to have much effect on the value of the house. But when average occupancy is only two people living in a house, then we can see there's this complex, non-linear dependency between house age and average occupancy. There's some kind of important interaction, some important non-linear interaction in between these two variables. Partial dependence plots, in this case, point that out very sort of starkly and very clearly.
Now, of course, like I said, there are drawbacks to these plots. Like I can only see two degree interactions. If I have a neural network that's fully connected and I have a hundred hidden units in my first hidden layer, then I have a hundred degree interactions that I'd be interested in. Can't see those, we can only see the second degree interactions, but still, if I know house age and average occupancy are important in my model, I might want to take a look at these graphs and make sure they're behaving the way I think they are. As median income goes up, house value goes up. That makes sense, and then it seems to plateau. Does that also make sense? I think the powerful thing about these plots is you can generate them for the models that you train, and then you can compare it to your domain knowledge or some expectation you have.
This is old, this is really old, residual analysis. I want to plot the predicted value of my model, that's on the x-axis, against the residual. For linear regression, Y minus Y-hat, or Y minus Y-hat squared. A good machine learning model should model out everything except random error. When I do this, I should just see a random distribution of points. Again, this allows me to see my data, or a lot of my data, in two dimensions, and outliers become visible. If I'm seeing strong patterns in my residuals, this is a dead giveaway that there's something wrong with my model or something wrong with my data. This means there's some kind of relationship between these rows that my model is not picking up on.
It's very hard to diagnose exactly what that is. You can break the plots out variable by variable. That can sometimes give you an insight like in these two input variables. We see the same strong linear pattern. That can give us some insight into what's going on. I think, in this case, it turned out that it was a categorical variable that was being treated as numeric or something like this. The idea is my residual should be randomly distributed if I have an accurate model. If I see strong patterns in my residuals, I can go hunting for them and hopefully get some insight about how to correct them. Again, I can see all my data in two dimensions or sample my data in two dimensions. Things like outliers become immediately obvious. Any questions about the visualizations before we move on?
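A small sketch of the idea: fit a deliberately misspecified model (here a straight line through a quadratic relationship, which is an assumption of the toy setup) and confirm that the residuals carry a strong pattern instead of random noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data with a curvature the linear model cannot capture
rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=(400, 1))
y = x[:, 0] ** 2 + rng.normal(scale=0.1, size=400)

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)     # what you'd plot against y-hat

# The dead giveaway: residuals correlate strongly with the missing term,
# instead of being randomly scattered around zero
pattern = np.corrcoef(residuals, x[:, 0] ** 2)[0, 1]
print(round(pattern, 2))
```

Scatter-plotting `model.predict(x)` against `residuals` shows the tell-tale U-shape; for a well-specified model the same plot is a structureless cloud.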
Last slide. Narrow down.
You would at first want to plot, so on this axis, we have the deviance, which for linear regression is just Y minus Y-hat. Then here, we have predicted values. If this looks good, then you don't have to go hunting. If it doesn't look good, then unfortunately, you do have to go hunting. I would start with the important variables. I would start with things with high variable importance. Big data is a pain, though. One thing I like to point out is how we make these graphs; there's something for this in H2O, and H2O is totally free. You can go download it. I'm not going to do too much of an H2O commercial.
We have an aggregator algorithm, and it's specifically meant for visualizing big data. It does a very complex density-based sample. You can take like a million rows, represent them with 5,000 rows, and do a little bit better than a random sample. That's a good option if you're like, well, I have a million rows and my screen doesn't have a million pixels. What am I going to do? Well, you sample it, or you can go get this nice aggregator from H2O. But good question. Any other questions?
OLS Regression Alternatives
All right, let's keep moving. All right. I work in a regulated industry. We're never using neural networks, gradient boosting, random forests, or hidden Markov models, or whatever. We're just doing linear models. That's fair enough. One thing: the method of least squares was invented in the 1800s. Now in 2017, I'm trying to convince some banking customers that it's okay to use regression methods that were invented in the early 2000s. They're not quite convinced, but I feel like we can keep working on it. I mean, a great deal of important research has happened in linear modeling in the past 20, 30 years. One very important thing is penalized regression.
I could give a whole lecture on penalized regression, but the idea is that I constrain the values of my regression parameters in a certain way that makes my model more robust. I'll leave it at that. It's especially good for wide data. Penalized regression can work when you have more columns than rows. In this case, if I have a thousand variables and I'm doing stepwise variable selection, I have to specify some alpha parameter to let variables in and out of the model. If I specify it as 0.05, then I've basically said, I'm okay with making 50 wrong decisions about variable selection. Penalized regression has a way to totally avoid this problem. Penalized regression, in general, is robust to correlated variables.
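A sketch of the more-columns-than-rows setting, where ordinary least squares breaks down but an elastic net (one common penalized regression, with both L1 and L2 penalties) still fits and does its own variable selection. The data, penalty strength, and which two columns carry signal are all assumptions of the toy example.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# 50 rows, 200 variables: OLS is unusable here, penalized regression is fine
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 200))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=50)

# alpha controls penalty strength; l1_ratio mixes the L1 and L2 penalties
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
selected = np.flatnonzero(model.coef_)   # L1 zeroes out most coefficients
print(len(selected), "of 200 variables kept")
```

The L1 part keeps the two genuine signals and drops most of the noise columns, with no stepwise sequence of accept/reject decisions at some alpha level. The L2 part is also what gives the robustness to correlated inputs mentioned above: correlated variables get shrunk together instead of destabilizing each other.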
If I want to use age and years of work experience in my model, I can usually just go ahead and use both of those and not have to worry about numerical stability problems. Some of my students are here; like I tell them, in 2017, I just don't understand why you would be doing ordinary least squares regression. But far be it from me. I would be doing penalized regression. Another nice, newer regression technique is generalized additive models, a very cool idea. I can tell my model that for some variables, just go ahead and fit a standard linear relationship with one coefficient. For other variables where I don't know if there's a linear relationship or not, I can say fit a 3,000-knot regression spline. I can get these very complex relationships between inputs and the target for the variables where I allow it to fit a regression spline.
In some cases, this is perfectly fine for regulated applications. And in other cases, the regulators might come back and say, no, I don't understand what's going on, or I don't trust this. If that's the case, sometimes you can get lucky. If I look at this, this basically looks like the log of the variable, the log function. If I just say this tail over here is noise, I might replace the 3,000-knot spline for this variable with a log transform, something like that. Sometimes you can get lucky doing GAMs that way. But in general, GAMs are really nice because they sort of allow you to say, fit a linear model for these variables, fit a non-linear model for these variables. That's a nice trick.
Quantile regression, I think, is also very important. Standard linear regression or logistic regression fits the mean of the conditional distribution. In some cases, I just don't think that is that useful. What quantile regression allows you to do is fit your 90th percentile of best customers or your 90th percentile of best assets, your 10th percentile of worst customers, or your 10th percentile of worst assets. You can build separate, interpretable linear models for these different percentiles of your portfolio of assets or your customer market. You can have interpretable models for what's driving your best customers, what's driving your worst customers. If you're in a profession, job, or place where you need to use linear models, be sure to check into what's going on, because there have been some really important breakthroughs over the past few decades.
What do Machine Learning Models do Differently Than Linear Models?
Another trick that people use out there is, if you know what you're doing, your linear model should not be that much less accurate than your nonlinear machine learning model that's hard to interpret. What a lot of people will do is they will build a machine learning model, a gradient boosted ensemble, and then they'll get a benchmark for the accuracy. Then they'll try to build their linear model up towards that level of accuracy. Basically, the things you can do are to add interactions. What do machine learning models do differently than linear models? Machine learning models tend to implicitly learn high degree interactions, and they tend to build non-linear response functions. Add non-linearity by piecewise linear modeling and add interactions.
Where would you know how to find the interactions? Well, you can fit a decision tree, plot the decision tree, see which variables are above and below each other in the decision tree, that can show you interactions. Add some of those interactions into your linear model. How do I know where to build piecewise models? Well, you can go back and look at plots from GAMS or look at partial dependence plots and get an idea of where you might need different models. This down here, this could be one linear segment, and then this here would be another higher slope, linear segment, something like that. As long as you are fairly careful and conservative in the number of interactions and number of piecewise elements you introduce into your model, you can still have a fairly interpretable model that's extremely accurate.
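The fit-a-tree-to-find-interactions trick can be sketched directly: variables that split above and below each other in a shallow tree are interaction candidates, which you then add to the linear model as explicit product columns. The toy data (a pure two-way interaction between columns 0 and 1) is an assumption for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Toy data where the signal is an interaction between columns 0 and 1
rng = np.random.default_rng(7)
X = rng.uniform(size=(1000, 3))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.05, size=1000)

# A shallow tree: the variables it stacks are interaction candidates
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
used = set(tree.tree_.feature[tree.tree_.feature >= 0])  # features split on
print(sorted(int(f) for f in used))

# Add the suggested interaction to the linear model as an explicit column
X_aug = np.column_stack([X, X[:, 0] * X[:, 1]])
lm = LinearRegression().fit(X_aug, y)
print(round(lm.score(X_aug, y), 2))
```

The resulting model is still a linear model with readable coefficients; only a handful of carefully chosen interaction or piecewise terms have been added, which is the conservative approach the talk recommends.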
Small Interpretable Ensembles
Another idea, and many people who have just been in the field for a really long time and they have really great domain knowledge, and they really know what they're doing and they just can't make a single model anymore accurate. In this case, what can you do? Well, one thing you can do is you can combine models. Just an example approach I might suggest is, let's say I have a decision tree that's really good at just making accurate predictions for general everyday situations. Then I have a logistic regression that's really good for rare event detection. When I go to make my final decision about a new row of data, I can give the probability or the prediction from the decision tree a 90% weight and the prediction from the logistic regression a 10% weight.
Then again, I can go back and use partial dependence plots to know if I retained the linearity that my regulators or my bosses want me to have, or the monotonicity that my bosses or my regulators want me to have. I think the idea of combining models is very powerful. If you're like, well, I'm just never going to get away with picking that one model gets 10% and the other model gets 90%, there's a very mathematical way to do this. It's called stacked generalization, or a newer iteration of this called Super Learner. The idea is that I use another model to decide how to weight the constituent model predictions. Combining models this way is something that will make your models more accurate. I think the trick, if you're trying to keep them interpretable, is not to use too many models and not to use complex models, but to try to combine simple models.
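The simple weighted blend described above, 90% decision tree for everyday cases and 10% logistic regression, can be sketched in a few lines. The weights and the toy classification data are assumptions; in practice stacked generalization would learn the weights with another model instead of fixing them by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Two simple, individually interpretable models
X, y = make_classification(n_samples=500, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
logreg = LogisticRegression(max_iter=1000).fit(X, y)

# Hand-picked blend: 90% tree, 10% logistic regression
blend = 0.9 * tree.predict_proba(X)[:, 1] + 0.1 * logreg.predict_proba(X)[:, 1]
accuracy = ((blend > 0.5) == y).mean()
print(round(accuracy, 2))
```

To go from this hand-picked blend to stacked generalization, you would fit a second-stage model (e.g. a logistic regression) on out-of-fold predictions of the constituent models, as in scikit-learn's `StackingClassifier`.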
This is probably one of the most powerful techniques. What I'm showing on the slide is a public use case from Equifax. It's patented. You can't do it, or maybe you can, but don't say I told you you could. The idea is that I can enforce monotonicity between my inputs and my targets either by data preparation tricks. Binning or discretization is often one way that I can enforce monotonicity between my inputs and my target. Or you can change the architecture of your model such that you enforce a monotonic relationship between inputs and targets. Equifax is doing this for neural networks. They have single-layer monotonic neural networks. You can do really interesting things.
One thing is in a monotonic neural network, all the weights will be positive, and generally, the inputs are scaled to be all positive too. But the idea is that I can see what's the biggest weight coming into my output unit. Okay, it's this one. Then I can see what are the biggest weights coming into this input unit. I can pick up on these 5 or 10 degree interactions automatically. I can automatically programmatically trace back through the neural network, find what's the biggest input, what's the biggest weight coming into my output unit, and then trace that back to a hidden unit and say what are the biggest weights coming into this? In this way, I can automatically find very high degree interactions. That's one really interesting thing you can do.
Another really interesting thing you can do, whether you have a neural network or not, when you have a monotonic model, and this is what credit scoring people have been doing for years with logistic regression, is you can give these turn down codes or reason codes. This is what I'm pushing for so hard, and other people are pushing really hard for this. For every decision that a model makes, for that row, rank the importance of the variables. When you get your credit report, it says these are the three things that made your credit go up. These are the three things that made your credit go down. If you ever get turned down for a credit card, they have to tell you these are the three or five reasons why we decided to turn you down.
The way this works is that if I have a monotonic response function, I can find the top of the function. Then I can see where you land on this response function, and I can just measure how far you are from the ideal person in each dimension. When my surface is monotonic, my decisions will always be consistent. I will never give a loan to someone with a lower savings account balance while turning down someone who had a higher savings account balance. I can make a strict cutoff, and I can always go to the regulator and go to my boss and say, I never gave a loan to anybody who had a savings account balance less than X, because that's what we all decided.
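A minimal sketch of that distance-from-the-ideal idea (the feature names, the 0-to-1 scaling, and the "ideal equals the best observed value" rule are all assumptions made for illustration): rank each applicant's shortfalls and report the biggest ones as reason codes.

```python
# Toy sketch of reason codes from a monotonic score: measure how far
# each applicant sits from the best observed value in every dimension,
# then report the dimensions with the biggest gaps. Names are made up.
import numpy as np

features = ["savings_balance", "years_employed", "on_time_payments"]
# One row per applicant, scaled 0-1 so higher is always better (monotonic).
applicants = np.array([[0.2, 0.9, 0.5],
                       [0.8, 0.1, 0.7]])
ideal = applicants.max(axis=0)   # the "top" of the function per dimension

gaps = ideal - applicants        # distance from the ideal person
for row in gaps:
    order = np.argsort(row)[::-1]            # biggest shortfall first
    print([features[i] for i in order[:2]])  # top two reason codes
```

Because the response is monotonic, a bigger gap in a dimension can only hurt the score, which is what makes these rankings defensible as reason codes.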
And if this doesn't fit the data?
He was asking, what if this just doesn't fit the data? I've thought about this a lot. Let's say everybody in my training data with a savings account balance of more than a million dollars is like a drug dealer; they just disappear in the night and never pay their loan back. According to my training data, I actually should not have a monotonic response. But the regulators and internal model validation teams want a monotonic response function, no matter what the data says. It just makes them trust the model more. With the neural network, you would still miss that place where maybe the predicted probabilities should go down, if everybody with a million dollars in their bank account is a drug dealer who disappears in the night. But with the neural network, you can get little stair steps in your monotonic response function. Maybe that answers your question a little bit.
This gets back, some threshold so far…
It's a business question, and it actually has an easy answer. I think everybody who can just say "f"-it and use a machine learning model without having to worry about explaining it has already done that, for the most part. What I'm showing here is for people who work in highly regulated industries, who are trying to find a compromise between interpretability and accuracy. They're constrained, and they don't pick those constraints; a regulator or their boss does. Another really interesting thing, and it will probably end up affecting the US: the EU has all these very strict data privacy regulations, and some have argued anti-black-box decision-making regulations, coming online in 2017 and 2018. Very soon in Europe, they're going to say no black-box decision making.
If this model impacts consumers, no black-box decision making. I assume there will be some trickle-down impact in the US, even though in the US we just like to sell everyone our personal data and let people make whatever decisions they want about us. But I think it can be startling. When I go in to work with banks, I say, just install this R package, and they say, we'll call you in three months, when IT has contacted the legal department, which has contacted the model validation office, and gotten this R package installed. I fully believe in regulation, and one reason I work on this so hard is to make regulation easier, because I do think these models are important and I want people to be able to use them in regulated industries. I'm not saying regulation is bad. It can just be very striking to go into a regulated environment if you're not used to working in one. It feels crazy; you're like, how can you even work like this? But these are good questions.
Let's move on to this last class of how we explain. Oh, and I should say, from an H2O perspective, this is something we're also working on. We want to build monotonic models. XGBoost, the famous gradient boosting library, will build monotonic gradient boosted ensembles, and I think you should see something similar from H2O soon.
I should explain again: those slides were about, what if I work in a regulated industry? What can I do? How can I make more accurate models? How can I take advantage of some of the research and advances that have gone on, both in statistics and linear modeling and in machine learning, over the past few decades, while retaining interpretability? These next slides are more for people who don't have those constraints, who can use whatever model they want but still want to explain it. I would say there's actually more than a regulatory need to explain these things. There's a human need to explain these things.
Presumably, you're here because you work in this field, or you're interested in working in this field, and you would like to be able to explain what you do to other people. I spent nine months training this model. Wouldn't it be nice to tell somebody how it works? I spent nine months training this model. Wouldn't it be nice if I could diagnose what happens when it inevitably makes the wrong decision? I think it's actually more important than regulation alone. I think it's very human just to want to understand what you do and to be able to communicate what you do. As these technologies impact our lives more and more, which I'm pretty sure they're going to, even people who don't work in the field will want to be able to explain them and communicate why certain algorithms behaved the way they did. I do think it goes beyond regulation to just a basic human need for understanding.
Alright, how do I explain these complex models? This is the oldest trick in the data mining textbook, but it's a good one. Here's my fancy neural network, my deep learning model, my convolutional neural network, whatever. I've trained it on these three inputs to predict when someone will have a bad loan, when they won't pay their loan back. To get some understanding of what's going on, I can train a new model that I call a surrogate model. I use the same inputs, but instead of the actual target, I use the predictions from the complex model. I train a simple model that I can understand, like a shallow decision tree or a regression model, on the inputs and the complex model's predictions. You can use the actual target too, but the point is to model the predictions. If I use a decision tree, I can see what the important interactions in my complex model are.
If I do a regression, I can say, what's the average behavior of the variables in my complex model? Maybe I do both, because both of those are interesting things. I would call this a global surrogate model: I train a model on the inputs and the predictions from a complex model to give me some global, approximate insight into what's going on in the more complex model. I think it's powerful to combine this idea with the next one, and don't get too distracted by the picture. That was a global surrogate model; now let's talk about a local surrogate model. A global surrogate model gives me an approximate, average understanding of how my complex machine learning function behaves.
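A minimal sketch of the global surrogate trick in scikit-learn (the random forest standing in for the "complex model" and the synthetic data are assumptions for illustration): fit a shallow tree to the complex model's predictions, not the real labels.

```python
# Global surrogate sketch: fit a shallow decision tree to the
# *predictions* of a more complex model to get an approximate,
# human-readable picture of what that model is doing.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=4, random_state=0)

complex_model = RandomForestClassifier(n_estimators=100,
                                       random_state=0).fit(X, y)
y_hat = complex_model.predict(X)  # surrogate target: the model's output

surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_hat)
print(export_text(surrogate))               # readable approximation
print("fidelity:", surrogate.score(X, y_hat))  # how well it mimics the model
```

The fidelity score is worth reporting alongside the tree: it tells you how much of the complex model's behavior the simple picture actually captures.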
I can essentially zoom in. This goes back to that theorem from real analysis I was talking about: if I zoom in close enough on my complex response function, it's basically linear. I can fit a linear model around that area and tell you almost exactly what's going on there. The idea is that I can combine a global, approximate explanation with a very local, more accurate explanation. I get a global model that tells me in general what's going on, and then I can zoom in and say, well, what's going on for my best 3% of customers? What's going on for my worst 3% of customers? What's going on for this person who had 170 ATM transactions last hour? The idea is to build a global understanding and then zoom in and get a local understanding.
This is LIME, local interpretable model-agnostic explanations, and it's a good paper to go look up. It's couched, or presented, in the context of image recognition, which for me at least isn't that helpful, even though I think it's a great technique for image recognition. The idea of LIME applies, though, to any sort of business problem, and the idea of LIME is just a local surrogate model. We talked about the global surrogate model on the last slide; now this is the local surrogate model. I zoom in to someplace on my complex response function where its behavior is basically linear and use a linear model to explain that one area of the complex response function.
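A hand-rolled sketch of the LIME idea (the real `lime` package does considerably more, including interpretable feature representations; the perturbation scale, kernel, and models here are assumptions): perturb one row, weight the samples by closeness, and fit a weighted linear model.

```python
# Hand-rolled LIME-style sketch: perturb one row, weight the perturbed
# samples by proximity, and fit a local linear model to the complex
# model's probabilities. Data and settings are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Ridge

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

x0 = X[0]                                        # the row to explain
rng = np.random.default_rng(0)
Z = x0 + rng.normal(scale=0.3, size=(200, 4))    # local perturbations
weights = np.exp(-np.sum((Z - x0) ** 2, axis=1))  # nearer = heavier

local = Ridge().fit(Z, model.predict_proba(Z)[:, 1], sample_weight=weights)
print(local.coef_)  # approximate local effect of each input near x0
```

The coefficients of the local model are the explanation: near this one row, this is roughly how each input moves the prediction.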
I'm going to skip this one and just come back and see if anybody has any questions about it. Alright, sensitivity analysis. This is an old one too, and there are probably some business school students in the audience who know it. This is super important, and I say that because I'm seeing a lot of companies transition from linear models to being brave and trying nonlinear modeling to get more accuracy. What I've seen happen, in very unfortunate circumstances, is that the data scientists or the analysts or whatever we call them go and build a nice new machine learning model that's more accurate, that they really like, that they spent a lot of time on. Then they take it to their model validation people, and they try to validate it like a linear model.
This is a total waste of time and leaves you open to incredible amounts of risk. You're essentially testing a bunch of things that don't matter and not testing a bunch of things that do matter. This goes back to ordinary least squares being a numerically brittle modeling technique. If I had correlated variables, I could get unstable regression parameters. But once my regression parameters were stable, I had very stable predictions. I could tell you what the prediction of my model was when someone made $50 a year and when someone made $5 million a year; it's just a plane, so I can extrapolate out. Even if those values weren't in my training data, I could extrapolate outside of my training data.
Now machine learning models are almost the exact opposite. Because they're highly regularized, they typically don't suffer from numerical instability; they're very numerically stable models. Their predictions can be highly unstable, though, especially outside the domain of the training data. If I don't have a $50-a-year income in my training data and I don't have a $5-million-a-year income in my training data, I actually have no idea what the model is going to predict for those values. It could predict something completely insane. That's the idea of sensitivity analysis applied to machine learning models, to increase both understanding and trust: I need to try to run any reasonable, or even almost imaginable, scenario through my model to get an understanding of how its predictions are going to behave.
Unlike linear models, once machine learning models go outside the training data domain, you basically have no idea what they're going to predict, and they can take a hard turn. We're giving people a loan, we're giving people a loan, we're giving people a loan, and then we get up to a savings account that has $10 million in it, and whatever it took to get the curve to fit the training data correctly also makes it take a hard right turn. It can do that because of the high-degree interactions. If several variables change at the same time, I need to understand what's going to happen. I think sensitivity analysis is probably the most important model validation you can do with machine learning models.
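A tiny sketch of sensitivity analysis in this spirit (the income feature, training range, and model are made-up illustrations): deliberately push values far outside the training domain through the model and look at what comes back.

```python
# Sensitivity-analysis sketch: push scenario values well outside the
# training domain through the model and inspect the predictions.
# Feature, ranges, and model choice are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
income = rng.uniform(20_000, 200_000, size=500)  # training domain
y = 0.5 * income / 200_000 + 0.05 * rng.normal(size=500)
model = RandomForestRegressor(random_state=0).fit(income.reshape(-1, 1), y)

# Scenarios far beyond anything seen in training:
scenarios = np.array([[50.0], [5_000_000.0]])
preds = model.predict(scenarios)
print(preds)  # there is no guarantee these are sensible
```

In practice you would script a whole battery of such scenarios, single variables and combinations alike, since the high-degree interactions are exactly where the hard turns hide.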
You need to try to anticipate how your model will change over time. You need to try to anticipate what values you could possibly get as inputs to the model, and test them. Again, this goes back to the idea that a lot of model validation techniques are based around linear models, which are basically about finding hidden correlation that could be causing unstable model parameters. Machine learning generally doesn't suffer from that problem; it makes very stable models. It's just that the predictions can be highly unstable.
Variable Importance Measure
I'll do one more slide, and then we'll be done. This is a really important one. I think most people know if you go get a decent implementation of any kind of decision tree or decision tree ensemble, you'll get a variable importance, global variable importance.
I can tell you this is the Titanic dataset from Kaggle. This was in the days of chivalry, when they let the women and children live, so you see that sex and age are the most important variables in this model. One of my students actually came to me recently having trained a model on this dataset, and this variable SibSp, the number of siblings or spouses they had on the ship with them, was ranking as the most important variable. I immediately said, you did something wrong, because I've just done this too many times; I know that age, sex, and class are the most important variables in this dataset. Global variable importance is good, but what I think is much more interesting and much more important is local variable importance.
Several techniques have been invented around this recently. I can now tell you the variable importance for each row. These are those turn-down codes, the reason codes we were talking about with the monotonic models. I can now give you the turn-down codes and reason codes for every single prediction that a machine learning model makes. The trick is they won't be consistent, in the sense that if I'm using a machine learning model for credit scoring, I might give a loan to someone who had a $5,000 savings account balance while not giving a loan to someone who had $5,010 in their savings account. When I allow a nonlinear response function, I can still tell you the most important variables for each decision. You just have to be okay with the fact that the decision boundary might not be perfectly clean.
If you're interested in this, look up LOCO, leave-one-covariate-out. It was a preprint paper; I'm not sure if it's been published in a journal yet, or if it even will be. The idea is I score my dataset once and get probabilities for each row, or predictions for each row if it's a regression problem. Then I go back and score the dataset again, one time for each variable in my dataset, each time setting that one variable either to missing or to its average value, median value, mode value, whatever you think will zero out the effect of that variable. In this way, I can rank, for each row, which variable impacted the prediction the most.
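A sketch of that LOCO-style scoring loop (the column names nod to the Titanic example but the data, the model, and the choice of the mean as the "zero out" value are all illustrative assumptions): neutralize one variable at a time, re-score, and rank the per-row shifts.

```python
# LOCO-style sketch: re-score each row with one variable neutralized
# (set to its training mean here) and rank variables by the shift in
# predicted probability. Data and column names are illustrative.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

Xa, y = make_classification(n_samples=300, n_features=3, n_redundant=0,
                            random_state=0)
X = pd.DataFrame(Xa, columns=["age", "fare", "sibsp"])
model = RandomForestClassifier(random_state=0).fit(X, y)

base = model.predict_proba(X)[:, 1]        # original scores, once
shifts = {}
for col in X.columns:
    Xm = X.copy()
    Xm[col] = X[col].mean()                # "zero out" this one variable
    shifts[col] = np.abs(base - model.predict_proba(Xm)[:, 1])

# Per-row reason codes: the variable with the largest shift matters most.
shifts = pd.DataFrame(shifts)
print(shifts.idxmax(axis=1).head())        # top reason code per row
```

Sorting each row's shifts gives the full ranked list of reason codes for that one prediction, which is exactly the per-decision output described above.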
What I'm trying to show here is that my original y-hat was a probability of 0.2. When I set the variable sex to missing, it changed my prediction to 0.01. When I set the variable age to missing, it set my prediction to 0.1. When I set fare to missing, it set my prediction to 0.21. In this way, I can see that sex impacted the decision for this row the most. I can get different variable importances for each row; I can essentially get those reason codes or turn-down codes for each row of the dataset. I think this is crucially important, because now I can go back, and for every decision that a machine learning model makes, I can say this decision was made because of, first, your age, second, the fare you paid, and so on. The person below you in the dataset could get totally different reasons. I think I'll stop there. I see one question, and we've talked for an hour, and that's plenty. Let's go ahead and have some questions.
Q/A With Patrick
I know him. I can find him later.
Some certain, not allowed to use?
That's definitely true. When I was an intern at a bank, because they couldn't use certain variables, like age, they bought a variable that was like a propensity to play tennis. There's some data vendor out there that, for every person in America, ranks your propensity to play tennis and sells that to banks so they can use it in their models. Of course, that's sort of violating the spirit of the law. People look for tricks to get around it, or they just build the best model they can using the variables they're allowed to use. Either one. Any other questions?
I think all of these things can be combined. My goal is to make a toolkit where you would be able to use it. We should talk later. If you mean mathematically, I guess I don't follow you immediately, but that certainly doesn't mean very much. One thing I should have said is it's important, and in fact, I think the title of the article is Mix and Match Approaches for Visualizing Data and Interpreting Machine Learning Models. It's important, especially the more complex the model, to use multiple techniques. Did you mean to mathematically combine them?
It might. I'll think about it. Immediately what comes to mind is that I really like this idea of combining them. LOCO is local, though the L stands for leave, not local, in this case. The decision tree surrogate would be global. Just right off the bat, I like this idea of using the decision tree surrogate model to get an approximate understanding and then using something like LOCO or LIME to get a local understanding. I immediately agree with that. I want to think about what you're saying about the mathematical combination a little bit more.
How many rows would you want at a minimum? Is there a similarity?
I've seen formulas for this before. I'm not sure, standing up here in front of people and being recorded, that I'm going to quote one. I will just say, if you do have wide data like that, it's really important to think about what model to use. In genomics research, for instance, there are professors at Stanford, some of whom are interested in cancer genetics, who may have datasets with thousands of columns but just 10 or 20 records. In fact, that's where penalized regression was invented. It's important to think about what model to use. I feel like I've seen a formula for neural networks; it was exponential, something like, if you have 10 weights, you need a hundred rows. I've seen things like that, but that was specifically for neural networks, and it wasn't about the rows and the columns, it was about how many weights you had in the model. It's a good question, and I don't have a very good answer. But I can say I've seen very smart people use penalized regression for incredibly wide and very shallow datasets.
Can you go back?
This picture? This very famous picture is from The Elements of Statistical Learning. There's a link right here, and you can always get a free PDF of it. Okay. The idea is that the Beta with the dot is the set of regression parameters without a penalty, no constraints on the model. That Beta would be whatever the coordinates were on the Beta1 axis and the Beta2 axis; that would be your Beta1 and Beta2 for your model. The concentric ovals going out around it represent the contours of the squared loss function. The circle represents L2 regularization, which helps with correlation. In the circle picture, I put a penalty on the squared sum of the Betas. It's a circle because I'm saying I'm not going to let the Betas go outside the circle, and the way I keep them from going outside that circle is by putting a penalty on the squared sum of all the Betas.
The squared sum of all the Betas can't be any bigger than some number. Then the parameters for the penalized model are the first place that circle touches the contours of the squared loss function. Where the circle touches the oval, that would be your penalized Beta1 and Beta2 for the final model. That's L2 penalization on that side. On this side, it's L1 penalization, and L1 is a constraint on the absolute sum, the sum of the absolute values of the Betas. That's why it's a diamond. Because of the geometric properties of the absolute sum being a diamond, or a cube in higher dimensions, the loss contours tend to hit corners first. That's the L1 penalty on the left with the diamond, and it tends to do variable selection, because you can see on the diamond side that Beta1 would be zero and Beta2 would have some value. It has essentially selected Beta2 in this case.
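You can see that same picture in code (the synthetic data, where only the first feature matters, and the penalty strengths are assumptions chosen to make the effect visible): the L1 penalty zeroes out the irrelevant coefficients at the diamond's corners, while L2 only shrinks them.

```python
# Sketch of the diamond-versus-circle picture in code: an L1 (lasso)
# penalty tends to set coefficients exactly to zero, while an L2
# (ridge) penalty only shrinks them. Data and alphas are illustrative.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only x0 matters

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
print("lasso:", np.round(lasso.coef_, 3))  # irrelevant coefs hit exactly 0
print("ridge:", np.round(ridge.coef_, 3))  # shrunk toward 0, not exactly 0
```

This is the variable-selection behavior the diamond's corners buy you, and it is a big part of why penalized regression stays interpretable.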
I don't know if that made sense, but you can read about it. If you don't have this book, you should get it. It's only like 900 pages or something. Oh, it's only 764. This is a really good book. If you work in this field and don't have it, I highly recommend it. You don't have to sit down and read the whole thing; I just kind of thumb through it when I need to know something. Let's see what those are. That sounds reasonable. That all looks good. This one looks good. Model assessment looks good. Chapter seven I would say is good. Chapters nine and ten look good. I would say seven, nine, and ten if you wanted to go farther. Other people might have other opinions. Any other questions?
No, you still have to add your interactions manually. The question was, does penalized regression help with interactions? The answer is not really. You still have to make your interactions in your dataset and then add them in.
So you do all these things to the model? Then someone says, well, I don't believe your model is right. You just get the answer you want?
This is an important thing. The question, and I think it's a general question, is, the way I took it, that you're doing all this stuff to the model. You're basically meddling with the model. Whenever you do that, to me, that's a certain kind of overfitting. When I put too much human knowledge into the model, that's a sort of insidious overfitting that's really hard to detect. You won't detect it until you put your model out and it flops, because you've been inadvertently sitting there, tuning your model, making sure it gives the best possible answer. Then when you put your model out there to work on new data, you're not there to help it. I think that's the essential idea. All I'm going to say is, be aware of that, and here are two good resources for it, written by smart people: one by John Langford, one of the inventors of Vowpal Wabbit, and one from Google Research, which tends to be pretty good too. They're both about tracking and dealing with this idea of, I basically got my hands in my model too much. That's this kind of weird overfitting, and how to avoid it.
You may have a scientific model that posits a certain relationship out there. When you fiddle with the model and fiddle with the model, it may look like you designed the model to confirm your theories. A third issue is from the statistician's side: he wants to know the statistical properties of your model. You don't have a model; you have you and the thing here. There's no procedure you can run, no simulation, to see how good that outcome is.
I would say, in many of these cases, I agree with you, and that's one reason why people will just continue using linear models. But I would say that this technique is actually pretty mathematically rigorous. A lot of the information can be backed out of it automatically and programmatically. I feel the same way about the LOCO technique. LOCO does have problems in the presence of correlated variables, and it does have problems if your data already has missing values in it, but other than that, I think it's a fairly rigorous technique. The reference on informal modeling and all these links are in the paper; that's why I keep referring people back to the paper. I agree, and I think that's one reason why linear models continue to be so powerful: linear models don't overfit. This is why I push people toward penalized regression so much, because it combines a lot of the best things about machine learning, a lot of the robustness of machine learning approaches, with the interpretability and direct elegance of linear models. I think your point is just yet another reason why people will continue to go to linear models for years to come. But if you're interested in the reference for the LOCO technique, which I think is pretty mathematically rigorous, this is the paper. It's a good point. That overfitting where you just massaged your model too much is very hard to find and very insidious. Any other questions?
Different things. The question is, what do data scientists at H2O do? I tend to do things like classes, and I tend to work more on ideas for new products and prototyping new products, and I do some consulting. Other people work almost directly with customers. Some people just help people get their Spark cluster set up and get Sparkling Water running on it. Some people work really deep in IT details, and others work really deep in algorithm details. It's just different. I would say in general, though, at H2O there are three kinds of people aside from the leadership. There are data scientists who do a mix of things: they might contribute to the product, work with customers, or do training. Then there are actual developers. You can go download H2O; it's free, but it's very, very complex Java code, and the actual developers tend to be very, very good Java programmers, not data scientist programmers like me. Then there are UI and visualization people. Let's go ahead and finish up, and if you have questions, we can hang around for a little while. I'm eventually going to walk over to the bar, so you're welcome to come with me. Thanks.