LLM Interpretability | Kim Montgomery
Kim Montgomery, Principal Data Scientist and KGM, H2O.ai
Okay, well great to be here today. Somebody asked me if this was the quick easy topic since it was a 10 minute talk and clearly interpretation of LLMs is an easy topic. So I'll do my best with 10 minutes.
Yeah, taking from one of the training slides: it compares traditional AI, where you're doing supervised learning, to the newer gen AI foundation models. And the bottom shows a graph of what we're typically doing with supervised AI.
Generally you have a data set, maybe a tabular data set where there's something that you want to predict. It could be a hospital data set where you're predicting whether a patient is going to have a certain diagnosis.
And you have data on each patient and labels and you feed that into a model, maybe a neural network model, maybe a tree based model like a gradient boosting model. And you come up with a model that might predict how a similar patient to the training set would be diagnosed.
The newer gen AI models are a little bit different in that generally you're starting with NLP data and producing a novel output. I'll mostly concentrate on LLMs as my example of gen AI models, since that's what we've been talking about the most.
And one thing to note is that even though the gen AI models are built quite differently and perform differently than traditional models, in both cases we're dealing with really complicated, difficult-to-interpret models.
In the case of traditional models, you might have some ensemble of neural networks with thousands or tens of thousands of neurons, or you might have a tree-based method with 10,000 trees. So we're already used to looking at models in which it's very hard to decide what the model is actually trying to calculate.
With the Gen AI model, the main difference is that it's probably using a specific neural network called a transformer. And it's usually a much larger scale model than something that you'd use for traditional AI.
But in either case, you're dealing with a situation where you have a really large complicated model and you're getting a prediction out of that model. But it's not necessarily clear what the model is trying to calculate in order to make the decision.
You probably won't be able to see exactly what's going on. You may know what features are going into the calculation if you're doing supervised learning, but it may not be clear exactly what the model is computing across 10,000 trees or a huge neural network.
And that's the same for the Gen AI models. So it generally makes sense that we can apply some methods that we might normally use for our regular, large, supervised models to try to understand Gen AI models.
Yeah, and similar to supervised models, we also run into robustness issues. If you feed a model something that's not similar to the training set, you may get some strange results, which would be analogous to the hallucinations that we see for LLMs.
Model probing for supervised methods can be used to obtain information about the training set, which could be private. So that's similar to the jailbreaking phenomena that we see for LLMs. And certainly in both cases, the model output can be biased towards certain groups.
Yeah, so I'd argue that LLMs aren't completely special compared with other models that we've looked at. Another analogy between supervised AI and LLMs or gen AI is that in both cases we're interested in global properties of the model, such as how accurate the model is overall, what features it may be using to perform its calculations, and whether it's fair to different groups.
And so we want to have a general sense of how the model is performing globally. And we also want to be able to interpret a single response. We certainly would be interested in whether a prediction for, say, a patient in the hospital is correct or incorrect, what kind of information went into making that prediction, and possibly how the prediction might change if you change a few things for the patient, which would be an adversarial analysis.
And for a traditional supervised model, I just ran a supervised model in Driverless AI. And one example comparing global analysis to local analysis is that we can look at the feature importance for the supervised model and see which of the patient properties was most important in predicting the patient outcome, which was patient death in this case.
And this is globally which of the features were used the most across a number of different patients. But we can also pick a specific patient, say the first patient in the data set, and see which features were important for that patient compared with what was most important globally across patients in the data set.
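The global-versus-local distinction can be sketched in a few lines. This is a hedged toy example, not the Driverless AI computation: I use a made-up linear risk model over invented patient features, where global importance is a feature's average absolute contribution across patients and local importance is one patient's signed contributions.

```python
import numpy as np

# Toy setup: a hypothetical linear risk model over three patient features.
# The weights and data are made up for illustration only.
rng = np.random.default_rng(0)
feature_names = ["age", "blood_pressure", "heart_rate"]
X = rng.normal(size=(100, 3))     # 100 patients, standardized features
w = np.array([1.5, -0.5, 0.1])    # stand-in for a fitted model's weights

# Global importance: average absolute contribution of each feature
# across all patients in the data set.
global_importance = np.mean(np.abs(X * w), axis=0)

# Local importance for the first patient: that one patient's
# signed per-feature contributions to the prediction.
local_contrib = X[0] * w

print(dict(zip(feature_names, global_importance.round(3))))
print(dict(zip(feature_names, local_contrib.round(3))))
```

For a real tree ensemble you'd use something like permutation importance or Shapley values instead of weights, but the global/local contrast works the same way.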
So that's how we might go about analyzing a model both globally and locally for a regular supervised model. And we're interested in doing similar things in terms of interpreting LLMs. If we're choosing a specific LLM, we're going to want to know whether it's a good model for whatever application we're trying to perform. So we'd like to have data on how different models perform on average for our application. We want to know the percentage of time the model's going to hallucinate.
We'll want to know how frequently we're getting undesirable properties from the LLM, like toxicity, privacy violations, or whether it tends to be unfair to different groups of people. And in addition to having global measures, which allow us to pick the best model for whatever application we're working on, we want to have local measures in order to understand a specific response.
We probably want to screen out responses with negative properties, such as toxicity, something that tends to leak information that may be private, or something that may be unfair or create stereotypical responses about certain groups.
Another commonality between supervised learning and generative AI is we're interested in the accuracy of the results. And here I'm just showing a confusion matrix, showing the prediction of the model, zero or one.
And this is for the hospital patient data again. The predicted label is on the horizontal axis, and the actual label is on the vertical axis. So in the case of a supervised model, it's fairly easy to analyze the accuracy, because usually for the training set and the validation set, we have labels for the data.
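Since we have labels for a supervised model, the confusion matrix and accuracy fall right out of comparing predictions to labels. Here's a minimal sketch with made-up stand-in labels (not the actual hospital data), following the same layout as the slide: rows are the actual label, columns are the predicted label.

```python
import numpy as np

# Made-up stand-ins for the hospital labels and the model's predictions.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Confusion matrix: rows = actual label, columns = predicted label.
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

# Accuracy is the fraction of patients on the diagonal.
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(accuracy)  # 0.75
```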
So we can just compare what the model is predicting for different patients to the actual label for different patients. However, for generative AI, things are a little more complicated. I might have asked for more than 10 minutes actually.
It may not be the simplest topic. But the big difference for generative AI is you generally don't have the actual response that would be the perfect solution to compare to. So we need some way of deciding whether the model is hallucinating and just making things up, or whether it's giving us something that's a fairly correct answer.
And frequently, the way to do that is to have some reference source that you can use to compare the LLM result to some factual information. And H2O Enterprise GPT does a really good job of that.
I'll skip my example unless I really have extra time. But Enterprise GPT uses RAG, which other people have spoken about quite a bit, for the situation where you have a series of documents. And the nice thing about that is it not only gives you a response, it can tell you which documents it found the response in and which parts of those documents were responsible for the overall response to your question.
So that's one way of at least linking some factual information that you might be able to use to understand whether your model is hallucinating or not. And there are other sources of data that you can use to try to confirm your model.
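A very crude version of that grounding check can be sketched with plain token overlap: flag answer sentences that share few words with the retrieved passages. This is my own simplified illustration, not how Enterprise GPT works internally; a real RAG system would compare embeddings rather than raw words.

```python
# Hedged sketch of grounding an answer against retrieved passages:
# score each answer sentence by its best word overlap with any passage,
# and treat a low score as a cue to check for hallucination.
def overlap(sentence, passages):
    words = set(sentence.lower().split())
    best = 0.0
    for p in passages:
        passage_words = set(p.lower().split())
        if words:
            best = max(best, len(words & passage_words) / len(words))
    return best

passages = ["the patient was discharged after three days"]
print(overlap("the patient was discharged after three days", passages))  # 1.0
print(overlap("the patient underwent surgery", passages))  # 0.5, worth checking
```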
People have tried to look things up in Wikipedia to confirm model results and estimate the amount of hallucination. People have compared tuning data, so if you've retuned your model, you can see how using different tuning data might affect hallucinations for whatever application you're trying to look at.
It would be desirable to compare your LLM output to the whole training data set, but frequently for LLMs, the training data set is so incredibly large that that's just not possible at this point, but it would be great if we could.
And there's another interesting method that just checks self-consistency. There's the SelfCheckGPT method, where people actually try to understand whether a model's hallucinating by resampling the model's response with temperature greater than zero to get different responses.
And if you see something that's only appearing a couple of times, you can figure that that's not very consistent and it might be in danger of being a hallucination. Another thing that LLMs really lend themselves to in terms of analysis is a type of counterfactual analysis.
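The resampling idea can be sketched very simply. This is a hedged toy version of the self-consistency check, not the actual SelfCheckGPT implementation: `sample_answers` is a hypothetical stand-in for calling the LLM several times with temperature above zero, and the answers are invented.

```python
from collections import Counter

# Hypothetical stand-in for sampling the LLM several times at temperature > 0.
def sample_answers():
    return ["Paris", "Paris", "Paris", "Lyon", "Paris"]

def consistency_score(answer, samples):
    """Fraction of resampled answers that agree with the candidate answer.
    A low score means the claim only appears a couple of times and may
    be in danger of being a hallucination."""
    counts = Counter(samples)
    return counts[answer] / len(samples)

samples = sample_answers()
print(consistency_score("Paris", samples))  # 0.8 -> consistent
print(consistency_score("Lyon", samples))   # 0.2 -> possible hallucination
```

In practice you'd compare whole sentences with an entailment or similarity model rather than exact string matches, but the flagging logic is the same.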
For supervised models, a counterfactual analysis is when you change something, like blood pressure in the hospital data set, to try to understand how that change would affect the patient's outcome.
So for a supervised model, you might just change different features in the data set to try to understand how that changes the outcome of the model. And the really great thing about LLMs is there are just so many things you can change in order to understand how that affects the model's output. So, for instance, people have done a lot of work in prompt engineering.
You can look at different prompts, different instructions related to the prompts, and see how that's affecting your model output. If you want to look at fairness, you can do things like switch pronouns from she to he, and see how that is affecting responses on average.
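The pronoun-swap counterfactual is easy to sketch. This is a minimal illustration of the idea, with a made-up prompt; `her` is genuinely ambiguous (possessive vs. object), so a real fairness probe would need a more careful rewriter, and you'd average the model's responses over many swapped prompts.

```python
import re

# Simple counterfactual rewrite for fairness probing: swap gendered
# pronouns in a prompt, then compare model responses on average.
# Note: "her" -> "his" assumes the possessive reading; real text needs
# disambiguation between possessive and object uses.
SWAPS = {"she": "he", "her": "his", "he": "she", "his": "her"}

def swap_pronouns(prompt):
    def repl(match):
        word = match.group(0)
        swapped = SWAPS[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(r"\b(she|her|he|his)\b", repl, prompt, flags=re.IGNORECASE)

prompt = "She asked whether her application was approved."
print(swap_pronouns(prompt))  # "He asked whether his application was approved."
```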
If you're using RAG, you can look at how changing the context you're feeding the model affects responses. Yeah, the field of counterfactual analysis for LLMs is just a really interesting field.
Unfortunately, my 10 minutes probably isn't enough to cover that. Yeah, and I think I'll skip the rest because I'm probably done with my 10 minutes. But, yeah, interpretation for LLMs is really exciting.
I'd argue that many of the methods used for regular supervised learning are still useful here. But, of course, since we're dealing with more complicated unstructured data, we also need to develop new methods.
So I've probably failed at covering absolutely everything in 10 minutes. Thank you.