Mastering Classification & Regression with LLMs: Insights from Kaggle Competitions
Speaker Bio
Senior Principal Data Scientist at H2O.AI | Kaggle Grandmaster
Philipp Singer loves applying his passion and skills in machine learning, statistics, and data science to enable data-driven decision making, and he is always eager to delve into new bleeding-edge fields and technologies. His top area of focus is deep learning and its applications to areas such as computer vision.
Philipp obtained a Ph.D. in computer science with honors at the Technical University of Graz, where he also finished his Master’s in Software Development and Business Management. He has published in seminal conferences and journals and has years of experience in applied machine learning.
Read the Full Transcript
Philipp Singer 00:06
All right, yeah, thanks for the introduction. I will not bore you more with introduction stuff about myself. But yeah, what I want to do today, or now, is to talk about classification and regression with LLMs, coupled with a few insights from Kaggle competitions.
Philipp Singer 00:25
And we already had a training session earlier talking a little bit about our recent LLM Science Exam competition results. But this one is kind of more general, about classification and regression with LLMs.
Philipp Singer 00:42
Yeah, we have heard today a lot of different things about use cases around LLMs. And usually we talk about generative use cases. So we usually talk about text generation, RAG use cases, labeling, agents, and so on and so forth.
Philipp Singer 01:02
But what I want to do now, close to the end of the day, is to take a step back and try to figure out whether we can also use LLMs for more classical use cases, no pun intended, like classification, which we have been doing for years, even decades already.
Philipp Singer 01:23
So what about classification? Simple text classification is something that comes up a lot in different kinds of companies and businesses, and I have observed many times that this is a very, very popular use case in general.
Philipp Singer 01:40
And what I want to explore now in this talk a little bit is whether we can also make some use of LLMs or GenAI in general for text classification. So think about things like sentiment classification, document categorization, spam detection, topic classification, language detection, and many, many more.
Philipp Singer 01:59
So you just have a bunch of documents and need to classify them into certain categories. The common way we have solved this over the last couple of years has been supervised training most of the time.
Philipp Singer 02:14
So meaning you have a bunch of labeled data, you train a model on the labeled data, and predict on the unlabeled data. Ten years ago, we were mostly doing these kinds of bag-of-words approaches, meaning building tabular data out of text corpora: building a large vocabulary, saying, okay, the word cat or dog appears 10 times or 5 times in this document, and building a tabular model on top.
Philipp Singer 02:40
Nowadays we would use a gradient boosted model on top, and that works decently well, but it has been largely replaced over the last couple of years by transformer models, as the transformer architecture, which is based on the "Attention Is All You Need" paper, is now also present in all the LLMs.
Philipp Singer 02:59
And these transformer models are not new; they have existed for a while. You might have heard of BERT models, RoBERTa, DeBERTa, which is kind of the best of these types of models. On the right-hand side here, we can see very simply how these pre-trained models are trained.
Philipp Singer 03:26
So a BERT model is trained with so-called masked language modeling, which means that in this large-scale pre-training, they are not trained like LLMs, as I will show on the next slide, but basically by masking random parts of the input, and the model needs to learn to predict these masked tokens.
Philipp Singer 03:49
So for a sample like "What is an LLM? An LLM is ...", with random tokens masked out, it would need to learn to predict the masked tokens, for example "an" or "LLM", basically. This means these are so-called encoder models, so they can look forward and backward, which is very different from GPT models, as shown on the next slide.
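As a rough illustration of this masked-token objective (not from the talk), a fill-mask pipeline from Hugging Face transformers shows how such an encoder fills in a masked position; the model name and example sentence are just placeholders:

```python
from transformers import pipeline

# Encoder model trained with masked language modeling (BERT-style).
fill = pipeline("fill-mask", model="bert-base-uncased")

# The model sees context on both sides of the mask and predicts the missing token.
for pred in fill("Paris is the [MASK] of France."):
    print(pred["token_str"], round(pred["score"], 3))
```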
Philipp Singer 04:09
So yeah, but these models are not really usable out of the box, which is maybe one of the reasons why they are not as hyped as LLMs nowadays, because you cannot do a lot with them out of the box, because they have been trained in this way.
Philipp Singer 04:24
So this is usually where transfer learning comes into play: you fine-tune them for a certain task, and classification is one of these tasks. Token classification, sequence-to-sequence, all these different kinds of NLP use cases work really well when we fine-tune these types of models.
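A minimal sketch of this transfer-learning step for sequence classification with the Hugging Face Trainer; the model name, CSV file, column names, and hyperparameters are assumptions, not the setup from the talk:

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Pre-trained encoder plus a freshly initialized 3-class classification head.
model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Hypothetical CSV with "text" and integer "label" (0/1/2) columns.
df = pd.read_csv("sentiment_train.csv")
ds = Dataset.from_pandas(df).map(lambda x: tokenizer(x["text"], truncation=True), batched=True)
ds = ds.train_test_split(test_size=0.2, seed=42)

args = TrainingArguments(output_dir="sentiment-deberta", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, train_dataset=ds["train"],
                  eval_dataset=ds["test"], tokenizer=tokenizer)
trainer.train()
```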
Philipp Singer 04:45
And the LLM way now gives us different opportunities. The first one, which is kind of new, is that it lets us do zero-shot classification. Why? Because these models have been trained in this generative manner, with next token prediction.
Philipp Singer 05:04
The example here, when we talk about the Llama model: it has been pre-trained on huge corpora of text, and these models have always been pre-trained to just do next token prediction. So it would only want to predict "is" as the next token, and then the next token, the next token, the next token.
Philipp Singer 05:22
And these are decoder-only models, which means they only look to the past and not to the future, which has advantages but also disadvantages. But the advantage is that this gives us zero-shot classification.
Philipp Singer 05:34
We will explore this a little bit, but we can also still fine-tune them, similarly to BERT models, for classification itself. In the rest of this talk, I want to go through a simple use case based on financial sentiment data, a public data set that can be found via Google, on Hugging Face, or on Kaggle.
Philipp Singer 05:57
It's a pretty commonly used sentiment data set which has, similar to tweets, short text snippets that give statements about sentiment, for example statements about earnings reports and so on, and they have been hand labeled as negative, neutral, and positive. So it is very classical sentiment classification, but not as easy as just negative versus positive, because this neutral class makes it a bit tricky. We are only looking at the samples with at least 75% annotator agreement, which gives us around 3,500 rows, and I split this manually into 80% training and 20% validation. Here on the right-hand side you can see a few samples: "the company's profit totalled 570k in H1 2007, down 30% year-on-year" was labeled as negative. Overall we have 62% neutral labels, 26% positive, and 12% negative. And what I want to do now: whenever we do data science or machine learning projects, we should start with a baseline and try to improve on it. The very simplest baseline would be the majority baseline, only predicting neutral, meaning around 62% accuracy in this case. How can we use LLMs now? What we could do is just ask an LLM, right? LLMs have this generative power, they are very good for these kinds of zero-shot tasks. So a call agent could just paste the content, a call transcript, an email, whatever, into an LLM and ask the LLM: what do you think is the sentiment, is it negative, is it neutral, is it positive? And it will give you a really good answer.
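For reference, a minimal sketch of that split and the majority-class baseline; the file name and column names are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical CSV with "text" and "label" columns (negative / neutral / positive).
df = pd.read_csv("financial_sentiment.csv")

# 80% training, 20% validation, stratified so label shares stay comparable.
train_df, valid_df = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=42)

# Majority baseline: always predict the most frequent training label.
majority = train_df["label"].mode()[0]
baseline_acc = (valid_df["label"] == majority).mean()
print(f"Majority baseline ({majority}): {baseline_acc:.1%}")
```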
Philipp Singer 08:02
This is actually the neutral example from the previous slide. And the 70B Llama model on h2oGPT gives you a really good prediction: "I would classify the sentiment as neutral", and it even gives you an explanation of why the model thinks this is the case.
Philipp Singer 08:22
But streamlining this is a little bit difficult. This is very useful for quick checking. So you can just go there, paste it in, get the result. But making this really into a business process is a little bit tricky.
Philipp Singer 08:37
The positive thing is no training is needed. We don't need labels. It is very easy to get started. And we have the side effect of having kind of interpretable results by these explanations. For example, in this case, the LLM really gives a nice explanation why it even thinks that this is neutral.
Philipp Singer 08:56
But the prompt engineering can be very tricky, yet very important, as we've heard throughout several talks today, which means some effort needs to be put into it. That also makes it difficult to automate, because the outputs can be very different, right?
Philipp Singer 09:15
So think of how you would automate this: you could check whether "neutral" appears in the output and then say this is neutral, but the output could also look quite different. It could say "it is positive and maybe neutral", or "out of positive, negative, and neutral, I think it is positive", right?
Philipp Singer 09:31
Or it might not say anything at all. It is very tricky to automate. There are definitely different tools, which I will not talk about now, that can constrain this output, maybe to a JSON format or a stricter format, and that can help there. But it is still runtime expensive, because generation is expensive, and you still need some kind of evaluation, because you still need to evaluate it in some way.
Philipp Singer 10:00
There is actually a quite neat, simple trick to streamline zero-shot a little bit. Because it is always just next token prediction, you could put in a prompt like: "Your task is to analyze the message below and predict whether it is positive, negative, or neutral."
Philipp Singer 10:25
Then the message, "Here is the return on the investment, blah, blah, blah", and then end the prompt with "The sentiment is". Now the LLM will try to predict the next token, and because we are steering it directly in this direction, the next token probability will hopefully land on "negative", "positive", or "neutral", because that's what an LLM does: it gives you an output probability for the next token over the whole vocabulary.
Philipp Singer 10:52
In Llama, for example, that's 32,000 tokens. So this distribution could look like: "negative" gets 20%, the next token probability for "positive" would be 10%, "neutral" also 10%, and many, many other tokens make up the rest. But it is still tricky to know which tokens carry the probability mass.
Philipp Singer 11:14
It could also be something like "pos" or some abbreviation like that. But it is still a very simple trick, and you can tune this a little bit with prompt engineering to get a good logit distribution here that you can use to streamline this automatically. It is also way faster, because you just need to generate a single next token and not the full output. You probably cannot read the code, but some people, like myself, prefer code, and in the end it is very simple to do: you take the input, tokenize it, forward it through the model, take the last token logits, then just look up the indices of the tokens "positive", "negative", and "neutral" in these logits and sort by that.
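As a rough sketch of what that code could look like with a Hugging Face causal LM (the model name and prompt are placeholders, and it assumes each label maps cleanly to a single token, which you should verify for your tokenizer):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

labels = ["negative", "neutral", "positive"]
# First sub-token of each label (with a leading space); check they are distinct single tokens.
label_ids = [tokenizer.encode(" " + lab, add_special_tokens=False)[0] for lab in labels]

prompt = (
    "Your task is to analyze the message below and predict whether its sentiment "
    "is positive, negative or neutral.\n\n"
    "Message: The return on investment was in line with expectations.\n\n"
    "The sentiment is"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Logits for the *next* token after the prompt, restricted to the three label tokens.
label_logits = logits[0, -1, label_ids]
pred = labels[label_logits.argmax().item()]
print(dict(zip(labels, label_logits.tolist())), "->", pred)
```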
Philipp Singer 12:10
So for the previous example, or actually a different example here, we would get logits for negative close to seven, neutral close to seven, and then eight for positive. So in this case it is correct: for this example it would sort it that way and give positive the highest probability.
Philipp Singer 12:31
So now if we do this exact approach, with exactly this prompt, not tuning it too much, on the same validation data set, using a 7B and a 13B Llama model, zero-shot, we can already get to an accuracy of 75% for both of these models.
Philipp Singer 12:53
I also tried a 70B model, which was not really better there. As I said, there is definitely a lot of prompt tuning you can do to improve this, but this is basically free: the model exists, you don't need to train it, and 75% essentially for free is already not too bad for such a use case.
Philipp Singer 13:19
And if you spend a few iterations on it, you can definitely improve such a use case further. But now the question is, can we even fine-tune it, right? If we think about this, the model just outputs a large distribution, and all we would need to do is map this logit distribution to just three outputs, right?
Philipp Singer 13:47
So in this case, we would only need to train another classification head on top that maps this output distribution over 32,000 tokens to negative, neutral, positive. So we can just attach another custom head to this whole model and fine-tune this.
Philipp Singer 14:08
And what already works well is to fix the LLM and just train another head on top. This could even be something simple like a logistic regression, or even a gradient boosted tree, where we learn, from several outputs of this distribution, how to properly map it to negative, neutral, positive.
Philipp Singer 14:31
That is instead of just hard-coding "I'm only looking at positive, negative, neutral", because there is probably signal in some other tokens in this whole distribution for the prediction. But we can also fine-tune the LLM combined with this head.
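Before moving on to full fine-tuning, here is a rough sketch of the frozen-LLM idea, reusing the hypothetical model, tokenizer, prompt, and train/validation split from the earlier snippets: extract the next-token logits as features and fit a logistic regression on them.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def next_token_logits(texts, model, tokenizer, prompt_template):
    """Frozen LLM as a feature extractor: one vocab-sized logit vector per text."""
    feats = []
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(prompt_template.format(text=text), return_tensors="pt").to(model.device)
            logits = model(**inputs).logits[0, -1]  # next-token logits
            feats.append(logits.float().cpu().numpy())
    return np.stack(feats)

prompt_template = ("Your task is to analyze the message below and predict whether its "
                   "sentiment is positive, negative or neutral.\n\n"
                   "Message: {text}\n\nThe sentiment is")

# train_df / valid_df as in the earlier split; model / tokenizer from the zero-shot snippet.
X_train = next_token_logits(train_df["text"], model, tokenizer, prompt_template)
X_valid = next_token_logits(valid_df["text"], model, tokenizer, prompt_template)

head = LogisticRegression(max_iter=1000).fit(X_train, train_df["label"])
print("Validation accuracy:", head.score(X_valid, valid_df["label"]))
```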
Philipp Singer 14:47
So we can just do regular LoRA training of the LLM combined with fine-tuning this head. And you have probably heard that there is a tool like H2O LLM Studio, and we have now added to this open source repository a new problem type which allows you to do exactly this classification use case. We also successfully used this in the LLM Science Exam competition on Kaggle. It is very simple: you just need to upload a CSV there, start an experiment to predict the classification label, and you can do the full fine-tuning and usually get pretty decent results out of the box.
Philipp Singer 15:38
Again, I did not do any tuning here, but I just ran a few experiments with a 3B model, a 7B model, and a 13B model on the training data set I mentioned earlier, and we very quickly got to something like 94%, 95% accuracy.
Philipp Singer 16:01
And here on the right-hand side you can see the runtime, 01:25 for the 3B model, and this is not hours, this is actually minutes. So this whole fine-tuning on 2,000 samples runs in minutes on a decent GPU, and even if it takes a couple more minutes on a smaller GPU, this is pretty doable.
Philipp Singer 16:22
Of course you need more labels, because you need to train it on some labeled data, but even for zero-shot you would need some labels, because you need to properly validate it. So, putting this into context, we get really good results here. This is probably already close to the theoretical limit, because larger models are not significantly better, but even with a 3B model, which is even smaller than the zero-shot models I had there, you can already get really good results.
Philipp Singer 16:57
My suspicion, as I mentioned in the beginning, was that when LLMs came out they would mostly be useful only for these generative use cases, and that for fine-tuning on classification, the BERT models would still heavily outperform them, because historically GPT-2, for example, was really bad when trained for these tasks.
Philipp Singer 17:22
But we have now seen over the last months, for example on Kaggle, that people actually train these LLMs for classification, and also for regression, with really, really great results. And they also have these kinds of properties where you can even cache certain parts of the input, because they are decoder models, so they have other advantages as well.
Philipp Singer 17:47
So, for example, we will have another session in a minute revealing the results of this competition, which is the community competition of this event, where the task was to predict which of the different LLMs a question-answer pair belongs to, so a classification use case.
Philipp Singer 18:09
And I submitted in the beginning, right after the competition launched, a benchmark for the competitors to benchmark themselves against. And all I did was just upload the CSV file to LLM Studio, run this simple baseline model, and submit it.
Philipp Singer 18:25
And even though I have dropped by now, to 11th, and this is not current anymore, it gave a good benchmark, and it was rank one for a week until people caught up. And this took me just an hour, literally pressing a few clicks and ending up with this model.
Philipp Singer 18:40
And also, as I saw earlier today, we will see the private leaderboard a little bit later, but in the beginning the people at the top of the public leaderboard all used, or most of them actually used, LLMs for classification, which really shows the usefulness of these models for classification.
Philipp Singer 18:59
Also, since the title here is classification and regression: we participated in another competition, which was actually a regression use case. And instead of a classification head, you can have a single head predicting a continuous output and do the exact same thing.
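As an illustration of what such a head could look like (a generic sketch, not the H2O LLM Studio implementation; the backbone name is a placeholder): pool the last hidden state of a decoder backbone, project it to a single continuous output, and train with an MSE loss.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class LLMRegressor(nn.Module):
    def __init__(self, backbone_name="meta-llama/Llama-2-7b-hf"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name, torch_dtype=torch.float16)
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)  # single continuous output

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Use the hidden state of the last non-padded token as the sequence representation.
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.head(pooled.float()).squeeze(-1)

# Training step (optimizer, LoRA adapters, etc. omitted):
# loss = nn.MSELoss()(model(input_ids, attention_mask), targets)
```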
Philipp Singer 19:20
And it also works really well. And the Science Exam competition, where we ended up first, was classification, and there we also exclusively used 7B and 13B models for classification, where the task was predicting whether an answer is correct or not.
Philipp Singer 19:41
And there we could also use all these caching decoder tricks that make this really, really useful. So, yeah, with that, that's my talk. Thank you, everyone. And if there are any more questions, we can have them later.
Philipp Singer 19:54
Thank you. Thank you.