H2O GenAI World Conference San Francisco

Practitioner's Guide to LLMs: Exploring Use Cases & a Glimpse Beyond Current Limitations

Pascal Pfeiffer, Principal Data Scientist

AI-generated Transcript


Thanks, Rob. Hope this is my slide. No, it isn't. Okay, thanks. So I'm going to talk today about a practitioner's guide to LLMs. Through what we are doing on Kaggle and what we are doing with LLM Studio, which is open source, we get to see a lot of use cases that people are fine-tuning on and using LLMs for.



I want to give a quick introduction to what use cases you could use LLMs for, what the best practices are, what the current limitations of LLMs are, and how you can maybe push that boundary a little further than you're used to.



So every day we are hit by lots of news around LLMs, and it starts with people using LLMs for common use cases that would previously have been handled with different models; now it's just becoming way easier to use LLMs instead.



For example, you can even run a peer review process with LLMs, or doctors use chatbots to improve their bad bedside manners. And there are also some doom stories, like all the jobs are at risk now, and we kind of get AGI and AGI is taking us over.



I don't really believe that this is going to happen, and we really should understand large language models better: how they are working right now and how we can use them to improve what we are currently solving.



So how can you actually use LLMs in your company or even for your personal use? We have heard that sentence a lot: everyone needs their own GPT, everyone needs to use GenAI. But what can I actually do with GenAI?



We've seen the app store with some very neat applications, and hopefully you can all check it out now. There are actually three big use cases that I see as being the most important right now. If you haven't checked them out already, you should start exploring them right now.



The first one would be RAG, Retrieval Augmented Generation. Anything where you're storing additional facts or additional knowledge and you want to make that available to your employees, for example. This is where RAG comes into play: you feed additional context to the model to ground the answer on that corpus, as opposed to using the internal knowledge of the LLM.
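The grounding flow described here can be sketched in a few lines. The retriever below is a toy word-overlap ranker and all names are illustrative; a production RAG system would use embedding similarity and a vector store instead.

```python
# Minimal RAG sketch (illustrative names, toy retriever): rank documents by
# relevance to the query, then build a prompt grounded on the retrieved text.
# A real system would use embedding similarity instead of word overlap.

def retrieve(query, corpus, k=2):
    """Rank documents by word overlap with the query; return the top k."""
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

def build_prompt(query, corpus):
    """Prepend retrieved context so the model grounds its answer on it."""
    context = "\n".join(retrieve(query, corpus))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

corpus = [
    "H2O LLM Studio is an open-source framework for fine-tuning LLMs.",
    "The cafeteria opens at 8 am on weekdays.",
]
print(build_prompt("What is LLM Studio used for?", corpus))
```

The key design point is that the model is explicitly told to answer from the supplied context rather than from its internal knowledge.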



And we will see a nice talk about that in more detail later, so I'll head to the second, which is the summarization part. This takes a huge text and distills it into a smaller text. Inherently, this is something that LLMs are very, very good at, because they don't need to make up stuff and they don't need to use their internal knowledge to find new things.



They will just use what is already there, and they are inherently grounded by the problem type itself. You do need a very large context, but we saw that 30,000 tokens is now probably even the norm. So you can already add a lot of information to that context.



That can be full talks or almost full code bases even. The third part is a little bit more tricky because it involves expansion of your prompt. You have creative writing, but I believe you can very easily use it as a co-writer.



So generate smaller texts from a prompt, but use it iteratively: use some initial prompt to brainstorm together with your co-pilot, co-worker, LLM, and iterate from there. There is much more, and I'm going to skip through that very quickly.



So there is a writing helper, and we will have a very great talk about classification, maybe also touching regression, later from Philipp Singer. So LLMs are also taking over the original space of NLP deep learning models now.



Then there is function calling and coding, so GPT can help with your coding. And it also happens that agents are gaining more and more pace now, even though they are not really there yet, or they can't really solve very complex questions.



But it's a good start, and I think that it will only evolve over time. In the next years, we will see lots of things, also from the open source community, being added here. One of the best practices I would always recommend when tackling any of these use cases is to start with prompt engineering.



Like I said this morning, this is the very easiest way to get started with any problem, because you can just use any endpoint, for example the H2O GPT endpoint that we have running, or your self-hosted model on prem.



And you can just ask a question, first without any additional context, just using the internal knowledge of the LLM. You should be very detailed in your instruction, because the models don't really know what you're talking about, and if you provide more detailed instructions, they tend to give better answers.



The same applies if you ask it to act as a professional writer, for example: the model will do better than if you just ask it to write a blog post. So you really need to be very specific in what you're asking.
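As an illustration of that advice, here is a hypothetical vague prompt next to a more specific one with a persona, a length target, and constraints. The exact wording is invented for this example; the point is the contrast in specificity.

```python
# Illustrative contrast between a vague prompt and a detailed one.
# The persona and constraints are made up; pick ones that fit your task.

vague = "Write a blog post about LLMs."

detailed = (
    "You are a professional technical writer. Write a 300-word blog post "
    "introducing retrieval-augmented generation to business readers. "
    "Use a neutral tone, avoid jargon, and end with one actionable takeaway."
)

print(len(vague.split()), "vs", len(detailed.split()), "words of instruction")
```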



And in the second step, I would try to add context. This can be few-shot context, where you actually give it an example of what you need, and then the model will adapt to that style and to how you expect the answer to be.



Or it could be additional context, like additional information, which you may retrieve by any similarity search, like in RAG, or even manually. This would be the second step. And you can play with how much context you want to add here and whether it impacts the quality of the output.
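The few-shot variant can be sketched as follows. The helper and the example pair are illustrative, but the shape (example input/output pairs followed by the new input) is the standard few-shot pattern.

```python
# Few-shot prompt sketch: show the model one or more example pairs so it
# adapts to the expected style before completing the real query.

def few_shot_prompt(examples, query):
    """Format (input, output) example pairs followed by the new input."""
    parts = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

examples = [("The talk covers RAG, summarization, and creative writing.",
             "Three LLM use cases: RAG, summarization, creative writing.")]
print(few_shot_prompt(examples, "The speaker recommends starting with prompt engineering."))
```

The prompt deliberately ends with a dangling "Output:" so the model's continuation is the answer in the demonstrated style.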



As a third step, we would consider fine-tuning. And fine-tuning should always have a proper evaluation. You don't want to inspect manually after each experiment whether the output has become better or worse; you want some automated way to do that.



So spend some quality time on creating your own metrics. Think about what matters for your business case and build that metric first. It will always pay off in the end. Then start with small models, because you can iterate through the experiments much faster.
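A minimal sketch of such an automated metric, assuming reference summaries are available: token-level F1 between the model output and the reference. A real pipeline might use ROUGE, BLEU, or an LLM judge instead; the point is that every experiment gets scored the same way without manual inspection.

```python
# Toy automated evaluation: token-level F1 between a candidate summary and a
# reference summary. Higher is better; 1.0 means identical token sets.

def token_f1(prediction, reference):
    pred, ref = set(prediction.lower().split()), set(reference.lower().split())
    common = pred & ref
    if not common:
        return 0.0
    precision = len(common) / len(pred)
    recall = len(common) / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("AI empowers every business", "AI could empower any business"))
```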



And yeah, I think it's way underrated, but the data is super important in these cases. If you happen to already have some data, as in the example I will show in a second, that is super great.



But if you first need to make up data, prompt engineering and RAG will probably be much easier ways to get some quality results first. As the very last step, reinforcement learning from human feedback is one great option to align the model exactly to what you want, because here you can actually also penalize the model for wrong answers or for bad answers that you don't like.



That is not trivial with standard fine-tuning, so it's this special RLHF part. There are two main techniques, which are PPO and DPO. One needs an additional reward model, which must be trained on a lot of pre-labeled data from human feedback.



And the other one inherently has the reward model within the training pipeline. So you basically only run through a good sample and a bad sample, and you compare those two and subtract the logits.
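That "subtract the logits" step is the core of the DPO objective. A sketch, assuming we already have the summed log-probabilities of the chosen (good) and rejected (bad) answers under both the trained policy and the frozen reference model; the numeric values below are made up for illustration.

```python
# Sketch of the DPO loss: reward the policy for widening the margin between
# the chosen and rejected answers, relative to a frozen reference model.
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * ((pol_c - ref_c) - (pol_r - ref_r)))."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Policy prefers the chosen answer more than the reference does -> low loss.
print(dpo_loss(-10.0, -30.0, -15.0, -25.0))  # positive margin
print(dpo_loss(-30.0, -10.0, -15.0, -25.0))  # negative margin, higher loss
```

The reference model keeps the policy from drifting too far; no separate reward model is trained, which is the practical difference from PPO-based RLHF.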



So it goes from the top to the bottom, and you always add a layer of complexity. Make sure to always max out everything that you can from the layer above, which will also help you in the long run to learn more for your next step.



Business intelligence is actually one use case that I can think of for summarization, which should be very, very valuable for every company, because there is so much data coming in every day: logs, documents, demos, presentations, or even meetings, which can be recorded and transcribed.



And all of this always stays in a bubble within your team, and it's very hard to let others know about these things. Also, upper management usually gets all the information with quite a delay. So if you automate the summarization of all of these pieces together and bring this information to the teams, to other teams, or to C-level, this will help to identify overlapping work much, much quicker.



It will allow you to find new synergies, and I think it would really help to improve staff motivation, because you can provide very up-to-date information to everyone, and nobody feels out of the loop.



You could even personalize it in style. Maybe some people just don't want to hear about certain topics, so there could even be a reward model in the loop which filters out the best news for you.



And yeah, I want to jump into exactly this use case, not taking any internal transcriptions, of course, so I will use TED Talks instead. What we have here on the right is a transcript of a talk which is around 15 minutes long.



And it's talking about the same thing: Andrew is talking about how AI could empower any business and help any business here. And there is always a small summary or small introduction at the bottom, which obviously takes into account the transcript, the speaker, and what the talk is about.



So if you follow what I said initially, you start with prompt engineering and you just say, summarize the following talk from Andrew with the title "How AI could empower any business", and then you paste the whole transcript.



The result is probably not what you expect, because it is way too long. I didn't even read it, but this is not really what we expect here. So in the next step, we say, okay, let's do it in a single paragraph, and it will kind of shorten the output. But still, there is this first sentence which says, "Sure, here's a summary", and we don't like that either, so we want to get rid of it.



And this is iterative: you change the prompt and try to get the result that you want. This is sometimes very tedious, and it doesn't always work. It works mostly with larger models; here it's a 13B, which doesn't always follow the prompt.



With a 70B, it's probably a bit better. But if you use the same thing in a fine-tuning step, it is actually very straightforward, because in this case you already have the samples from the past.



So in the past, say, we already had the transcripts and the small pieces of context, the summaries. So we can just go ahead and fine-tune exactly in this style. And here I'm using LLM Studio, which is this open-source product by the Kaggle Grandmasters.



And it allows you to easily import these datasets into a no-code environment, and to train any of the open source models which are currently out there; one would be the Llama 7B in this case. You can select all the hyperparameters that you can think of: select which parts of the dataset you want to use for your input and which for your output, and select extra tokens for the prompt start and end, so the model learns where the instruction will start and where it will end.



So there is actually the knowledge for the model where to put the answer. Then we kick off an experiment and see that it is now queued; it will be started in a second. And yeah, LLM Studio can also queue up



multiple experiments, so if you run some overnight, that's totally possible. And then the question obviously is: what do you tune first? I mean, you saw, I don't know, 50 hyperparameters to tune, and you probably want to know what to do first.



And I would always go through the same steps. Try to get a good prompt first; use the ones that you maybe even got from prompt engineering, whatever works best for your case. Then evaluate often. You could even do that sub-epoch, so it's not always necessary to evaluate only at the end of an epoch, even if you're only training for one epoch; evaluate every 10% of your data, for example.



Tune the learning rate, because even with LoRA, that has quite a huge impact on the overfitting or underfitting properties of your run. The default in LLM Studio is 0.0001 with AdamW, which is usually a good first bet, but sometimes, especially with a lot of data or very, very little data, it makes sense to change that parameter.



Then spend some quality time on modifying the prompt. What is the model actually seeing here? Does it make sense to add additional context, additional data? In this case, we had the name, the title, and the transcript.



Maybe it makes sense to add even more, like the background of the speaker; it would make a nice story for the introduction. Then there is LoRA. I mean, this is very well known by now. By default, you usually have a very low rank for the LoRA matrices.
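A quick back-of-the-envelope sketch of why a low LoRA rank is cheap: instead of updating a full d-by-d weight matrix, LoRA trains two thin matrices of rank r and adds their product, so trainable parameters per layer drop from d*d to 2*r*d. The dimensions below are illustrative, not the actual Llama sizes.

```python
# LoRA parameter-count sketch: a rank-r update W + B @ A replaces training
# the full d x d matrix W. A is (r x d), B is (d x r).

d, r = 1024, 8  # hidden size and LoRA rank (illustrative values)

full_params = d * d          # parameters if we fine-tuned W directly
lora_params = 2 * r * d      # parameters in the low-rank A and B factors

print(full_params, lora_params, lora_params / full_params)
```

This is why increasing the rank is a cheap knob to try: even doubling r keeps the trainable fraction tiny, and only the evaluation metric can tell you whether it helps.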



But it could make sense to increase this rank for certain applications. There is no way to tell in advance whether a higher rank is better or worse; that's why the evaluation metric is so important. And then there is the very, very last step, which would require, I would say, a couple of experiments first.



Only then try to scale the model up or exchange the model for something else. So if you did the 7B, now you can go up to the 13B or even some other fine-tunes of Llama. And in the end, if you're interested in cost savings, you may even want to try to scale everything down to smaller models.



Maybe you're even fine with the results then, and with the drop in accuracy. So when we did that (this is still with very, very default parameters), we see the default learning rate and how the input is shown to the model, here with the prompt token plus the tokenized text.



We can even download and push the model, and evaluate the model within LLM Studio. So here it's loading the model on the GPU, and we are prompting with the sample that we just had: Andrew plus the title and the transcript.



So then, yeah, this is generating an answer, and I would say it is very much in line with what we saw in the example. It is very short, it gets right to the point, and if you go through the context, you can also see that it is actually referencing what is in the talk.



And from there on, just play with the hyperparameters that I mentioned and make sure to understand how the model can improve on these results. So what are the current limits that I mentioned in the beginning?



These models, especially if they are not used for summarization but rather for something like expansion or creative writing, are very prone to hallucinate. The models will just make up facts because they sound great.



And if you ask them about a coding exercise, for example, and you ask them to write a plot which does, I don't know, some transformation, they could make up a PyPI package just on the spot. This is very common with LLMs and one of the limitations that we are currently facing.



The second big one is the lack of reasoning, which we also touched on slightly earlier today. With their eloquent output, LLMs can easily fool us into thinking they are reasoning well, whereas they are only trained on text and try to make it sound just like the text in the training corpus.



So they don't really have a deep understanding of nature, of the physics and math of the world, and they especially don't have any deeper goals of trying to help someone. It's all just in the training corpus.



And there are a couple more issues: you have all the bias which comes from the training corpus already baked into the model, so you need to make sure that this is properly handled by you.



And these models may even struggle in certain specific applications. They have a very hard knowledge cutoff, so if you don't use any additional context, this may hinder your results. And one thing to consider if you're using these models for public applications is that they can easily be targeted by jailbreaking or prompt injections.



There are obviously guardrails that can be put in place, but there is no 100% solution yet. For hallucinations, I already mentioned a few pieces of advice which can be used to mitigate them.



So there are no 100% solutions to mitigate hallucinations completely, and they are sometimes even very, very hard to spot. The two most important things are to provide more context, to provide the actual context that is needed to answer, because that grounds the answer on this context.



It also helps to give a more general abstraction first within the prompt, to align the model to exactly this physics problem, for example, that you have. An easy option is to move to larger, better models, because they tend to hallucinate less.



But even this is only a workaround and only a byproduct of a better loss, the fact that they actually hallucinate less. Same with low temperature: here you're reducing the entropy of the model output, at the cost that it might be a bit more dense in its answers.
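The temperature effect can be seen directly on a toy softmax: dividing the logits by a temperature below 1 sharpens the distribution and lowers its entropy, concentrating probability mass on the most likely token. The logit values below are made up for illustration.

```python
# Temperature sketch: logits are divided by the temperature before softmax,
# so T < 1 sharpens the distribution (lower entropy, fewer unlikely tokens).
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax(logits, temperature=1.0))
print(softmax(logits, temperature=0.2))  # mass concentrates on the top token
```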



But yeah, some very good recommendations that Andrej Karpathy always shares are that you should always use LLMs in the domain where they are meant to be used, and probably better as a co-pilot than as autonomous agents, at least for now, or keep the agents in a very confined space or sandbox.



On the second one, about reasoning: there is this very nice theory about System 1 and System 2 thinking, which was developed for humans but can be transferred to LLMs. System 1 is this intuitive thinking where you don't really think about anything; you just have your answer right away.



And this happens for some easy math, like two plus two is four; you wouldn't even think about it. Or if you just drive on a road, you also don't think about how you hold the steering wheel and how much pressure you put on the pedal.



So it's always just very intuitive. And the second system is the one where you need to reflect: what are you actually doing, what are your long-term goals? You maybe need to get more information, because the information that you already have is not enough.



So you need to ask more questions. And this is completely absent in LLMs. LLMs are, in my opinion, purely System 1 tools, which can give you a very good intuitive result. But as soon as it is about deep understanding of something, you need to guide these models.



And there are a couple of papers on that topic. So apparently LLMs do have an internal state: if you ask LLMs to play chess, this works reasonably well, and one could think they must have some kind of reasoning.



But apparently it only has this internal state, which is updated from the steps before, and then it can make sense of that last step. It's still something intuitive. And that's exactly how chess grandmasters play.



If they are playing at a level of 1,800, they can actually move super, super quickly, and it's intuitive rather than a thoughtful process. And this is exactly how LLMs are currently working. So what you can do to nudge the models toward System 2 is to actually give them more time to think.



And this can be done by allowing more tokens, because the models always have the same time, the same compute, for each single token. So if you ask a very long or very complex math question and ask just for the final answer, it will probably get a wrong answer.



But if you give it more time by doing chain-of-thought or even reflection, the answer is more likely to be correct, because the model can plan out the path first and then go step by step through the calculation.
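As a sketch, the same question can be wrapped either way; only the instruction differs, but the chain-of-thought version gives the model tokens (and therefore compute) for intermediate steps. The prompt wording is illustrative, not a prescribed template.

```python
# Sketch of "giving the model more time to think": the direct prompt asks
# only for the result, while the chain-of-thought prompt requests the
# intermediate steps, each of which gets its own generated tokens.

def direct_prompt(question):
    return f"{question}\nAnswer with only the final result."

def cot_prompt(question):
    return (f"{question}\n"
            "Think step by step: write out each intermediate calculation "
            "before stating the final result.")

q = "A train travels 60 km/h for 2.5 hours, then 80 km/h for 1.5 hours. Total distance?"
print(cot_prompt(q))
```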



The same holds for decomposition. If you have a very complex question, which is maybe multi-hop reasoning, it can be cut down into smaller pieces, smaller hops, where the model is then able to solve each piece on its own.



And in the end, once all these pieces are written down by the model, it can again give a summary around these pieces and give the final answer based on the intermediate steps. The last one here, which I want to introduce, is step-back prompting, which does the same thing as what I said earlier about giving some kind of abstraction.



But here it can also be automated: every question, every prompt that is given to a model can get an abstract introduction first. So if you're asking about a very specific problem or riddle in physics, it can first ask about the background, which physics law should apply here, etc.
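A step-back flow can be sketched as two chained calls. The `fake_llm` stub below stands in for a real inference endpoint, and its answers are hard-coded purely for illustration; only the two-call structure is the point.

```python
# Step-back prompting sketch: first ask an abstraction question ("which
# principle applies?"), then feed that answer back as context for the
# original, specific question.

def step_back(question, llm):
    principle = llm(f"What general principle or law is needed to answer: {question}")
    return llm(f"Principle: {principle}\nUsing this principle, answer: {question}")

# Stub LLM for illustration only; a real call would hit an inference endpoint.
def fake_llm(prompt):
    if prompt.startswith("What general principle"):
        return "Conservation of energy."
    return "Answer grounded on: Conservation of energy."

print(step_back("How high does the ball bounce?", fake_llm))
```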



If all of this is generated first, possibly in an iterative manner, it will lay out this context again, and the final answer will be way more likely to be correct. By using all of these tricks, I think you should be much better prepared when working with LLMs and solving the current tasks that you're facing.



And if there are any questions, please come up to me and I'm very happy to answer. Thank you.