Return to page
h2o gen ai world conference san francisco h2o gen ai world conference san francisco

Building, Evaluating, and Optimizing your RAG App for Production

Jerry Liu is the co-founder and CEO of LlamaIndex. As the author of GPT Index, Jerry explores data structures used by GPT-3 for external data traversal. With a strong academic background, including a summa cum laude graduation from Princeton, Jerry's expertise spans applied AI, machine learning, and cutting-edge projects in autonomous vehicles.

Read the Full Transcript



Hey everyone, Jerry here, co -founder CEO of OneMundux. It's great to be here. Thank you for the invite to H2O. Today we're basically going to talk about how do you build a production RAG system? So basically use LLMs on top of your data to build something that's actually production quality. 



Maybe a quick show of hands. I'm just curious, how many of you guys watched OpenAI Dev Day yesterday? OK, the majority of you. That's great. So I submitted these slides before that. So if you have any questions, I'm happy to answer it. 



So yeah. So maybe at a high level, I guess the text is a little small. But we'll first talk about what are the enterprise use cases in GenAI? And as the year has gone on, ChatGPT was released around this time last year, and OpenAI Dev Day just happened, a lot of enterprises became interested in GenAI. 



And some of the core use cases include document processing and extraction, which is the left side. Conversational agents, knowledge searching QA, which is the middle section, as well as workflow automation, which is something that is kind of more of a promise and something that's been adopted into production. 



But the idea that LLMs can actually automate knowledge work across your entire pipelines without a human in the loop. So the majority of enterprise use cases these days probably focus around this idea of RAG, retrieval augmented generation, or being able to build some sort of search and retrieval agent of your data. 



OK. So if we look at what the current RAG stack actually consists of, we can kind of talk about that to start with. How many of you are familiar with RAG, by the way? OK, I feel like that's most of you. 



So we'll go through some of the basics, and then talk about where this potentially doesn't work, and basically the steps that you might need to take in order to actually think about productionizing these apps, make them reliable, performant, and can handle like a wide variety of edge cases. 



The current stack for building a QA system is that you basically have two main categories. You have data ingestion as well as data querying. Data ingestion is the data pipeline into your vector database that you need to set up to even get this up and running over your data. 



If you have, for instance, a bunch of PDFs, invoices, documents, CSVs, spreadsheets, whatever, and you want to transform it in a format that you can actually use, one basic idea is that you basically put it into this transformation pipeline that will chunk up your documents for you and put it into a vector database. 



And there's a lot of vector databases out there these days. The next step is that after it's actually in a vector database, you can query, retrieve, and query it. So you can, for instance, given a user query, do retrieval from the vector database, find the most relevant chunks, and then pass it to an LLM to actually synthesize the answer. 



This is RAC. This is basically what most people are doing. You can do this in five lines of code in LamaIndex. Oh, by the way, what is LamaIndex? LamaIndex is a data framework for building a prototype and also production RAG. 



So a lot of the concepts that we'll talk about in this talk, it's all fully captured in LamaIndex, right? So the idea is that you don't just build the simple stuff. We have the tools for you to build the robust, hard stuff. 



That's going to take a while. So that's the RAG stack. But let's talk about some challenges with naive RAG. One of the first items here is that as you go through and set up the system, you're going to run into a few different issues. 



If you've set up RAG, you realize a key culprit is actually bad retrieval. Given some sort of user question or user ask, like, oh, hey, I want to compare these two resumes together, or I want to synthesize information from part A of this document and part B of the other document, you'll realize you might not actually get back the relevant things that you actually want. 



And if the LLM doesn't have good retrieved context, it's not going to be able to synthesize the right response. So there's a few aspects of bad retrieval. There is low precision. Not all the chunks in the retrieved context are relevant. 



Leads to hallucination, loss in the middle problems. There's also low recall. So not all the relevant chunks are retrieved. So for instance, you might retrieve some set of items, but there might be some actual context that you should have retrieved. 



And if you don't retrieve that, then you basically just don't have the right context for the LLM to synthesize something. And then lastly, you might have outdated information. And there's other symptoms, too. 



But this is basically just like you're not getting back the results that you want, depending on the task at hand. The next aspect is that even if the retrieval itself is good, there is aspects where the LLM might fail. 



It might not generate the right response from you. And so there is a few aspects like there's hallucination. The model makes up an answer that's not actually in the context. If your context is too long, the model might not actually be able to attend to stuff that's in the context. 



There might be toxicity. the bias stuff that the model makes up an answer that's harmful offensive. So what can we do? What can you actually do to fix issues of bad retrieval and bad response generation? 



These problems, by the way, are mostly relevant if you've actually built RAG systems. So if you haven't, I'd highly encourage you to actually just go in, try it out. You can use the newly released OpenAI Retrieval API, or you could use Lama Index. 



See, build a basic RAG pipeline over your data, and see where it breaks, because that's where you really feel the pain points. You really want to understand this isn't working in this specific way. The main issue with optimizing your RAG pipeline is that there is a ton of these different aspects. 



There is data. There is embeddings. There is the retrieval part. And then the LLMP actually comes at the end. And if you take a look at most of the steps of this pipeline, it has nothing to do with the LLM. 



It's just like the parsing transformation strategy, the embedding model. the retrieval algorithm, all this stuff is kind of like existed either in an algorithmic way or part of like information retrieval before LMS have existed. 



And so I'll walk through some of these basic steps, right? To talk about some of these components that you can basically try optimizing for. And then, you know, like this might be relevant, like before you actually try anything, you need to define a benchmark. 



I have a very short section on just like, how do you think about evals? But we have a lot of resources in the docs about how do you actually do end to end evals and also retrieval evals. The main thing is if you're kind of like throwing spaghetti out the wall and trying out a bunch of different things, you at least want some sort of like reliable, quantitative benchmark to validate that whatever you're trying actually works and makes it better, right? 



Okay, quick note on evaluation. Evals are probably top of mind for a lot of people. Actually, you know, we've talked to a lot of enterprises and basically one of their key asks is, you know, this thing isn't working, how do I measure it? 



And then how do I improve it? So how do you properly evaluate a RAG system? First, you can evaluate in isolation. You can do an eval retrieval on its own. You can also evaluate synthesis on its own. You can also evaluate stuff end to end. 



So like given some sort of user query, go all the way through the retrieval synthesis pipeline until you get a final response and then try to run evals on the entire thing. Let's first talk about like evaluating retrieval because bad retrieval leads to bad results. 



And so a fair approach that you might take is, let's just spend some time like actually optimizing the retrieval algorithm itself. Make sure that's returning the relevant context. This, by the way, is not really something that is kind of like new, right? 



In the age of LLMs. But basically what you can do is, you can collect some sort of data set given a user query as well as ground truth retrieved like document IDs. You could run your retriever, your retrieval algorithm on this data set. 



Look at the predictor. the predicted rankings of all the documents, given the retrieval algorithm, and then compare the predicted rankings against the ground truth rankings, right? And these metrics, like, these are just ranking metrics. 



You can basically, given the ranked list of predicted items, see if the ground truth context exists in there, and you can measure that metric using stuff like MDCG, MRR, Precision .K, et cetera. There's also evals end -to -end. 



So you can evaluate the final generator response given the input, which I just talked about. It's a very similar process. You can have a query and response data set, right? You can generate this data set via, like, humans. 



Like, you can go in and manually just create this data set. In fact, actually, as a starting point, you probably just should do something like this, just allocate, like, some time right out, like, 30 questions, and then, like, 30 ground truth responses for this, and then just run your LLM algorithm over that. 



You can also synthetically generate it, and we have tools for allowing you to do that. Similarly, you run your full RAG pipeline over this data set, and you collect a bunch of eval metrics. And then you can basically measure the quality of the predicted response against the ground truth response, or by itself, and see if it exhibits stuff like hallucination, toxicity, bias. 



You can basically define a variety of different evaluation metrics on which to evaluate the quality of a predicted response. And a lot of techniques these days just use an LLM to evaluate other LLM outputs. 



OK, let's talk about optimizing RAG systems. I've had some version of this slide in the past, but I've since refined it to focus a bit more on the basic stuff. And so actually, this spectrum basically shows you how do you actually optimize your RAG system from very simple things to harder things to do. 



This includes what I call table stake strategies to try, like better parsing, chunking, prompt engineering, and customizable models. This includes more advanced retrieval techniques, so like structured retrieval, metadata filtering, embedded tables. 



This also includes fine tuning and agents. I added a few more sections onto just very basic things that you can try, and that you probably should try either manually or in automatic fashion to see if you can actually just bump up the performance of your RAG system, assuming you have an eval benchmark in place. 



First is chunk sizes. Tuning your chunk size can have outsize impacts on performance. So more retrieved tokens doesn't always equal higher performance. And of course, if you return a bunch of context and just try to stuff the L on with that, you might run into stuff like loss in the middle problems. 



You might actually get back the right response, even though the correct answer lies somewhere in the context. So there's usually a U -shaped curve from too little context to just the right amount to too much context. 



And this really depends on the quality of the L on. Another note is that re -ranking, if you just do re -ranking on its own, This typically you want to re -rank for retrieval based problems on its own but re -ranking in this context really just means you're reshuffling the order of the context that you're presenting to the LLM and That doesn't always lead to better performance because in the end you retrieve all the stuff You're gonna stuff it in the prompt window anyways And the LLM isn't always guaranteed to just like give you better results I might actually give you worse results just because you shuffled the order of the context Another aspect is prompt engineering. 



I've written about this in the past, but rag is really just prompt engineering It's just kind of defining it in a programmatic way The way rag works right is you retrieved a bunch of context and then you stuff it into the input context window of the LLM And so the outer template the shell that you use is usually some sort of like standard question answering template The text is maybe a little small to see but the template just looks something like you know You have a bunch like here's some context dash dash dash and then you stuff all your context in dash dash dash Here is my question and then input your question. 



Given the context and the question, please output the response. There's just some basic things you can do, right? To try to like customize this prompt template, that's very dependent on the LLM. Depending on whether you use like Lama 2, Zephyr, Mistral, or like a proprietary model like Clod, or OpenAI, there's gonna be different optimal prompts that you might need. 



We've gotten some feedback that all our stuff is kind of designed around OpenAI by default, but with a little bit of prompt engineering, it turns out you can basically make these templates work for a lot of different models. 



You just need kind of sort of like conditional prompt engineering, depending on the model. Some other stuff that helps, adding few shot examples to RAG. So basically adding a section where you show, here's an example of another question, and then here's an example of a correct answer to that question. 



You can basically demonstrate to the LLM what a correct response might look like. This might help in terms of like style, or in terms of like structured outputs, or in terms of other stuff. You can also, there's a cool paper that came out like last week about like emotions. 



Turns out if you just tell the LLM, this is really important to my career, it does better because you basically make it feel bad for you, right? And then it'll actually try harder to output a response. 



It's kind of interesting about the implications of this, right? I'm not gonna get into the philosophical argument, but if you just care about improving performance, why not? Yep. The LLM of course varies by quite a bit in terms of like how good it is on different types of tasks. 



Yeah, sorry, the screen isn't super clear, but what you see here is basically a giant table matrix where the rows show different LLMs from kind of paid LLMs, proprietary ones like OpenAI and Claude, to open source LLMs like Lama2 and Zephyr. 



Each LLM is basically ranked on easy to hard tasks. In the first column, it's like basic rag. In the middle, it's like taxisequals, structured outputs. And then at the very end, on the right, it's like agents. 



And so you basically have a spectrum from stuff that the LLM, like most LLMs should be able to do to stuff that many LLMs struggle with. What we've typically found is that open source LLMs do a lot worse in terms of one, structured outputs into a genetic reasoning, right? 



And fine tuning might be a way to get around that, but this is just something to keep in mind, especially if you're thinking about choosing a proprietary LLM or an open source one. Customizing embeddings, of course, also matters quite a bit. 



Don't take these numbers as is. This is just over a very simple data set with certain parameter settings. But we have a full notebook for you to basically just run a similar benchmark on top of your data with any settings you decide to use. 



There's a lot of embedding models out there. There is OpenAI. There's BGE. There is Cohere. There's Voyage. A lot of these just came out last week, or two weeks ago. A lot of these just came out last week, or two weeks ago. 



And what we did is we measured retrieval metrics over a data set on a variety of different tasks and also added in re -ranking in the columns. And so you can basically draw this matrix and the higher the number is, the better, right? 



And so you can take a look, see which ones actually fit your cost budget and then see which ones have the highest performance and measure it using the retrieval eval metrics that I just mentioned before. 



Both customizing the LLM and the embeddings have a huge impact on your performance. I mean, this is pretty obvious, but we have kind of a systematic section in the docs detailing the stuff. In the last three minutes or so, I'll just kind of briefly go over some more advanced things you can try. 



One general concept for advanced retrieval that we've been kind of developing conviction in is this idea of small to big retrieval. The main issue with chunking a bunch of text and then putting it into the LLM context window is that the same text is used for both retrieval as well as synthesis. 



And so it feels a little suboptimal because you're embedding like a giant text trunk, you're retrieving a giant text trunk, and you're sending this text trunk to the LLM. One thing we found that actually leads to better performance cross -bord, if you decouple the representation that you use to embed stuff and the thing that you actually use to feed to the LLM. 



So as an example, if you embed sentences, right, and then retrieve based on sentences, it actually better matches in terms of semantic similarity. So you can return more granular sentences when you actually call retrieval. 



But then what you can do is you can actually expand the context window around that sentence before you actually feed it to the LLM, right? So you have a bigger amount of context that you feed to the LLM, and this is a win -win in that the retrieval is better. 



You retrieve more relevant, precise pieces of context, and then the generation is also better because you actually have enough context for the LLM to synthesize a coherent response over. is roughly the same thing. 



And then maybe the last bit is just like, what is the role of agents in RAG? And the way we think about agents is just L -empowered knowledge workers. For us, we care a lot about how agent -to -greasing agent loops, anything people are building with agents and tool use actually translate into better search and retrieval and insights, and actually is kind of the graduation from basic RAG to more advanced RAG. 



An agent fundamentally is a reasoning loop coupled with tools, right? There's different ways of thinking about this. There's some different core components. You can use like a React agent, like the React loop that popularized by the paper that came out last year. 



We also have like, you can just do a function calling loop. Like if you use openAI function calling, you can just do a while loop until it like stops doing function calls. The assistant API that just came out yesterday, that's also basically a loop that handles under the hood. 



You give it some task and it'll figure out what to do under the hood. And you give it a set of tools. And throughout the process, as the agent is reasoning over the task that you give it, it can decide to call one of these tools, which is basically an API interface. 



What we found is agents are really just a layer of abstraction above a naive RAG system. And it comes with its own trade -offs and costs. You're basically adding an LLM at a layer right above where the RAG query layer sits. 



In the process, you get the benefits of agentic reasoning, chain of thought decomposition, and all that stuff. The risk is that it costs more money for you. And it can be less reliable, because you're adding more LLM calls throughout your system to basically try to reason over more information. 



We came up with an architecture on multi -document agents and showed that this actually allows you to solve a greater range of problems than what naive RAG can actually give you. Of course, it costs more money. 



If you're interested, I think the slides, OK, fine-tuning. I don't have time. There's some resources on production RAG fine-tuning. And most of this stuff is in the docs, too. We've written a lot about this on our Twitter and LinkedIn. 



And there's a lot of guides and tutorials across all these other topics. And I didn't get to cover fine-tuning, but there's some stuff on fine-tuning embeddings and LLMs. Cool. Thank you.