Document AI with LLMs
Mark Landry, Data Science & Product, H2O.ai
Ryan Chesler, Principal Data Scientist, H2O.ai.
We're going to squeeze in a quick talk about Document AI with LLMs. I talked about it a little on stage earlier today. What we're doing is essentially modernizing Document AI. We've been using a specific kind of formula that's a good method.
It results in token classification at the end, using small language models from before the big LLMs were available, essentially a BERT model fit to documents. But there are a few things we'd like to modernize. As I mentioned while we were getting set up, we've noticed with customers that a lot of documents are simple, or the LLMs already understand the important parts.
You can send a chunk of a document, ask "what is all this?", and a large language model will generally do a pretty good job. So that's what we're trying to look at. There we go. That was just the header slide, and now we'll go straight into the application.
So we are building the new version of Document AI to harness the power of LLMs. There we go. You're looking at a quick view of what we're working through. I've got Ryan Chesler and Shaqik Ghalib as well.
A few of the data scientists here have been experimenting with a specific technique, prompt tuning, which is what we're going to talk about. Ryan will quickly cover some of the internals of what we're doing there.
So I'm just going to flash this up on the screen. What we want Document AI to be is easy. It's fairly hard to go through all the processes we've been doing before: getting OCR done, bounding boxes, annotations, predictions, training models, all of that.
And we really want it to be as easy as LLMs are for all of us to use through chat. Zero-shot is really powerful here. Just try "what is the invoice title?" against what I've got here.
What's the company? What's the date? What's the address? What's the total? We've built this in for speed. I'm not even going to show you that, but we've made it really easy for people to type in annotations and just go.
This also means you can import labeled data from the back end, more like a labeled CSV. With our history, we weren't doing that, and it was very painful, so it's nice and refreshing to be able to just type in the labels here.
We start with the 10 receipts I've got up here. I've already run a model on five new ones, and also on the old ones, and we can see these predictions coming in, and they're pretty good.
It's already learned. We're enforcing the JSON output format here, and it's got the total right. I'm going to flip really quickly to one the model hasn't seen.
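A minimal sketch of what enforcing the JSON format can look like on the consuming side (the field names and helper function here are illustrative, not our actual product code): parse the model's response and coerce it to the expected schema, so downstream code always sees the same keys.

```python
import json

# Example labels, matching the receipt demo above.
FIELDS = ["company", "date", "address", "total"]

def parse_extraction(raw, fields=FIELDS):
    """Parse a model response that should be JSON and coerce it to the
    expected schema: every field present, missing ones filled with None,
    extra keys dropped."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Model drifted from the format; return an empty record.
        return {f: None for f in fields}
    return {f: data.get(f) for f in fields}

# A well-formed response keeps only the fields we asked for.
record = parse_extraction('{"company": "ACME Corp", "total": "42.10", "note": "thanks!"}')
```

The point is that even a mostly cooperative model can add extra keys or skip fields, so the schema is enforced on every response rather than trusted.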
It's seen 10 receipts like this. This one has a different structure, but the model's already got everything ready to go, everything we're looking for. We can flip back and forth, but what we're really testing on our team is some of the harder ones too.
Those are easy to go through. We've looked at driver's licenses, and those invoices had four targets. But what a lot of our customers are really battling is the difficult stuff with line items.
So what we're doing right now, the in-flight testing, is bringing in a lot of old datasets from our experience with customers and trying to see how far we can get with LLMs, generally with prompt tuning as the method.
What we're doing, which Ryan's going to talk about in just a sec, is enforcing the output format and extracting what the LLM knows. So I'm asking here for invoice number, invoice date, supplier name: very common fields we need for supply chain problems.
This is the real kind of work we've been doing with customers, trying to get the LLM to use its knowledge from everything it's been trained on. It's seen documents like this. It understands invoices.
We just really want to go the last mile of training it to output exactly the fields we want. So let me quickly turn it over to Ryan. Thank you. Yeah, so we've been working on Document AI for several years.
And we've got a process that works pretty well. But the biggest pain point has always been the labeling: someone has to sit there, annotate all the boxes, and say, here's the text we want to extract.
And so LLMs have been extremely powerful because they can do this stuff zero shot. And so you go and you just plug this document in and say, extract this information, extract that information. And very often, it will actually give you a decent result.
But it won't follow a schema that can be easily extracted, or it might not do something in exactly the format that you want. And so what a lot of people have figured out is that they can do prompt engineering.
So they go and they ask some very specific string, and they say, here's the question in order to get the result that I want. So you might start with something basic saying, just extract this field and that field.
And then it gets you something, but you can't really pull out the specific things you want from it. It's a string that you can read, and you can say, yeah, that's correct, but you can't plug it into some system.
And so you might go and find some other string and say, OK, turn this into JSON format; if a field is blank, say "none"; do all of these different things. You try chain of thought, all of these other tricks, to get better results.
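Those prompt-engineering tricks often end up baked into a template like the following. This is a hypothetical sketch; the function name and exact wording are illustrative, not a string we actually ship:

```python
def build_prompt(document_text, fields):
    """Assemble a zero-shot extraction prompt using the usual tricks:
    an explicit field list, a JSON-only instruction, and a 'none' rule
    for blank fields."""
    field_list = ", ".join(fields)
    return (
        f"Extract the following fields from the document: {field_list}.\n"
        "Respond with JSON only, one key per field. "
        'If a field is not present, use the value "none".\n\n'
        f"Document:\n{document_text}"
    )

prompt = build_prompt("INVOICE #1234 ...", ["invoice_number", "invoice_date", "supplier_name"])
```

Every one of those instructions is a guess about what the model will obey, which is exactly the fragility prompt tuning is meant to avoid.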
But you might still struggle. And so this technique we've come across, prompt tuning, basically replaces that hand-crafted string. You show the model some examples, run a sort of training procedure, and say, here's the input document and here's the format we want out, and it will just learn it. You don't have to figure out exactly how to phrase things to get the model to do what you want.
So prompt engineering is asking the right question to get the behavior you want from your model, and prompt tuning is kind of doing it by example. Sometimes it's actually easier to say, OK, I've already done this process ten times; just show it the output, and it will do its own optimization to try to get the right result.
Sometimes it's easier to learn by example than by direction. If you wanted to get the schema exactly right, you'd have to write a whole paragraph of text, and the model still might not listen to you. Prompt tuning is very good at actually following the format that you want.
So prompt engineering is that first layer: I'm going to optimize the words, just change the question that I'm asking the model. Prompt tuning goes one layer deeper.
So those words are actually turned into embeddings at the next layer, and those embeddings are a bunch of numbers. And so we can just vary those numbers in order to get exactly the right question that we want.
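To make "varying those numbers" concrete, here's a toy NumPy sketch, not our actual training code: a tiny frozen model, frozen document embeddings, and a small block of trainable virtual-token embeddings that are the only parameters updated by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # embedding dimension (toy size)
n_virtual = 4  # number of trainable virtual tokens

# Frozen "model": a fixed linear readout over the mean input embedding.
W = rng.normal(size=(d,))

def model(embeddings):
    return embeddings.mean(axis=0) @ W  # scalar output

# Frozen embeddings of the real document tokens.
doc = rng.normal(size=(6, d))
target = 1.0  # desired output for this task

# The ONLY trainable parameters: the soft prompt vectors.
soft_prompt = rng.normal(size=(n_virtual, d)) * 0.1

lr = 0.5
for _ in range(500):
    x = np.vstack([soft_prompt, doc])   # prepend virtual tokens
    err = model(x) - target
    # d(model)/d(each soft row) = W / total_rows; gradient of 0.5 * err**2
    grad = err * W / x.shape[0]
    soft_prompt = soft_prompt - lr * grad  # W and doc are never touched

final = model(np.vstack([soft_prompt, doc]))
```

The structure mirrors real prompt tuning: the backbone stays frozen, and only a handful of prepended embedding vectors move to steer the output toward the target.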
It might not even map directly to an English word. Like at the beginning here, we saw what the prompt tuning came up with: those are the closest words that map to the embeddings it found.
And it's not readable to us, but that's the internal language of the model. And so if you plug this in, it will actually get you much better coherence with what you're looking for. So we looked at all of the research, and this is kind of what we've stumbled on.
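The "closest words" readout can be done with a cosine-similarity lookup against the vocabulary embedding table. A small illustrative sketch, with a toy vocabulary and random embeddings standing in for the model's real table:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["invoice", "total", "date", "supplier", "the", "extract"]
# Stand-in for the model's frozen token embedding table.
token_emb = rng.normal(size=(len(vocab), 4))

def nearest_tokens(soft_prompt, token_emb, vocab):
    """Map each learned soft-prompt vector to its closest vocabulary
    token by cosine similarity, to 'read' what the tuning found."""
    t = token_emb / np.linalg.norm(token_emb, axis=1, keepdims=True)
    s = soft_prompt / np.linalg.norm(soft_prompt, axis=1, keepdims=True)
    sims = s @ t.T  # (n_virtual, vocab_size)
    return [vocab[i] for i in sims.argmax(axis=1)]

# A soft prompt sitting exactly on two known token embeddings decodes to them.
probe = token_emb[[0, 3]]
words = nearest_tokens(probe, token_emb, vocab)
```

In practice the learned vectors usually don't land exactly on any token, which is why the decoded words look like gibberish even when the prompt works well.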
And it has a couple of properties that we really like. Prompt tuning is very, very good in low-data scenarios. That demo is really compelling: you label 10, and on the 11th, it's already filled out in exactly the format that you want.
And then it makes your labeling process much, much easier, and it will continuously learn as you label more. And you can see there's this big jump from prompt design or prompt engineering all the way up to nearly full fine-tuning performance out of prompt tuning.
And then the other positive out of the prompt tuning is that since you aren't actually training the model, you're training the input that goes into the model, you don't have to swap out models a bunch.
So if you have one really strong deployment of a 70B model that can handle a whole ton of capacity, you deploy it once, and then you just pass these different inputs to it. You don't have to say, I fine-tuned my model, now I have to deploy a different 70B model.
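Operationally, that deployment shape can be sketched as one frozen model plus a lookup of per-task soft prompts (the task names, toy model, and `serve` function are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8

# One frozen model shared by every task.
W = rng.normal(size=(d,))

def model(embeddings):
    return float(embeddings.mean(axis=0) @ W)

# Per-task soft prompts are just small arrays: cheap to store and swap.
prompts = {
    "receipts": rng.normal(size=(4, d)),
    "invoices": rng.normal(size=(4, d)),
}

def serve(task, doc_embeddings):
    """Route a request: prepend the task's soft prompt, then call the
    single shared model. No per-task model deployment needed."""
    x = np.vstack([prompts[task], doc_embeddings])
    return model(x)

doc = rng.normal(size=(6, d))
out_a = serve("receipts", doc)
out_b = serve("invoices", doc)
```

Adding a new task is just adding another small array to the dictionary; the expensive 70B weights never change.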
And so we're really big fans of this prompt tuning technique, and we've been finding good results out of it so far. And I think that's all I've got. Great, thanks everyone.