ON DEMAND

Getting Started with H2O Document AI

H2O Document AI makes highly accurate models by using a combination of Intelligent Character Recognition (ICR) and Natural Language Processing (NLP) to leverage learning algorithms for optical character recognition (OCR) and document layout recognition. In this webinar, we'll give an overview of H2O Document AI, demo the product, and talk about successful customer use cases.

3 Main Learning Points

What H2O Document AI is and how it works
How H2O Document AI is different and the new use cases it opens up
How customers are succeeding with their use cases

Read Transcript

So what we're going to talk about in about an hour here: a quick product overview, showing you what the tool is designed to do and a little bit of how it does it, and then we'll spend some time actually looking at it. We have a product demo that covers a few things. We're going to walk through how we look at documents as datasets, and the different ways we manage those inside our UI. As you'll see, our tool does some annotation and helps you create the labels so we can get targets, which is actually not common with documents.

And so we built that into our tool. Then the modeling, which is probably what a lot of people are here to see, and the feedback: some of the insights we've taken as we've stood this up for a lot of different use cases, both on our own and with customers. Then publishing a pipeline, which is how you get the job done, and then we'll save some time for Q&A.

Okay, so a short Document AI overview. This is kind of a process, starting on the left. You see our capabilities on the bottom, and the tool that we'll talk about is in the middle. On the left: documents, images, documents as images. This is typically going to be PDFs or images, with cell phone cameras capturing so much these days. You might get a clean PDF from the internet, or from anywhere, where you can already copy and paste the text; we're built for that. We're also built for cases like what you saw when you got in here. UCSF has faxes to this day, 1.4 million faxes, multiple pages each, and they come in as images from fax machines. So we can't just read the text. It's still a PDF wrapping up multiple pages, but they're really just big images coming in through the fax. And then images of all different types; we're working on Word documents, things like that. So we take documents on the left, and that's where we'll start.

Then we get into the process that you see, and we'll talk about some of this here. There's a little bit of preprocessing, and more and more of that is built into the tools; we use some really good optical character recognition models these days. Labeling is a big part of it. But most people are going to be trying to train a model. And "models," plural, is actually correct: it's very frequent, half of our customers use two models in the same pipeline.
And then some post-processing: taking the results of those models and bundling them for the specific actions that our customers want to take. And then deploying that, which is going to be the last step that we look at today: how to get results out of this. Our software was originally built for data scientists and annotators, the people creating training data, those two camps. But the intent of our tool is to be simple enough that we can train other people who are familiar with the data problem, but are not necessarily data scientists, to operate it as well.

But deploying is where we really unearth the value. So build a model, build a pipeline, and deploy for use cases that are typically things humans are doing today. I think I can say all the use cases we have are something that humans are doing today, or something where customers are looking at humans as the other way of doing it, because we can unlock some of the value that the template machines, the template processes, don't. We'll talk a little bit about that as well. But deploying is really what we're aiming for here, and then consuming. On that side, we've got a pretty simple REST API for deployment. That means you can consume it where you want: you can throw it into a data store, build a user interface on top of it, or integrate it with business applications and other tools, and we've got customers doing every one of those.

Down on the right, you have human-in-the-loop review and correction. That's a pretty big part. We've been asked a lot of questions about it, and we built it into the application so that you can keep a pipeline going. You can stand up a model and then make it better and better as you collect more documents. And we can annotate from scratch, or we can annotate from predictions.
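As an illustration of the human-in-the-loop review step just described, here is a minimal Python sketch of routing predictions by confidence. The `Prediction` record, its field names, and the 0.8 threshold are assumptions made for this example; the talk does not show the actual response format of the H2O Document AI REST API.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    # Hypothetical record shape, not the real API response.
    label: str         # predicted class, e.g. "patient_dob"
    text: str          # extracted value
    confidence: float  # model confidence in [0, 1]

def route_for_review(predictions, threshold=0.8):
    """Split predictions into auto-accepted and needs-human-review lists."""
    accepted, review = [], []
    for p in predictions:
        (accepted if p.confidence >= threshold else review).append(p)
    return accepted, review

preds = [
    Prediction("patient_dob", "01/02/1960", 0.97),
    Prediction("referring_physician", "Dr. Smth", 0.55),
]
accepted, review = route_for_review(preds)
print([p.label for p in accepted])
print([p.label for p in review])
```

Corrected items from the review list could then be fed back in as new annotations, which is one way to read "annotate from predictions."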
And so that's a key component of what we built here. And then running up the middle, we have self-service labeling, as I talked about: labeling, validation, refinement. So, annotation; there are a lot of different words for this. If you're familiar with tabular datasets, it's very typical, not always, for sure, but typical, that we have the results collected in some data store somewhere, so we know what the answers are. Someone defaults on a payment, and over time we'll figure that out. So it's very common that we have all of our targets collected. With Document AI, most of our customers do not. So annotation is a big part of our

training user experience. There are two different model types that we focus on. We can do a few more, but these are the two basic ones, and they're common. I'll show you on the next slide. One is entity extraction, which is probably what most people think of when they think of this, and I'll explore that in a little more detail. The other is page classification. And then the integrations, like I talked about.

So let me show you what we're talking about. This is an example. This is from Bob Rogers at UCSF again, if you saw the video beforehand, talking about medical referrals, which was actually our first use case with this product. We had gotten two or three years of experience prior to that, working hand in hand with some customers as we developed the product, but the first time we really used it was with UCSF on their medical referrals. What you see on the screen is a simplified version. It's a clean output; their faxes are not as clean as this. And there are actually a number of different things they're going after. But here we've got an anonymized physician referral coming in. This is a physician group sending referrals from one practice to another, and these are coming into UCSF. That's important: UCSF doesn't control this form. Right here, it looks like this one comes from a health medical center. But all the different facilities around the country are sending these in, usually on their own paper, in their own format. So UCSF doesn't really control the format, and we're not talking about templates. That's really a big key to where the strength of H2O Document AI is. Most of our customers have that situation: they get dozens, hundreds, thousands of different formats of the same document type.
So we're talking about a medical referral, but its form is going to shift as you see different ones coming in; we don't just have one format. What we're doing here, as you can see, is color coding what we're trying to find, here on the document and over on the right. So on the right, date of birth: we've got the value coming in, and we're predicting patient DOB with high confidence. And we can also show you where it is in the form. So we're going to take this document and, page by page, look for a specific schema, if you will: a bunch of classes that you've thought about and want to pull from this medical referral. Like I said, when we started talking with UCSF, they had already figured out there are 116 things they want to find in this document. This is typical data science: looking at your use case and asking what we need to take out of this document so that we can digitize it and act on it. Typically, the smaller the schema the better, because it's easier to label. But these are deep learning models, and unlike GBMs or other typical tabular models, they can handle 116 or even 1,000 classes just about as efficiently as they can five or six.

So we're going to look specifically at a schema that matters. We're taking these referrals and pulling specific information out of them. And like I said, we can start with one document here. If we always looked at this specific format, that would maybe be pretty easy. But then it's going to shift, and here's a different one, a totally different format. If we compare these, bouncing back to the one before, they're each forms with similar content, but it's not going to be exactly the same content. Out of those 116 things they're going for, you only find maybe 30 or 40.
On a given document, there are lots of variations in what they're looking for. And this is where H2O Document AI really has a strength. We've aimed at this target since day one, because we had a similar bank confirmation use case where all sorts of banks were sending documents in to BTC. When we did that, we thought very clearly about the schema and about using models, and we've upgraded this over time. Where we're at now, we're using advanced deep learning models; we're using transformer models, for those of you familiar with that. The ones we use are high-end NLP models. So at the base we're using a natural language processing model, very similar to BERT, for those who are familiar with that. But the big difference is that these documents don't read in typical reading order, the paragraph-and-sentence kind of format, top down, left to right, or whatever the reading order is in different languages.

The structure of the documents is the way that humans laid them out, usually to be concise, with tables and forms. Here we have left to right, and if we look here, we have a little bit of top down. You can see "outpatient psychiatric clinic," and if we just read this left to right, we get "fax completed form to" and then it goes to "date of referral." And we're going to make a mess of some of this: the address information on the left is going to start getting in the way of contact phone, things like that. So the models that we use are basically a BERT model, a high-powered natural language processing model, still state of the art. There are different flavors, but the BERT algorithm is there, together with key information about the pixel location of each token on the page. And when we use the right models, we see evidence of them understanding the structure of documents. You see the model be very refined about understanding left-right table structure, so that it knows the column header is at the top and reads an entire column top down. That really helps the model classify each piece of information from those tables.

Also, we're using transfer learning when we do this. The LayoutLM models that we use, and the other variants, are all pre-trained. This one was actually trained on 11 million documents. So it has seen the structure of documents, and difficult documents, in fact, not as clean as what you see here on the screen; I'll show you some of those in the demo. So, transfer learning: for the data scientists out there, that's a really key factor. We're not starting from scratch. We've got a pre-trained model that has seen 11 million documents before, and that's where we're always going to start, which allows us to start fast. And so we can also do the other formats. And then we can also see different document types.
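To make "BERT plus pixel location" a little more concrete, here is a small Python sketch of how token positions are commonly prepared for layout-aware models. It assumes the convention used by the publicly released LayoutLM family, where each token's bounding box is rescaled to a 0 to 1000 grid regardless of page size. The page dimensions and token boxes below are invented for illustration, and the talk does not describe H2O's internal preprocessing.

```python
def normalize_box(box, page_width, page_height, scale=1000):
    """Rescale a pixel box (x0, y0, x1, y1) to a 0..scale grid."""
    x0, y0, x1, y1 = box
    return [
        scale * x0 // page_width,
        scale * y0 // page_height,
        scale * x1 // page_width,
        scale * y1 // page_height,
    ]

# Hypothetical OCR tokens from an 850 x 1100 pixel page.
page_width, page_height = 850, 1100
tokens = [
    ("Date", (85, 110, 170, 132)),
    ("Phone", (425, 110, 510, 132)),
]

model_inputs = [
    {"text": text, "bbox": normalize_box(box, page_width, page_height)}
    for text, box in tokens
]
print(model_inputs[0])
```

Each token then reaches the model as its text plus a page-size-independent position, which is what lets the model pick up on columns and table structure rather than plain reading order.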
So that's a physician referral. We see more and more use cases with customers, and to some extent we're not scared of anything: bring in a specific document type and we'll take a look. There are different pros and cons with some of these. This one in the middle, the death certificate here, is going to be a little more difficult to read, but the OCR that we're using is going to be up to that task. We can handle handwriting; the accuracy will be lower, as with everything, but we do use models that can handle handwriting. We're working on other languages in our OCR toolkits. So, different document types, and the models can handle different formats within those document types, which is what you'd want. Typically, if we look at this invoice, death certificate, and receipt, we'd want a model for each document type. You don't have to; you can probably get away with an invoice model for receipts, we've seen that. But typically you want a single model per document type.

Okay, so now we'll look at the tool, and we'll continue to lay the foundation. I know I went fast through what we're talking about here, so some of that will settle in. What we're looking at was created to work with UCSF. We had a back-end data science library; I'm not really going to show that or the API. We're going to see a UI that sits on top of it and operates it. So in the next half an hour we'll take a look at how we structure this problem, and show some examples of how we walk through it. What I'm looking at is in our cloud, the H2O AI Cloud. We're using a cloud that's internal to H2O to demonstrate this, where we have a lot of projects, but I'm going to try to create one from scratch. These jobs will actually run pretty quickly.
But we're also going to bounce back and forth to some seeded projects, because I'm not going to annotate everything live. So we're going to pick up in the middle of some in-flight projects here. It's a very simple UI, like I said. A project can be a document type, or it can be anything you really want. So let's go ahead and create a new project.

And I can get started. Creating a new project from scratch, I can upload my files here if I'm ready. A lot of people are eager and want to do that. We can also upload them later, so I'm going to go ahead and upload them later, and I'll show you a little bit why. So we've created our project, and it's live. Now I'm going to upload my document set. First I find it; here we go. I've got just two files here that track with a project we've already set up that has more documents. Typical kind of things: we can name it here. "Copy attributes from" is for when we have a mature project, so I might touch on that a little bit later. For now, let's just get started and upload the documents.

Let's see what showed up there. I've got the small batch, with just two documents in it, just for speed. It won't take long for this upload to process. It's going to track with pretty much what we have here in this other project. Same thing, different project, a different document set. This one has 21 documents, 87 pages. So there's the first clue: we have some PDFs in here, documents that have multiple pages. It's very common that we get a mix of all sorts of things from customers, and I can mix within this document set. It's a zip file, and that's a key piece to learn about where we're at right now with Document AI, because it throws people off. We're looking for one file when we upload the document set. It says that in here, and we'll look to make that a little easier to find. A lot of people try to upload a whole bunch of files at one time. So wrap them up in a zip file, and we can load multiple documents at the same time. So let's go back here; it's probably loaded by now. Here we go.
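Since the upload expects a single file, a mixed batch of PDFs and images has to be wrapped in one zip first. A minimal sketch using Python's standard `zipfile` module; the file names are placeholders created in a temporary directory, and the flat archive layout is an assumption rather than a documented requirement.

```python
import tempfile
import zipfile
from pathlib import Path

def bundle_documents(paths, zip_path):
    """Write the given files into one zip archive, flat (no subfolders)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in map(Path, paths):
            zf.write(path, arcname=path.name)
    return Path(zip_path)

# Demo with two placeholder files standing in for real documents.
workdir = Path(tempfile.mkdtemp())
files = []
for name in ("estimate_1.pdf", "estimate_2.png"):
    f = workdir / name
    f.write_bytes(b"placeholder")
    files.append(f)

archive = bundle_documents(files, workdir / "document_set.zip")
with zipfile.ZipFile(archive) as zf:
    names = sorted(zf.namelist())
print(names)
```

The resulting `document_set.zip` is the one file that would be selected in the upload dialog.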
And let me also show you what happened. I kicked off a job, so this is live. We've got one that failed, because I loaded two of these; I tried to practice just before this, apologies for that. So we kicked off a job live, and you can watch these. This is an important part of how we've set up the stack. We use an internal tool, ML API, to kick off Kubernetes jobs to do this. So the product is designed for scale and designed for cloud, and each of these jobs is independent. We can load multiple at a time, and they're intended to scale. With each of these jobs, this is a common interface. We usually just see the logs here: we can see where it's going, what's processing, what succeeded, and what failed. The reason these failed, by the way, to make it clear, is that I uploaded them twice. All those documents were already there, so it doesn't run into itself; we have one version of each of those in the format that we're looking at.

Moving on, you can see something I glossed over and didn't really mention. We have our project here, and going top down we have document sets and annotation sets, where we'll spend most of our time; that's kind of our data. I mentioned working with documents as data. When we load our documents here, we're kind of done with the documents. That's a simple thing to do, and then we can run the OCR against them and start creating annotation sets. So let's do that. Here's the OCR pane. If we look at the basics of what we want to do, you want to upload documents.

Create targets, the annotations, the answers. You can import them if you have them; we've worked with clients that have them, and we can convert from a couple of different common formats. Or you create them, and most people are creating them with the annotation tool. So again: upload the documents, create the answers (the targets, the annotations), and then train a model. Those are the basic three things we want to do. But there's an extra step in here. Those natural language processing models need to see the language, and they need to see the tokens. So the one thing we need to do is run the OCR process. We've called this intelligent OCR because, like I mentioned before, if you have a modern document, a typical PDF that's very clean, there's no need for us to use computer vision to read it in an OCR way. We can just use the embedded text. The embedded text is sitting there in the document, just waiting to be read quickly and without error. It's the way the document is meant to be read. But when we get an image from a camera phone, or those faxes I talked about with the UCSF referrals, we don't have any embedded text. So here we're going to use something that's dynamic to run the OCR.

And again, that's optical character recognition. We're using state-of-the-art OCR models, and our team has been working these last couple of months to further improve the models that we have. It's a pretty exciting process, maybe for another day. For those who are familiar with PyTorch, we're looking at two basic models that run here. We're going to read everything like a big image, and then we're going to segment the text, finding all the words on the page. That's a key aspect of Document AI: we're looking at things as tokens. And when we say token, we don't mean character, we mean word, or really more than what "word" usually means, because a lot of them aren't even words: ID numbers and things like that. They're space-delimited tokens.
So that's a key aspect of what we're working with here. I'm going to kick one of these off and then explain a little bit. It'll run pretty fast, but…
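To make the idea of a space-delimited token concrete, here is a tiny Python illustration. The sample line is invented, and plain whitespace splitting only stands in for what the OCR segmentation model does on a page image.

```python
# A made-up line of text as it might appear on an estimate.
line = "Claim ID: A-10482  Total: $1,250.00"

# str.split() with no arguments splits on any run of whitespace.
tokens = line.split()
print(tokens)
```

Note that `A-10482` and `$1,250.00` each come out as a single token even though neither is a word in the dictionary sense.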

Okay, so I have queued that job to run. Let me show you what that looks like in the jobs view, too. Here we have a pending job. You can click on it and watch as it progresses, and it keeps all the timestamps, very typical kind of stuff. But let me go back and show you the OCR interface; I can run this again and again if I want to. With the OCR method, which I didn't show, this is where we're dynamic with what we can handle. The list sometimes shows up on the screen and sometimes doesn't, so I'll explain what we have right now: Tesseract, PDF text extraction, docTR, and best. Tesseract is a very common off-the-shelf library, and it's what we used for the first year. PDF text extraction is the mode I described where the text can just be read from the document, and we don't have to involve computer vision. And then docTR is the higher-level OCR library, modern computer vision, like I was saying, with segmentation and then recognition: every token gets recognized, including some handwriting. We're working on other language models. Right now, we're working on improving those two base models, to be more precise about finding the tokens and better at recognition. So this is a very active area for us. docTR was a starting point; ours is an H2O fork of a really good library, docTR, and we've been making it our own for the past couple of months, with some really promising results.

What that means is, if you've used OCR in the last few years or so, you know this is something that keeps getting better, tracking with computer vision getting better as well. For those familiar with them, the MNIST digits have had a really good deep learning solution for a long time. And yet OCR tools are not as clean as you might think from seeing the zero to nine.
A lot of the challenges come when we're getting documents that don't look so clean: camera phone pictures, or just images of any type, poor-quality scans that have multiple documents on top of each other. The deep learning methods really thrive in that environment. If you're familiar with Tesseract, or have used any off-the-shelf OCR, I think things have improved, and we see that we can get good results against some pretty difficult documents. So that's the OCR method here.

But "best" is the one that's dynamic. Like we saw earlier, I've loaded two documents here: I've got an image, and I've got a multi-page PDF. We can dynamically use the right tool for every page within each document. You can use Tesseract specifically if you want, or docTR, or PDF text extraction. But most of the time people want "best," unless they know, like UCSF with those faxes, that there's no embedded text, in which case we just skip even the vetting step, which is really quick anyway. With "best," we look at each page and ask whether we trust the embedded text, if we find any. If so, we use PDF text extraction. If not, we use an image model. And this is an active area of work, so I think in the next few months we'll start to get more options showing up here. When we bring in multiple languages and other GPU-optimized settings, which again is the work the team is doing, these options will get a little more complex, and we'll try to keep the user interface simple as we do that.

So this has long since run by now. I'm done with the document set: I've loaded mine, I've got two documents, eight pages, I've run OCR, and I've got what I need. The system has now seen what those documents are, and it has created two annotation sets here.
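The per-page decision behind the "best" setting can be sketched roughly as follows in Python. The trust check used here (a minimum length plus a minimum share of printable characters) is purely an assumption for illustration; the talk does not say how the product decides whether embedded text is trustworthy, and the method names are placeholder strings rather than real configuration values.

```python
def embedded_text_is_trustworthy(text, min_chars=20, min_printable_ratio=0.9):
    """Hypothetical heuristic: enough text, and mostly printable characters."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    printable = sum(ch.isprintable() or ch.isspace() for ch in stripped)
    return printable / len(stripped) >= min_printable_ratio

def choose_ocr_method(embedded_text_per_page):
    """Pick an extraction method independently for every page."""
    return [
        "pdf_text_extraction" if embedded_text_is_trustworthy(text) else "image_ocr"
        for text in embedded_text_per_page
    ]

pages = [
    "Auto repair estimate for vehicle 1234, total due $500",  # clean digital page
    "",                                                       # scanned or faxed page
    "\x00\x01" * 20,                                          # corrupt text layer
]
methods = choose_ocr_method(pages)
print(methods)
```

The point of the sketch is only the shape of the logic: the choice is made page by page, so a single PDF can mix digitally generated pages and scanned ones.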
So I move to annotation sets. Annotations, and again I've been using a lot of synonyms: labels, targets. This is really where we're storing all the data about these documents. The annotation set I just created, the one that matches the name I used, has this attribute of text. But when we loaded these, we also pre-created one that's set up for you to create labels for whatever project you want with these documents. I'm going to first show you what the text looks like. There's not much to interact with. The OCR is done and we can see it; mainly this is just a step in the process of getting us to run the model. But here's the first time you're actually seeing these documents. What we're showing is an active labeler. These boxes are created by the OCR process, but they're live and they can move. There's not a good reason to do that here; I'll show you why that matters for the labels. But a little bit of what the labeler can do: it's dynamic. We can see the documents, and I can turn the boxes on and off, I can turn off the labels within them, or both, or neither. Neither is nice for just showing what documents we're looking at. We're looking at these auto repair estimates, just pulled from the internet. Here we have a multiple-page PDF.

So page one, page two; let me expand this a little bit so you can see. This one is "+0," so it's zero-indexed paging. I'm flipping through just by hitting left and right on the keyboard, but you can click on them as well to go through the multiple pages that I loaded. Like I said, I loaded one PDF, and then this one is just an image. It looks pretty clean, I apologize for that, but it actually was a one-page image. And then there's our PDF, so we can go through these. If we turn the labels back on, we see what the OCR is reading, and this was embedded text, so it reads really cleanly. We can see just how precise some of those are. As you get used to these, the boxes are mostly right in a line; you can see this one is a little different, a little jumbled. You can see that these are reading every token individually, and being slightly off doesn't really make any difference. So the OCR process has gone and read all of these, and you can see some of what it's read here. It's a little bit fuzzy on this one, and on a decent run through these there are some mistakes in what it's done. We're always working on trying to read as much as we can. We need both pieces of information. We need the labels, which are what most of us would think of as the OCR results: what's the text that's in there. But just as key are these boxes, which tell us where the tokens are on the screen. That's what Document AI needs to run.

So, looking through quickly, I'm going back to the annotation sets. You can go back and forth from what we call page view; I went over that pretty quickly. I'm going to go into the pre-created one, which doesn't have these boxes. This is where I would create my labels.
Whatever I want to do, I can set up. Again, this is dynamic: I haven't started a project, but I can very easily. So I will create a brand new attribute. I've got that ready to go: "label." It's really key that this specific name, "label," has its text in all lowercase; that's what we need to be able to train a token classifier. I'm going to add this here, change it to a drop-down, and create some values. For speed, I'm just going to create a few to show you what these are, and then I can jump over to the project where we've got this ready to go. This is a dynamic labeler. If we wanted to highlight something like that, then immediately, as I created those, I've got my A. I click on that one, there's a B, and so forth. So you can imagine doing that through the documents, looking at key pieces. Some of this, like what I'm looking at here, the report date, a lot of times stays constant across these pages, so you might find it on every page. Most of the information doesn't. What they'd probably be going after is in this table, in this sort of summary: how much does the auto insurance company owe, the deductible, the final payment, things like that.

So let's flip over really quick to the one that's already prepared, and I'll show you what that looks like. Here we have one that's been labeled. We've got some things that are a little cleverer than A, B, C. I've created some things that an insurance company might want to see out of this. It's just hypothetical, but it's very typical of what our customers are seeing. We're getting some basic information out of the header, and like I said, we're getting some information from the detail over here: the subtotal, the total, the deductible, and the insurance. So we're thinking classification models at this point.
So this is our classification schema, essentially: these are the classes within something we're calling labels. We're going to train a model that looks at the attribute, as we call it, named label, with these classes, including everything I haven't labeled. There's a lot of nothing, things I'm not interested in for this use case. Different use cases might be interested in all of the tabular data, or in something else. We don't even have headers here; we're going to get a model to learn what we're doing without it even seeing that labor is what's in the second column. Once you see these documents over and over, you'll see that they actually follow a pretty common format. This is a different format here. You can see this is not as clean; it's been faxed or copied or something like that. But there's a similar kind of table, so we can get similar information from it. That's generally what we're doing here: by going through enough documents, we can get a model to learn from seeing diversity in the same kinds of things. Take vendor. If we think about what we're asking this model to learn, there's no preceding text around the vendor. And if this is your first document, you might not even really get a sense of what these are. Once you see these over and over, it becomes easier. And I find we change our annotations when we do this internally.
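One way to picture how drawn annotations become per-token training targets, including the "everything I haven't labeled" class, is the Python sketch below. The center-point containment rule and the background class name `none` are assumptions made for this example, and the token and region coordinates are invented; the talk does not describe how the labeler maps boxes to tokens.

```python
def center(box):
    """Center point of a box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def contains(region, point):
    """True if the point lies inside the region (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = region
    x, y = point
    return x0 <= x <= x1 and y0 <= y <= y1

def label_tokens(tokens, annotations, background="none"):
    """Give each OCR token the class of the annotation covering its center."""
    labeled = []
    for text, box in tokens:
        label = background
        for region, name in annotations:
            if contains(region, center(box)):
                label = name
                break
        labeled.append((text, label))
    return labeled

# Hypothetical OCR tokens and one annotation drawn around the total amount.
tokens = [
    ("Total", (400, 700, 450, 715)),
    ("$1,250.00", (460, 700, 540, 715)),
    ("Thank", (50, 900, 100, 915)),
]
annotations = [((455, 695, 545, 720), "total")]

labeled = label_tokens(tokens, annotations)
print(labeled)
```

Only the value is labeled, not the keyword next to it, and every untouched token falls into the background class that the model also has to learn.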

So I've labeled this vendor, and it's just right there in the middle of the screen. If we head back to what Document AI is going to do here, using these natural language processing models: the models are going to use all the tokens around it, all the context, to try to classify what we're looking at, and also that pixel information. As the model looks at multiple documents, over and over, it's going to start to see that the vendor is the thing that usually appears in the same spot, near an address or something like that. It's going to use that, but we're not teaching it that. Some of you may be familiar with templating models, where you have cues, a form sort of thing: "damage assessed by," and the name is David. If it's the same format every single time, we might train on "here's the key word, and then we move over." But that's not what we're doing. We go after the specific text that we want to deliver in the final results and let the model work out how it gets there. And it gets there with volume and, ideally, diversity. Like I say, these models can extrapolate; they can generalize to different templates. As we move through, most of these are going to be similar but different, and we want the models to learn that. Here we have almost identical content in a different format.

But, you know, the vendor's right there in the middle, and I think that's me not labeling it. So we're going to use a natural language processing model to look at all of these classes. To get specific about how we're working through these, when we go back to the annotation sets, I'm going to show you the attributes again. The first one I labeled looked kind of like this: we had the text attribute, which is the OCR results, and I called it tokens over here. I created the initial labels, and they have label and class over here. These are the key things that drive the two different types of models that we have in H2O Document AI. Label is, again, for token classification. Class, which I haven't talked about much, is for classifying each page. This is the case with the UCSF referrals: they would get on average about 25 pages in each PDF. We think of it as a referral fax, that's what's coming in, but there's a fax header page, and there could be clinical notes, lab notes, insurance information, all sorts of stuff that isn't quite the referral form that they're looking to process. So we have two models operating at that point. One has been trained to classify each page into the different types; there are about 15 different types that they're looking for, and we classify each page that way. Then there's the token classification model, which runs on each of those pages and looks at the 116 different classes. So label, the green one, will drive you towards token classification, and class will drive you towards page classification. Either way you're setting up the targets, the labels, the answers, so that the model can understand what its task is, what it's trying to do. And I do want to mention our documentation, since we're going to run out of time here. We have documentation here that walks you through these flows. Everything I'm saying, you can walk through at a slower pace, going back and forth.
It's going to call out exactly how you do these sorts of things. I find myself actually going to the beginner tutorial a lot to show people how this works. The documentation walks you through the entire process, so wherever I'm going too fast, you can go back and work out the ins and outs with the documentation that we have. So I've got two different annotation sets, and the last major step I need to do to train a model is to merge these. I've got 21 documents and 87 pages; I didn't actually label all of those. So I'm going to merge the OCR results, so that our model can see all of the text and all those locations, with the answers. Effectively, what we get is the OCR results joined to how I labeled these.

So here we have a few of these; this is what the model is going to see. Let me show you one that's set up for this. The merged results are going to have the label and the text. Essentially, we're taking these two datasets and applying the labels; that's effectively a join. And this is a case where we built the tool to try to help you pick the right thing. We have these dropdowns, and we pick what we need on both sides to make sure that your model has everything it needs to run. If you're picking a token annotation set, it knows which ones have tokens. And here, looking at token labels, it knows which ones have that green label format inside them. So we can pick these and merge them, which I have done, and we get a resulting annotation set out of it. This is where we're again managing the datasets for the documents: each one of these rows represents an individual piece of the puzzle. So we can keep labeling these, and right here I've merged those two. Sorry, that one's the prediction; that's what threw me off there. These are the results that go into modeling. I've got the OCR results, and here we're showing the label alongside the text. This is what the model needs to see. It sees the text "Bishop", it knows the location, this bounding box, and it knows it's also the vendor, because when I labeled it, I highlighted all four of those tokens in one shot with one simple box. Now we merge that: where the boxes overlap, we intersect the two, and this is what our model needs to see. That's one key aspect of what we're doing with Document AI.
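The merge described above, where one hand-drawn box is intersected with the OCR token boxes so that every token ends up with a label, can be sketched roughly as follows. This is only an illustration, not the product's implementation: the `Box` type, the `apply_labels` function, the 50% overlap threshold, the empty string used for unlabeled tokens, and the sample coordinates are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page coordinates (hypothetical layout)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def intersection(self, other: "Box") -> float:
        """Area shared by the two boxes (0.0 if they do not overlap)."""
        w = min(self.x1, other.x1) - max(self.x0, other.x0)
        h = min(self.y1, other.y1) - max(self.y0, other.y0)
        return max(0.0, w) * max(0.0, h)


def apply_labels(tokens, annotations, min_overlap=0.5, default=""):
    """Join OCR tokens to human-drawn label boxes.

    tokens:      list of (text, Box) from OCR
    annotations: list of (label, Box) drawn by a labeler
    A token takes the label of the annotation covering the largest share
    of its area, provided that share is at least `min_overlap`; otherwise
    it keeps `default`, standing in for "everything I haven't labeled".
    """
    merged = []
    for text, token_box in tokens:
        label, best = default, 0.0
        area = token_box.area()
        for name, label_box in annotations:
            share = token_box.intersection(label_box) / area if area else 0.0
            if share >= min_overlap and share > best:
                label, best = name, share
        merged.append((text, label))
    return merged


# Made-up example: one "vendor" box drawn around two tokens.
tokens = [
    ("Bishop", Box(100, 50, 160, 70)),
    ("Supply", Box(165, 50, 230, 70)),
    ("Invoice", Box(400, 50, 470, 70)),
]
annotations = [("vendor", Box(95, 45, 300, 75))]
merged = apply_labels(tokens, annotations)
```

The labeler draws one box; the model still gets a target for every token, which is the token-for-token view the transcript refers to.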
We're merging the OCR results, fine-grained down to every single token, with the way humans typically label. If you look at some of the tabular ones, which I'm not going to show today, we can do that with an entire table: we can label an entire column, then merge it with the OCR results and get what the model needs to see, which is token for token. It's classifying these according to the schema that we gave it. So at this point, having run that apply labels step, I can train a model. And here it's actually pretty simple. As I said, we've got a tool built for data scientists, but it doesn't really take deep knowledge of data science to run these. We found internally that we're running the same process each time. We have four Kaggle grandmasters on the team; we have people that know what they're doing with these models. We run sensitivity analyses to back some of these things up, and we find the results are not very sensitive. The models we use are geared for this task, and they're not very sensitive to the hyperparameters. The only thing is that you'll probably let the epochs go a little higher if you have a small dataset, to get the most out of something really small, which is what we're showing here: 20 documents. With bigger document sets, you can run it a little faster with fewer epochs. But otherwise, learning rate and the typical parameters you would get in H2O Hydrogen Torch or any other deep learning framework: not very sensitive to those. So it's actually pretty straightforward to create a model, which I have done. I'm not going to walk through quite so much of this in the interest of time, but you can walk through that tutorial to see it. So that's this model, and I want to show you what the output looks like with a different model here where we have the results.
So here we have 11 documents. As I clicked on it, we've got this modal popping up, which is similar for all annotation sets, document sets, and models: it shows you the information about it. The key part here is the accuracy. I've run a model on medical lab test results, and I've got a schema, which you can see here. So there's three kinds of panes, and this is the accuracy. I've trained a model; as I say, it's not too exciting: you push a button and it runs the models we've been talking about.

Here's where I react to the output. The model ran fairly quickly: this ran in two minutes. Now, we're running it on 11 documents, 20 pages. That's small, but it's enough to get started. Our micro average is already pretty high, and our weighted average is 96. But the macro is pretty low. Why is that? We're actually looking at a table here, and what the model has done quickly, with this initial result, is get really good accuracy on some classes and really poor accuracy on others. You can see this with our metrics here. I've got support, which is a count of tokens, and the weak classes are the ones with little support: so far the model is focusing where the volume is. That's not going to be a limiting factor. You might think, "Does that mean I need to get more and more? It can't handle unbalanced datasets?" No, it just means we do need a few more than 12 examples. So far, the model is just going after where it's seeing most of the volume, the repetitive nature behind these. But again, it's 11 documents, 20 pages: not very much. With deep learning, we typically think we need a lot of examples. But this is the power of transfer learning. We're starting from a model trained on 11 million documents, and it's already able to figure out the structure of some of these documents, which are in tabular format. That's a key part here. Now, one of the best practices is to look at these metrics and understand what to do about them. Here I have a pretty simple understanding of what that is: I need to get a little more data so the model can see more of the facility name, the line test flag, some of these things that are fairly rare.
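To see why the micro and weighted averages can sit in the mid-90s while the macro average is low, here is a small worked sketch. It assumes the per-class metric is F1 and uses invented token counts; only the class names `facility_name` and `test_flag` echo the transcript, and none of the numbers are the webinar's actual results.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-class token counts: (tp, fp, fn).
counts = {
    "test_name": (480, 10, 20),
    "result": (470, 15, 30),
    "facility_name": (2, 1, 10),   # rare class, few tokens seen
    "test_flag": (1, 0, 7),        # rare class, few tokens seen
}

per_class = {name: f1(*c) for name, c in counts.items()}
# Support is the number of true tokens of each class: tp + fn.
support = {name: tp + fn for name, (tp, fp, fn) in counts.items()}
total = sum(support.values())

# Macro: every class counts equally, so weak rare classes drag it down.
macro = sum(per_class.values()) / len(per_class)
# Weighted: classes count in proportion to their support.
weighted = sum(per_class[n] * support[n] for n in counts) / total
# Micro: pool all tokens first, then compute one score.
micro = f1(
    sum(c[0] for c in counts.values()),
    sum(c[1] for c in counts.values()),
    sum(c[2] for c in counts.values()),
)

for name in counts:
    print(f"{name:15s} f1={per_class[name]:.2f} support={support[name]}")
print(f"macro={macro:.2f} weighted={weighted:.2f} micro={micro:.2f}")
```

With these made-up counts the two high-volume classes dominate the micro and weighted averages, while the two rare classes pull the macro average down, which is the pattern described above.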
So with 11 documents, 20 pages, you need to see more documents that focus on these. Or maybe it's already accurate enough to start doing what I mainly want; I can think about that. At that point, I might already choose to deploy a model and start to use it in a trial scenario. These processes are so labor intensive for humans that a model with these kinds of metrics might already be useful. Chances are you want to add a little more; typically it's still worth your while to label a few more documents. These 11 documents and 20 pages can probably be done in less than an hour, easily, with our labeler these days. And it just gets faster as you go along, once you're set up and thinking about it. The consistency of the labeling is a big thing, and we'll talk about it when I do a follow-on session later on the ins and outs of how to get the best results. But what we find, again, with a great high-end team of data scientists, is that we don't spend much of our time modeling. We spend our time reacting to these results and wondering what we have done: how did we create those labels? We even saw it here; in fact, I said myself, I didn't even label a vendor correctly on there. That kind of inconsistency is most of what we've seen over multiple years, and that's where we tend to focus: looking at these model metrics and wondering what to do about them. We can also get the predictions. I'm not going to show too much of those; we saw that the models on the last one were very accurate at that point. But on this one, we can see the predictions. These are in-sample; you'd see that out-of-sample ones look the same. This is what the model predicts, so we can see where some of the errors are made with some of the tools that we have in here, and it helps you take that path towards getting a better model.
As we said, it's kind of a feedback loop. I can load more documents in and label those as well. And I have this here: this 20 and 90 is actually a different dataset. So now I'm up to 34 documents and a lot more pages with that. So I can take the results of these and ask whether it's faster for me to label things with the existing system, or for the model to label them while I just correct. That's usually the inflection point you want to look for when you're starting these tasks. And you will get there. We've seen ten-to-one kind of savings when we put the energy into getting a good model: in the next phase, we can do multiple documents and it's 10 times faster. It depends on how many targets you're going after and sometimes on the structure. But we have evidence from use cases we've run before that it's much faster to let the model start and just correct its mistakes, and much faster to get through the labeling.

So with that, looking at the time, I will do one last thing, and then we'll turn it over to questions. I've got the model here, and that was a quick-running model. Typically you would iterate: get a new model, see where you're at. It's really easy with these Kubernetes workflows; you don't have to think much about it. I didn't actually hit the button on train, but it's really easy to fill that out, and a minute later you've got a model training. Then you can go back to adding more data or just seeing how it does. The ease of doing that is really nice; it encourages me as a front-end user. I like queueing these processes, sometimes just out of curiosity, running multiple different OCR sets. But the key piece is: when we have that model and we want to deploy it, what do we do? Right now you go to the Project tab, and we can publish a pipeline. I've got my model here, and I can call my pipeline something. This next part is very new, for those that are a little familiar with the tool: the custom post-processor. We can add these; we're working on this interface, and we've got some out-of-the-box post-processors that we can help you with.
But this is a key aspect of getting the information that you want out of here. We've got a few things, and we'll work on having the documentation track exactly how to use them. We've got supply-chain-specific post-processors, and we've got generic post-processors in here. It's really exciting to let customers drive this; until now we've had to interact with people to help them through the post-processors. Post-processing is how you take something from the token classification. If you think about what I've been saying, this model's job is to look at each of those tokens (or pages, if you labeled pages, which I haven't talked much about) and say: is this a token of interest? And if so, what is it, and with what probability? After that, how do I aggregate those tokens into entities? Do I want to enforce table structure? There are various things you might want to do. That's what we can do with the post-processors. We have had a post-processor hooked up to every pipeline we've ever run; there's always something you want to do. Usually it's something simple, like taking "Mark" and "Landry" and putting them together as one customer name, rather than "Mark": customer name, "Landry": customer name. That's the most common post-processor. We have table formats for probably half of them, things like that. So there are custom post-processors, but at this point we're going to run with the default and get a REST API. We can hit publish; you can see I've done a bunch of these so far, and it takes about two minutes or so. So you hit publish, and then you have a REST API right here that you can use to score. Again, the documentation goes into more detail on how you use that. But it's an easy-to-consume REST API; the really simple way it works is document in and JSON out.
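The most common post-processing step mentioned above, joining neighbouring tokens of the same class into one entity, might look something like the sketch below. It is a simplified illustration and not one of the shipped post-processors: the `merge_entities` function, the `(text, label)` token shape, the empty string for unlabeled tokens, and the sample tokens are all assumptions, and it relies only on reading order rather than on token positions.

```python
from itertools import groupby


def merge_entities(tokens, ignore=""):
    """Collapse runs of consecutive tokens with the same label into entities.

    tokens: list of (text, label) pairs in reading order.
    Returns a list of (label, joined_text), skipping the `ignore` label.
    """
    entities = []
    for label, run in groupby(tokens, key=lambda token: token[1]):
        if label == ignore:
            continue
        entities.append((label, " ".join(text for text, _ in run)))
    return entities


# Made-up token-level predictions for one line of a document.
tokens = [
    ("Customer:", ""),
    ("Mark", "customer_name"),
    ("Landry", "customer_name"),
    ("Total", ""),
    ("42.00", "amount"),
]
entities = merge_entities(tokens)
print(entities)
```

A real post-processor would likely also look at bounding boxes, so that two same-class tokens on opposite sides of a page are not glued together; that check is left out here to keep the idea visible.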
So I've been doing things in batch here; we've seen that we have 20 documents here, 14 here, 90 pages, all of that. The REST API is single document in and single JSON out, and that can handle multiple pages. It has a specific schema. Again, the documentation can walk you through that, but it's pretty basic once you get the hang of it. It's going to show you the location of the predictions, the class of the predictions, and the probabilities. You can even get multiple probabilities if you want, so you can see the second, third, and fourth most likely classes as well. That's key to some of the workflows that our customers have.
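As a rough picture of what reading the runner-up classes could look like, here is a sketch that ranks the class probabilities of one prediction. The dictionary layout, field names, and values are invented for illustration; the actual JSON schema is defined in the product documentation and may look quite different.

```python
def top_k(probabilities, k=3):
    """Return the k most likely (class, probability) pairs, best first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical single-token prediction; not the real response schema.
prediction = {
    "text": "Bishop",
    "box": [100, 50, 160, 70],
    "probabilities": {
        "vendor": 0.81,
        "customer_name": 0.12,
        "address": 0.05,
        "": 0.02,  # "not a token of interest"
    },
}

ranked = top_k(prediction["probabilities"], k=2)
print(ranked)
```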

And so that JSON comes out very easily. The REST API is easy to use as well: you're posting a document. It's an asynchronous REST API call, but we'll run it back in 1000 seconds or so per page. So you post, then you can poll to figure out when the job is done, and then go obtain the JSON payload when it's done. So those are the simple three calls, and that's in our documentation as well. So with that, that was a breeze-through, and we'll probably do some questions. I think we'll probably do a follow-on to get into the details of some of these things: talk a little bit more about the OCR, which some of the questions touch on, and a little more about how to use it. But I definitely invite you to bring a document set and follow this process, following along with the documentation we have. One thing that's been interesting to me is that there's not a Jupyter notebook out there waiting for you from everyone under the sun, like you would see with a lot of tabular datasets. A lot of things that we've done as a company are similar to what other people are doing. Now, of course, document AI is kind of a burgeoning thing: you'll see lots of the cloud players have offerings, and there are a lot of startups and a lot of investment here. It's a really important task; it takes so much time when done by humans. So it's a really big thing that these models are getting good enough and the OCR is getting better. The state of the art on both ends, OCR and modeling, is able to do things that you really couldn't do a few years ago. We've been able to do things where it's efficient to replace humans, or work hand in hand with them; sometimes you put humans in the feedback loop.
But, you know, we find customers just amazed at five-to-one savings on the time to run some of these versus having humans do them: thousands of hours saved in these processes. So it's definitely a really interesting space, and we have a lot of interest here. As data scientists, we're going to keep working on making this toolset better and better for you to use.
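The three-call pattern described earlier for the REST API (post a document, poll until the job is done, then fetch the JSON) can be sketched generically as below. To avoid guessing at real endpoints, the three calls are passed in as plain functions; `score_document`, the `"done"` and `"failed"` status strings, and the `FakeService` stand-in are all assumptions for illustration, and the actual routes and payloads are in the product documentation.

```python
import time


def score_document(submit, get_status, get_result, document,
                   poll_seconds=1.0, max_polls=600, sleep=time.sleep):
    """Submit a document, poll until the job finishes, return the result.

    submit, get_status and get_result are caller-supplied functions that
    wrap the three REST calls; their names and the status strings used
    here are hypothetical.
    """
    job_id = submit(document)                 # call 1: post the document
    for _ in range(max_polls):
        status = get_status(job_id)           # call 2: poll the job
        if status == "done":
            return get_result(job_id)         # call 3: fetch the JSON
        if status == "failed":
            raise RuntimeError(f"scoring job {job_id} failed")
        sleep(poll_seconds)
    raise TimeoutError(f"scoring job {job_id} did not finish in time")


class FakeService:
    """In-memory stand-in for the scoring service, for demonstration only."""

    def __init__(self, polls_until_done=3):
        self.polls_until_done = polls_until_done
        self.polls = 0

    def submit(self, document):
        return "job-1"

    def status(self, job_id):
        self.polls += 1
        return "done" if self.polls >= self.polls_until_done else "running"

    def result(self, job_id):
        return {"job": job_id, "pages": []}


service = FakeService()
payload = score_document(service.submit, service.status, service.result,
                         document=b"%PDF-...", sleep=lambda seconds: None)
print(payload)
```

In real use the three functions would wrap HTTP requests to the published pipeline, and the polling interval would be tuned to how long a page typically takes to score.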
