ON DEMAND

H2O Document AI Part 2

As covered in the first session (Getting Started with H2O Document AI), H2O Document AI makes highly accurate models by using a combination of Intelligent Character Recognition (ICR) and Natural Language Processing (NLP) to leverage learning algorithms for optical character recognition (OCR) and document layout recognition. This webinar will provide a deep dive on H2O Document AI, including use cases and a hands-on follow along lab. Prior to this session, we recommend viewing the on-demand recording from the Make with H2O.ai session: Getting Started with H2O Document AI.

3 Main Learning Points

A brief recap of what H2O Document AI is and how it works
H2O Document AI applied to various use cases
A hands-on follow along lab with practical examples

Read Transcript

Hi everybody, I'm Mark Landry, as I said, and we've done a part one of this, which is available on YouTube. So you can watch that to catch up if I talk over some things or you're not familiar with it. It looks like a handful of you have seen that, but many have not, so we will add a brief recap. The very first thing we'll do is talk about that a little bit, so it won't be totally foreign. But this time we're going to go in a slightly different direction and talk about some of the use cases that we're seeing. That can help everybody, whether you're new or not, recognize what these Document AI use cases are. Document AI is not like a major AI use case in itself; as a tool it has been sort of rare, but it's very powerful. It's gaining a lot of traction, it's important to us, and we're seeing a lot of other people invest in this space. And when we talk to clients, there are a lot of times when people aren't really cognizant that they even have a Document AI use case. They might have documents lying around: if you had this tool, could you do something with them? So we're sort of in that stage, versus a lot of other machine learning techniques that you can read about all over the place. So this is a little rare, but very important. And the last part, which we're going to spend about half the time on, is a hands-on follow-along lab with practical examples. But if you're not familiar with it, don't have access to it, or don't intend to follow along, you should still be able to keep up. I'm mostly going to follow our documentation, actually, but expand on some things we didn't talk about last time. And this time we've made a data source available so that you can follow along. Even if not now, this will go on YouTube, so you can follow along later: you can pause it, you can do some more stuff. We're going to share a small dataset, and we've also shared a bigger one.
So even if you do follow along with that, know that we're going to build things that run a little quicker for the purpose of this demo. But it's the same idea, and we'll share both sizes of that, so you can run something bigger and actually try something out. With that, let's start with the overview. This is going to be similar to last time, but sometimes it's good to hear it again, and it's good for me to change the way I present this each time. So, moving left to right here: Document AI is about documents, not surprisingly. Some people think of that as PDFs, but images are a really common thing here too. So camera phones, faxes, as you heard Bob Rogers talking about with UCSF earlier: we find all sorts of clients that have low-resolution documents, maybe stored as PDF, maybe stored as an image, and the format doesn't matter too much. One exception is that sometimes you'll get a nice, clean document that was saved digitally from the start, from a webpage or something like that, or your ERP system created it. There's extra information inside that document; it's a special class of PDF, with embedded text, we call it. And we'll talk a little bit about how that makes a difference. So we start with those documents and run them through a pipeline. I'm going to cover a few main phases; you'll see them here, and then I'll go back and talk about them. We're going to ingest the documents and generally turn them into images, unless there's embedded text, because we need to be able to see them. We're going to run intelligent OCR, that's optical character recognition, so we're going to recognize the characters from an image. One thing you'll see is that we paginate those PDFs and then run them through the OCR, so that we can use a language model to understand the text. But we actually use a multimodal model.
So there are different modes of learning: text is one, the layout is the other. We need to see not just what the text is but where it exists on the page. And we use these models to do our two main tasks, entity extraction and page classification. The last bit is post-processing; we'll cover that lightly here. We'll show you what it is, and you can actually use your own if you want. So those are the main building blocks of getting a document pipeline together. And we're going to productionize that with a REST API as well. So again: ingest documents, run some intelligent OCR to read the characters from them, use the embedded text if we have it, train a model, and use some post-processing. Those are the main phases of what we're doing, and you'll see that our UI is laid out that way. Also, with the UI, a big difference here: there's a lot of supervised data in the tabular world, but most clients do not have labels for Document AI. So a big part of what we do with the product you're about to see is this first one here: the self-service labeling, validation, and refinement.
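The order of the phases described above (ingest, OCR, model, post-processing) can be sketched as a chain of small functions. This is only a toy illustration, not the H2O Document AI API: every function name, the `Token` and `Prediction` types, and the rule inside `predict` are hypothetical stand-ins. The one idea it tries to show is that the model step looks at both the text of a token and where it sits on the page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    text: str
    box: Tuple[int, int, int, int]  # x0, y0, x1, y1 on the page image
    page: int


@dataclass
class Prediction:
    label: str
    text: str
    page: int


def ingest(document: dict) -> List[dict]:
    # Hypothetical: split a document into one record per page image.
    return [{"page": i, "image": img} for i, img in enumerate(document["pages"])]


def run_ocr(pages: List[dict]) -> List[Token]:
    # Hypothetical stand-in for OCR: each toy "image" already lists words and boxes.
    return [Token(text, box, p["page"]) for p in pages for text, box in p["image"]]


def predict(tokens: List[Token]) -> List[Prediction]:
    # Toy "model": uses the text AND the position of each token (assumed rule).
    preds = []
    for tok in tokens:
        looks_like_date = tok.text.count("/") == 2
        in_header = tok.box[1] < 100
        if looks_like_date:
            label = "referral_date" if in_header else "patient_dob"
            preds.append(Prediction(label, tok.text, tok.page))
    return preds


def post_process(preds: List[Prediction]) -> dict:
    # Collapse predictions into a flat record, keeping the first value per label.
    record = {}
    for p in preds:
        record.setdefault(p.label, p.text)
    return record


def run_pipeline(document: dict) -> dict:
    return post_process(predict(run_ocr(ingest(document))))


# One page with two date-like tokens at different heights on the page.
doc = {"pages": [[("06/24/2015", (400, 40, 480, 60)),
                  ("11/11/1977", (120, 300, 200, 320))]]}
print(run_pipeline(doc))
```

In a real deployment the whole chain would sit behind the REST API mentioned in the talk; here it is just a local function call.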

Great, sorry. So: labeling, validation, and refinement, so you can get started. If you load documents, we're going to teach a model how to use them. It works the standard way with machine learning: we need to teach it with the answers. We're working on providing pre-trained models, so that's a slight exception here, and you can take those in as well. But a lot of people start from nothing. And the strength of our tool is actually coming in with a custom schema that perhaps nobody's ever had, or that's just rare enough that we've not really seen it before. So you can start by annotating the documents on the screen. We provide an annotation interface that is fully hooked up with all the parts we need to get a model trained. Also, once you have a model, you can refine: you can predict against new data, and the predictions themselves can be used as new targets for a model to expand. It may take an adjustment: if you're getting a model off the ground and it makes some errors, you can correct those errors and add them to the dataset, incrementally increasing your dataset as you go. That's the powerful thing. It's a big part of the tool we built, which processes the workflows but also allows you to visually see and interact with the documents, both creating labels to start and refining labels as you have a mature pipeline, or as you're growing your model and growing that dataset, making the model stronger and stronger. So you're going to see the training user interface; we'll talk about it quite a lot today. And the integrations we'll touch on at the very end. When we get a model we like, a pipeline we like, using the OCR and all those phases I talked about, we're going to deploy that pipeline. You'll get a REST API that can score and can scale. Our pipeline is built for Kubernetes.
It's built to scale, built to add more nodes as you see fit for your use case, whatever the volume is. So it's just a matter of adding more nodes. And then at the end, human in the loop, which I jumped over a little bit: the review and correction. This is a phase of interacting with our tool. We've set it up so that you can interact with those predictions: you can review them, you can correct them. We're going to get a stronger review interface, a business-user-facing interface. But you can continue to add to your dataset here. So this is about what it looks like; this is what most people want. I breezed over those terms, so let me go back a little bit. In the middle, entity extraction and page classification are the main two. There are a couple of other things you can do, but those are the main two things people do. Entity extraction is called a lot of different things; named entity

Recognition is a popular one; that's usually generic, and we're usually going for custom, but the same principle applies as in NER. Page classification is just: if you look at a single page of that document, whether it's an image or PDF, what is it? So people are classifying different types of documents, processing them in different ways, trying to separate things, enabling workflows, and then your entity extraction often runs at the end. All the clients we've had so far run entity extraction, and about half of them also run a page classification task; they still want to separate the pages, because they're not sure what's coming in. So you can do those two things in the tool. Here we're looking at a trivial example of what Bob Rogers was talking about with UCSF: they get these medical referrals. And this is actually the example we'll see; we have some trivial examples you can play with, but they're of this type. We're going after specific information on this form on the left. And this interface on the right is designed to show you what we're going for once we process the full thing. As we walk down the right side, the actual class is the part on the bottom right of each of these: patient DOB, patient name, referral date, referral reason, refer clinic. There are a few more you'll see in this dataset that we have. But things like that are what we're going after for this page. And this is important: this is custom. It's usually up to the user. Even in common supply chain cases, as we've grown that over time, we have a common set of things that you want from a supply chain document, but certain customers want one or two extra things. So the schema is custom for what you need out of it. And what you want to do with it at the end of the day usually dictates what you want to pull out.
Some people already have processes set up, and this is the way to replace that. Either way, you want to think about what you're getting off of here and pull those out. So here we see Outpatient Psychiatric Clinic is pulled out. It looks like this is the referred clinic, but we've got different clinics here. We've got the date of referral pulled out. And we'll get to this again in the demo, but the way that we handle this with Document AI, we're not just looking at the surrounding context; our models are going to see all the data here. I talked about this being a natural language processing model, but the natural language processing model uses all the context of the page to deliver the specific output you want. Now, in this example the output contains a little bit of that context, so it's a little difficult to see: we have "Date of Birth" pulled in, we have "Name" pulled in. I'm fairly confident that generally clients don't want that, and I think this is actually an artifact of the way we inferred our labels at the start. Usually you can go directly after the data that you want, so that "June 24 2015" can be parsed into an ISO standard date, sent on its way, and used in a database or any other kind of process that you need. So here, again, we're extracting the specific pieces out of this document that you want for your use case. It's pretty much as simple as that. We'll talk about some of the use cases right now. These are some of the use cases that we have come up against. I think last time I said you can do about anything with these processes, and you kind of can with documents. So we're just sharing a few of these, because once you think of what it can do, especially as we move down to the bottom of this list, there are a few different key aspects of how this is laid out, things that you might not have thought of.
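As a small illustration of the post-processing step mentioned above, where an extracted value such as "June 24 2015" is turned into an ISO standard date, here is a minimal sketch using only the Python standard library. The function name `to_iso_date` is made up for this example, and it assumes English month names written out in full; it is not part of any H2O product.

```python
from datetime import datetime


def to_iso_date(raw: str) -> str:
    """Parse dates like 'June 24 2015' or 'June 24, 2015' into YYYY-MM-DD."""
    # Drop commas and collapse repeated whitespace so one format string fits both.
    cleaned = " ".join(raw.replace(",", " ").split())
    return datetime.strptime(cleaned, "%B %d %Y").date().isoformat()


print(to_iso_date("June 24 2015"))       # 2015-06-24
print(to_iso_date("November 11, 1977"))  # 1977-11-11
```

Real extracted text is messier than this (OCR typos, abbreviated months, day-first formats), so a production version would need more than one format and a fallback for values it cannot parse.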
So if you're familiar with these, if you've already been working with OCR, if you've perhaps wrestled with a spaCy model or something like that and you're just trying to get data out, you're familiar with this, all the way up to what Document AI solutions are doing, what our solution is doing.

You can kind of process anything with a document, think of what you need with it. And we're going to show you some of the tools so that you can get there specific to what you need again, so custom named entity extraction, you can think of it that way. Token classification is what we call that, that's sort of where it is. So the top is organized into a key component of use cases. So a lot of these are validation doesn't have to be. But validation is a generic term I'm using here is we've got a document. And we're trying to validate that the content from that document matches what we have in an external system, often a database or some registered system coming in or could be an extract that something sent. And you're trying to see this. In fact, that's sort of what this hardware line was air waybill design is that you've already digitally got an electronic version of that. But for some reason, there's a piece of paper that accompanies that actual freight. And you want to see that that piece of paper aligns with what your electronic record is. And that's what most of these, most of these are. So the bank ID documents is the same sort of things. Like I know your customer initiative, you've registered the customer's account. And many companies around the world require that, that you understand you ensure that that is your customer. And so the ID documents like a driver's license, or passport and things like that are often required to ensure the identity of that bank account holder. And you can see here, we've got six, in fact, this plus is there's two people using these invoices in this way, all in a validation setting, and even partial validation. There's a bit of a mix of whether it's the only source of truth for that document, or the validation. And so here prove a negative test and a bank statement. Some pieces of information were no some weren't. 
And digitization that's just a different category of your this is the only source of truth and actually a physician referral forms that we'll look at today, which are inspired by the real use case. Again, all right, was talking about with UCSF, so important use case for us. So these physician referral forms, there isn't anything known about the referral form, when it comes in, there's no secondary source, you could already maybe there's been validation. Sometimes they'll know the patient, but often they won't. And even if they do, it's not a direct link, necessarily. It's just that they have an inventory of potential patients. And the same with healthcare. Healthcare is the industry, it's a purchase order for supply chain purchase order here. And then judgments are really interesting one, sitting on its own had sort of flavor of validation, as well, in fact, but it's putting its own category, because this is an interesting one to us, again, that as we get familiar with this, thinking that it can do named entity extraction, classification. I guess this is a part two, if you've seen you know what this can do. And sometimes, even I guess I and probably others, like almost pigeonhole it in a bucket that it's just doing that sort of thing. Just tell me just get stuff off of that form that I can sort of see for myself, you know, humans do these tasks actually very well. So a lot of times we're trying to scale with a system that can replace humans and let them do something a little more fun than all toggling back and forth between PDF and the system that they're trying to check with all these validation use cases. But we are using a strong natural language processing model at the end of the day. And sometimes it can almost be surprising what that can achieve. And so here are your more traditional NLP use cases that they can do. 
So this use case at the bottom is judgment, the statement understanding. I can't speak too much about it specifically, but we're seeing a large document, 60 to 100 pages on average. And we're looking for specific types of the way that some of the information was recorded on that document. The use case depended on two different types: the way it was described could be of two different flavors. People can understand those reasonably well. We annotated what they were going after, but it still wasn't perfect. But

I actually suggested at first, maybe we try just extracting the fees, and then later using a separate stronger model to try to divide them into a and b, and the customer went for the doing both at one time anyway. And it was very successful, we had, you know, there are f1, our recall and precision were both in the 90s. So I guess it didn't blow me away, as almost just show me that these are powerful models, or remind me that we can, we don't just have to extract data from there. That's an important part of it. That's where we see most of our use cases. But we are using a very strong model that can achieve more than that if it sees these documents, if we can frame the problem in the way that we're doing. And we're able to do that for this judgment use case, you know, we'll add more problem types to document as we go forth. But most of the thing, most of the things we're doing the page classification, token classification are what gets the job done, and it wasn't able to for that one. So these use cases, it can be documented, I'll kind of even go a little off this list a little bit to talk about some of the things that we see that once you get the concept of what is possible, certain things just come up that maybe a little smaller in scope than you might think. So incumbent processes are very important. In fact, let me I'm going to move over here just a little bit to get this on the screen is one of the questions we ask people when we do go look at a potential use case, is it new, or is an existing use case? You know, we have existing use case, you can really understand where things are at. Usually there's people involved. Usually your goal is to reduce the number of people. So let the tool do all the simple stuff, and maybe have the humans waiting for the tricky stuff in there. A lot of these validation use cases, they wouldn't be done if they're sometimes wouldn't validate, there's usually not a reason to, you know, there's usually some discrepancies in there. 
And so when humans focus on those discrepancies pointing in the right direction, doing the job for them, when it's simple, these are things that are all coming to these use cases. And it's an evaluation for every use case is your tolerance for, for all these things. And it's different, there's no way to look at these one way from another. So if I'm back up to where I was trying to try to go with this, but we looked at some bank documents, and once someone kind of understood what was going on, they realized there's just some, there's something kind of simple that they had not seen, it wasn't necessarily a super high value task. It wasn't something that humans had done, because the scale of their documents was huge, but they realize it was it was just the hours of business. And they had an awful lot of templates, and can we use document AI to find those and show us where they exist, point us to there. So we can review the language as we have kind of some of the shock events that we've had in the past couple of years, with COVID and supply chain issues and how that affects things. So it's kind of a simple task. And in fact, Document AI would do very well in that situation, you know, a lot of times those hours are portrayed very consistently. And so it would not take much training data, most likely to get that off the ground. But it just gets your mind going. And so that's essentially where some of these back came up with vectors. So looking at these lists, you know, we have a variety of industries, we have a variety, documents, validation, and other this kind of a use case type with the actual document itself. Just be thinking of it can be anything can be digital documents that are just sitting out there, which sometimes are used as source of the truth. It can be surprising sometimes. In fact, with these statement understanding of several these documents, you know, a digital copy exists somewhere. 
But yet, these documents for some of these customers are used as the source of truth. And that's very interesting. So you might almost I've had a tendency to do this again, overthink, why is this necessary? They must have a digital copy of this already. And sometimes the answer is no. And so a pile of documents stored up over time, as a source of truth, perhaps a database gets migrated and moved around VoIP systems or combined and things like that. Perhaps it's just easier there or perhaps you don't have access to so the on these physician referral forms as an incoming document, as we'll talk about they don't control. It's the variety is high because everything coming in to UCSF from all different clinics is often on the template of the referring clinic that the one that comes in and so they don't get to enforce you uniform standard, so you're not working with a single template. And you've really got to generalize your understanding of what you want from it from that document. And that's what that's about carrying out. So again, let's take note of one view and a potential use case, is it new or existing? There's no right or wrong answer to this. But it's usually an important facet of trying to figure out how these document how we're gonna perceive the document. In this case,

having experience just helps in general, because people know the documents they know where they're from, you can evaluate them, you can see or at least understand a lot about the document quality that's coming in. And most importantly, how is that that problem handled as we're looking at the second bullet point, the use case type, is it validation versus single source of truth, we saw some other use case types I've broken them into. But these are two really big ones. So the validation use case is definitely interesting. Because it, it bounds the problem a little bit, and people are already doing it. And so you have to think of how can we help it not necessarily how people are already doing this one, maybe it's a validation, you haven't been able to carry out. And now you can, maybe you can do at a speed that wasn't possible before. So there's different ways. So you can still be new. But the validation use case, you can have some different steps, you can take one, you have a prediction, and you can bounce it up against another one. Most importantly, you have a sense of whether it's right, that's a little stronger than the model is confidence itself, which we'd like to calibrate that as much as we can, the stronger the models that get these calibrations will be coming out. But no matter what you do is anyone in machine learning will know, you know, no matter how confident you are, mistakes will happen. You know, this tool is not perfect, just like any machine learning model. Even humans aren't perfect in a lot of these use cases. And so that's really kind of what it gets is it's hard to illustrate this. But you know, what we look at is the tolerance for errors, I understand the potential errors there. So with an existing use case, you sort of have a framework set up with that, and how is this going to work? How are we going to validate this? How can we be sure that the model is correct? Well, in validation, we can have one extra step there. 
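The "one extra step" in a validation use case, bouncing the model's prediction against what an external system already holds, can be sketched as a field-by-field comparison. This is a minimal illustration under assumptions: the field names, the `normalize` rule, and the example values are all invented here, and a real check would need per-field comparison logic (amounts, dates, names).

```python
def normalize(value: str) -> str:
    # Assumed normalization: ignore case and extra whitespace.
    return " ".join(str(value).lower().split())


def validate(extracted: dict, record: dict) -> dict:
    """Return the fields where the document disagrees with the system of record."""
    mismatches = {}
    for field, expected in record.items():
        found = extracted.get(field)
        if found is None or normalize(found) != normalize(expected):
            mismatches[field] = {"document": found, "record": expected}
    return mismatches


# Hypothetical values: what the model read off the page vs. what the database says.
extracted = {"account_number": "12345678", "balance": "1,250.00"}
record = {"account_number": "12345678", "balance": "1,520.00"}

print(validate(extracted, record))
# {'balance': {'document': '1,250.00', 'record': '1,520.00'}}
```

An empty result means the document validated; anything else is what a human reviewer would be pointed to first, whether the model misread the page or the document itself is wrong.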
So that's that that really helps us get off the ground, sometimes it helps you focus especially so a lot of ways that human will still touch it will look at it. And we're looking for here is to speed them up through that process, the return on investment is usually time saved. And so how can we get if they're spending something like 30 seconds per document, can we get it down to where on a clean document, it's only two or three. And if it's a difficult document, it's maybe that same 30 seconds, and it can be the same even on the harder ones, because we can point them straight to where we were close. If we are in some cases, so if something doesn't validate, especially on something like some of the audit use cases, the natural documents did not match almost 20% of the time. And we can show exactly where that that figure is rather than going through pages and pages and pages and lining up accounts, you know, the tool is already done that it's lined up an account, it's found a balance. So whether it's right, or the document itself is wrong, either one looks the same to human, if you're validating that, and you want to check that. And so the tool, knowing where everything is in the document can also speed people up. And so there was 6200 page documents in that last use case there for judgment. That was a big part of the ROI on that. One is that we can sift through that that large document and there's not rhyme or reason to that specific document. Sometimes it is you know, it's early or you know, it's at a certain section, you can be quick with that. So you do want to take note of that don't underestimate human processes well. But in that case, it was much faster, because we were able to deliver the ranked predictions of where we see these and you should the first case is going to point in the right direction. You can see others if they existed, especially if they're not in just one section of the document. 
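The idea of delivering ranked predictions across a long document, so a reviewer starts at the most likely location but can still see the others, can be sketched as a sort over page-level hits. The `Hit` type, the confidence values, and the 0.5 cutoff below are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    page: int
    text: str
    confidence: float


def rank_hits(hits: List[Hit], min_confidence: float = 0.5) -> List[Hit]:
    # Keep reasonably confident hits and show the strongest candidates first.
    kept = [h for h in hits if h.confidence >= min_confidence]
    return sorted(kept, key=lambda h: h.confidence, reverse=True)


# Hypothetical predictions scattered across a long document.
hits = [
    Hit(page=4, text="candidate A", confidence=0.31),
    Hit(page=27, text="candidate B", confidence=0.88),
    Hit(page=53, text="candidate C", confidence=0.93),
]

for h in rank_hits(hits):
    print(h.page, h.text, h.confidence)
```

Because every page is scored, a later hit (page 53 here) is not missed just because an earlier one (page 27) looked good enough to stop at.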
So the machine doesn't get tired, it's going to review every single page. If you want to get some people asking if we can only do a couple left, it's going to look at everything and show you there's some on page 27 And some page 53. And a lot of times people might stop at 27. And they'll go through 53 Because they think they seem they want. So all these are kind of reasons where you know, we can succeed, we can speed things up, we can do things that humans don't do either Well, or they don't do fast. Either one is a good opportunity for documenting I use case. So last bit document last couple of both the same kind of thing, really document availability and variety. You know, we're going to play around with a document set that has five documents for speed reasons, we got 100 in that we'll send you if you want to play with it later. But document availability, if you're if you if you don't have access to documents, it's going to be very tricky to work in that use case unless there's already a model that exists. And so we have some supply chain models to start because we've seen enough of those use cases. There are occasionally some out there you don't find a high prevalence of like hugging face models that you can just use for what we're doing here. But occasionally they do exist and that's the way document is moving. As we see more and more we'll see more and more pre trained models, but the majority of our customers you know the supply chain use case the different aspects of where they're at The supply chain, some of those are the one of those elicits contracts very different

from invoices and receipt. So even with familiar, you know, documents and processes, document types, we still have slight differences with our customers and what specifically they want to extract or how they get it, you know, certain models may work or not work for them. So, so we'll see both sides coming up, we'll get better and better retrain models as we go. But people looking for custom things is sometimes a pre trained model may start them off in the right area, or there may just not be what we did not have access to a pre trained model for physician referrals. But they had been using a template process from the start, they brought in a lot of data extra 250,000. So train pages, I think we even did that later on a second round with more data, more data, but a lot of data to get that going. So your experience may vary. So document availability and variety. So is it just a few, then that's going to look like a template nice is going to train quickly. And we can handle those two, you'll see, as I talked about last time, you know, one of our differentiating factors, as we've talked about is this custom, many argue talking about the custom schema that people want. So with another really popular Avenue with documenting AI is, is working on templates. And so a lot of the vendors out there just aiming specifically a template, if you bring us the same document, every single time it moved, the content changes, maybe occasionally, you get a little bit of variety with how that form looks. But it's generally the same template all the time. Your strategies are different, we're using something that that goes for goes for the variety that a template method generally cannot handle. So we are backing it, we have some template methods as well. So that we can succeed at an even higher level for some of those, but we still work it through as a custom problem each time. 
So the document availability, you know, can we get access to those can be trained those can annotate those, and the variety just to be understand and use ESS referral case, they don't control it probably likely 1000 different cases, there's a lot of provider groups around the country, we saw this with the audit use cases with the banks, you get a typical Pareto type principle not quite at 20. But you know something where but 1000s of banks out there and every single bank statement you would get or audit and confirmation would look differently. So last one available targets. Are there any usually there are not most of ours, they are not. So that's again, I will focus on that with what we've with the tool. But if there are some if there has been an incumbent template process or something like that someone's got some business rule process that's been trying to get close. And it's been refined by hand over time, you may find yourself that there are targets available. So that's a question most of the time for us, they are not. So these are some of the things to think about. You can put other things I have covered with this in the chat. We should have a few minutes for chat last time we did this on the session one sec quite new to video, but there was a lot of good questions. So I do intend to, to answer some questions in chat. If I have anything else you want to ask about. Would this be a good case for document? Would this be a good case for with h2o document AI is the tool so you'll see here document as an overall industry but still new and being defined in different ways and people have different takes on it. But I think everyone would probably say that this question isn't just OCR. It isn't. So if I processed and to get one on the screen, it should show someone a backup. Okay, so if we have this referral form over here, isn't it just OCR? 
If it were just OCR, we would be taking all the tokens, and first of all, we're reading things left to right in an order that's not straightforward. We're going to say "Fax completed form to", "Date of referral", "June 24", and only then get to where the fax completed form went: "Outpatient Psychiatric Clinic". If it's just OCR, especially standard OCR that doesn't even use the pixels, where I just OCR it, copy and paste, put it somewhere, and work with it as text, it can be done, but you need something to go through that. So what I think people really mean is: isn't it just OCR and business rules? You have to have at least that, because if it's just OCR, you couldn't figure out anything about this document, really; you just have the text. So taken literally, with just OCR, you just have the text. You pile on top of that, and that's where we mostly see people trying, especially those that have tried and failed: "Well, I tried some business rules that seemed easy. I tried named entity recognition, like a spaCy model, to find the date." But there are multiple dates on this form. I've got November 11, 1977 here, the date of birth. I've got June 24, 2015, the date of referral. There are probably other dates on here. We commonly will get multiple

named entities the way we typically think of them from traditional NLP models, if we're just classifying things as a type of person pronouns, things like that place, phone numbers, you can usually find this out from existing models that have been there for a long time. You know, dates especially, it's a pretty easy one to go after. Um, it’s not as easy to think when you write that code, sometimes to be really precise about it. But that doesn't tell us which date is which. And so you can try business rules that go hunting around left or right, looking at things, but there's certain things that a lot of times the form, especially when you look at things from the header in the form, like an invoice number, something like that doesn't even have any context. Sometimes it's just a number sitting out there in the open. And we all sort of understand when we work with these documents, that's the invoice number, okay, it's fine as the ID number for the document, or here, this is the clinic who's sending the referral, because that's who filled it out. So these sorts of things, this is where the model comes in, is trying to separate these with many candidate things in here, if you try to write these business rules are very difficult. If you try to write models that are not custom for this task, you can usually get to a certain threshold. And sometimes maybe that threshold is good enough to get started. But the models here really are what are allowed to surpass that. And I say that and the first talk, we're talking about the first part of this make session, talked about that a little bit. We use the robin model for a while, you know, we were new to this space, when it was brought to us a few years back, we've got experience with this. And we tried that way, not necessarily just OCR, we knew there was a modeling behind it. But we put models that weren't customed to this task. And so that brought our ceiling higher than is it just OCR business rules. 
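The point about multiple dates can be illustrated with a short sketch. It assumes the OCR output has been flattened to one string; the field captions in `ocr_text` and the `find_dates` helper are hypothetical, and only the two dates come from the form discussed above. A generic date finder locates both dates but has no way of saying which is the date of birth and which is the date of referral.

```python
import re

# Hypothetical flattened OCR text; only the two dates are taken from the talk.
ocr_text = (
    "Date of birth: November 11, 1977 "
    "Fax completed form to: Outpatient Psychiatric Clinic "
    "Date of referral: June 24, 2015"
)

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Matches dates written like "June 24, 2015".
DATE_PATTERN = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")


def find_dates(text: str) -> list[str]:
    """Return every date-like string, with no notion of what each date means."""
    return DATE_PATTERN.findall(text)


print(find_dates(ocr_text))  # ['November 11, 1977', 'June 24, 2015']
```

Both dates come back as an undifferentiated list, which is the gap that a model using position and context is meant to close.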
But using the right models was necessary for our use case. It's not guaranteed, but we have the models that can do a better job, and that's what we're putting in place. With that, let's flip over to the demo. Okay, one more slide here on setting this up, for those following along now or who will follow along later, which I gather is most people. For login, if you're following along and you beat me to it, you'll be ready with this. With the free trial, you can go here and fill in a pretty simple form for a free trial of H2O products. That's how you get access to our demo cloud, so you can try this for yourself. It can be accessed through this link right here: document AI Cloud dot h2o dot AI. If you have the regular cloud and you're familiar with that, or you're familiar with Driverless AI, this is a bit different; that link will get you straight to Document AI. We're being cautious about how we push that out there as we gain more and more. It's still an evolving product, and I talked about that last time. The product is new to us. This user interface we're going to walk through really started taking shape just this spring. We had the library behind it for a year before that, and arguably we were doing these use cases for another two years before that. So that's the chronology. We're still iterating on our front end, and one of the next releases will have a different look and feel: a similar idea overall, but with some things cleaned up. The documentation is really key, so I'm going to follow along with it. I'm not going to bounce back and forth between the two, but it walks you through the same thing I'm walking through here, and that's the way to get started.
I'll add some color to it. You can watch it in action, and you can try it yourself, of course. But if you get lost, or I'm not covering something, our documentation will cover it. There you have the link straight to it; I'm sure there are other ways to find it. Of course, the data is a very big part here too: documents and annotations are available. There's a Google Drive link there that should be publicly available. I haven't battle-tested that, but it should be accessible. There are two key files we're going to use: the document set and the annotation set. We'll get into all that, but those two files drive this. After this, I'm also going to put up a bigger dataset that you can use, which is the same thing, just more of them. And then the key pieces of what we're going to go through: we're going to load those documents, so you can see what you can do from there. The OCR branches off from those documents; we'll see that in action. Then we work with the annotations, which, as I said last time, is one of the key parts of working with documents as data.

Then we're also going to train a model. It won't be the greatest model you can ever build; it's not very good. But you'll see it start to learn from just the five documents I'm giving you. We did that so people can keep up, and so I can do it in time, so I'm going to get on with it or I won't finish. That's why we're going to give you the 100 documents as well, so you can train a model that performs about as well as what you saw on the earlier screens. Then there's deploying a pipeline. The deployment pipeline is a really important part of this, and the REST API is the way to consume it. As a data scientist, you might think most of the time is spent on these three steps, over and over: add documents, add annotations, grow the model, and keep making it stronger and stronger until it's sufficient to deploy, and then continue to grow it over your use case. But at the end of the day, you generally want to expose it through this REST API. So, without further ado, let me get into that. Okay, here is Document AI. I'm using an internal cloud; we have several different clouds here at H2O. There is one we're advocating for, but I'm going to use this one for now. When you log in, it's going to take you to a screen quite like this, except you won't have any projects; your space is your own. So the first thing to do is create a new project. This is the screen you will have on yours. A project is really just a container, and you can organize that how you want; we usually keep a use case together in one project. You can rename projects, use them later on, delete them, and move them. You can train more than one model in them, so a project doesn't have to hold a single model either.
It's really just a collection of documents that are kind of similar, so we organize these by document type. This keyboard's a little loud. As I said last time, I usually don't drop files in here; I usually go one more step, but almost everybody wants to, so let's go ahead and do that. I've got these medical referrals, and I'm uploading my documents. A lot of people just want to get started and have one document, or just something, ready to start with. That's a key point here. I'm going to click this, and I'll talk you through what to select. Here's the larger example, but here's the example referrals file: 6.5 megs, with just five things in it. It's small. This part is important, though. The supported files are PDF, images, and zip. We have a new way coming that won't have the same limitations, but right now we need a single file. It doesn't have to be a single PDF or image; it can be a zip file, and traditionally it is, so you can load multiple documents in one shot. However, that zip file needs to be a clean zip of just the documents. We have a new version that releases that constraint, but right now that's what you need to load, and the one we've prepared for you does this. Effectively, Ctrl+A your documents and zip them, or zip them with a command line, something like that. What I mean to say is: right now, don't zip a folder. That's a natural thing to do, and we're working around this, but we need to see a zip file of just the documents. So I'm going to create my project. This will be empty for you. I have a few others, so I can show you some things that we may not have time for otherwise. I have a couple of test ones here; I think "make training" is the one I just used. Here we see the same kind of principles I've been talking about, so I don't want to belabor this too much.
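For those preparing their own upload, here is a minimal sketch of building a "clean" zip with Python's standard `zipfile` module. The function name `zip_documents_flat` and the folder and file names in the usage comment are hypothetical; the only requirement taken from the talk is that the archive contains just the documents, with no enclosing folder.

```python
import zipfile
from pathlib import Path


def zip_documents_flat(doc_dir: str, zip_path: str) -> list[str]:
    """Zip every file directly inside doc_dir with no folder prefix.

    Write zip_path outside doc_dir so the archive does not try to include itself.
    Returns the names stored in the archive.
    """
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(Path(doc_dir).iterdir()):
            if path.is_file():
                # arcname=path.name stores "referral_1.pdf", not "folder/referral_1.pdf"
                archive.write(path, arcname=path.name)
                names.append(path.name)
    return names


# Example usage (hypothetical paths):
# zip_documents_flat("medical_referrals", "medical_referrals.zip")
```

Zipping the folder itself from a file manager typically stores the folder name as a prefix on every entry, which is the case the talk warns against.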
We've got document sets. This one is processing, and you'll notice that for a few reasons. The status has a timer; it will move to a check mark when it's done, or to an exclamation mark if it hasn't gone well. It also shows the number of documents and pages as zero. That's not true, but it's something you learn: when you see zero and zero there, or a blank, that means it's still processing. So there are several clues that it's still going. And it just switched, so now this is accurate. Actually, everything switched: my last modified date changed, and I've got a check mark. Throughout the user interface, you can click on this and you'll get the same modal screen. Mainly it takes you to your logs. I didn't even bother typing a description in. You can change the name here, and delete; you can do most actions from that page. I myself generally click on these and use the buttons at the top. There's not much you can do with the document sets, but let me show you the annotation sets, where you'll see more actions at the top. Again, the same buttons are accessible through here, so use whichever you prefer. I think most of our internal people prefer clicking on these one by one. So let's do that.

From the document sets, there's really only one thing to do; most of your time is spent on the annotation sets. So let's get on with it, since I'm running late on time. Here are the OCR methods we have available. I'm going to take those documents and execute the OCR. I'm not even looking at the documents yet; we'll back up and show that, but let me kick this job off first. I'm going to run Tesseract here; it's a little faster. We also have docTR; I'll talk about those in a second. So, with the result name "example tokens Tesseract", I've chosen my OCR method, I could type a description, and I click the OCR button. That launches a job. This is a little back and forth, but it lets us get everything going in a row.

On the Jobs page, the first thing I did was upload, and I could have checked that job while it was processing. Now I'm doing exactly that. You see a job of type OCR. Most of these will be transparent; sometimes there will be one that just post-processes documents and does a few things. You can tell by the name, and you'll see all the clues up here that right now we are OCRing and this job is live. You can actually see it has completed, so it's probably about done; there's a little bit of wrap-up on the Kubernetes job. If I click on it again, it's still pending, waiting to finish up. Let me get one more thing ready so I don't lose time. I've given you two things. I've loaded the documents; the other portion is an annotation set. This is where we have the answers, and we're providing one for you.

There we go, and Import. This won't take long, but at least it gets going. We've provided some targets, so we don't have to click on the screen and label. Even five documents with just a handful of classes doesn't take long, but it takes long enough. So we've loaded those targets. Now let's check the jobs: we've got three. The import is running now and won't take very long. The OCR job didn't take very long either: 52 seconds. There are a couple of things to note in here, if you're really paying attention. You'll see that the workload started, and at the end you'll see the workload finished somewhere in here. This is some of the Kubernetes finishing up. We have some timings in here, and you also have the created date and the started date. In the event that you have GPUs in your cluster, we're sometimes careful with those. This is designed to work with Kubernetes, so if it's been a while since you've used the GPU, we may have to queue up a new worker that was paused. It can take a minute or two, or even three sometimes, to turn on a new worker and get that going. We try to keep the cloud bill down that way. If it has stayed hot, say we run it again 20 minutes later, it's going to be ready to go. I say that because you may see a difference between the created and started dates. Here they and the log timings all match, because this one was ready to go. So just a slight note that there are a couple of components to getting this to run. The good thing is that we can run multiple jobs. Everything, again, is designed for Kubernetes.
It's designed to work on GPU. I ran a CPU model, Tesseract, rather than docTR. On the OCR methods, in case I get squeezed for time: there's a better explanation of what those methods are in the original video, part one of this. I think we've got time notations called out in there too, so you can skip to it. But really quickly, this blurb also describes what's going on here. We're going to add to these as we bubble up the engineering behind this. We're going to add PaddleOCR; Paddle is a really strong library with some really strong pre-trained models. So we are making changes to these OCR methods. We've refined the docTR models since August, I think, since we last did this. Best works the same way. Let me describe that for practical purposes. You really have three different single choices here: two are OCR, and PDF text extraction just gets the text out of the PDF. Best is a way that dynamically checks the PDF to see if it has embedded text. A lot of times people aren't sure whether any given file, particularly in production, has embedded text in it. So we want the fastest possible, error-free way: use the embedded text, and only use OCR if we need to. OCR, optical character recognition, is done with computer vision models. Tesseract is slightly different in the way it processes; it's a really good off-the-shelf library that we won't have time to talk about. docTR is more standard computer vision, for those of you who are familiar with convolutional neural nets. And it's not just a simple classification. For the OCR process we do two phases, or three if you count checking rotation. The main process starts with finding all the tokens on the page. Tokens are space-delimited; you can think of a token as a word, but we say token because ID numbers and things like that are not really words.
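The "Best" idea described above (prefer embedded PDF text, fall back to OCR only when needed) can be sketched as follows. This is only an illustration of the control flow, not the product's implementation: `extract_tokens_best`, the `run_ocr` callback, and the `min_tokens` threshold are all hypothetical, and the real rule for deciding whether embedded text is usable is not described in the talk.

```python
from typing import Callable, List, Tuple


def extract_tokens_best(
    embedded_text: str,
    run_ocr: Callable[[], List[str]],
    min_tokens: int = 1,
) -> Tuple[List[str], str]:
    """Return (tokens, source), where source is "pdf_text" or "ocr".

    embedded_text: whatever text was embedded in the page ("" if none).
    run_ocr: stand-in for a slower OCR engine; only called when needed.
    min_tokens: assumed threshold for "this page has usable embedded text".
    """
    tokens = embedded_text.split()  # space-delimited tokens
    if len(tokens) >= min_tokens:
        return tokens, "pdf_text"
    return run_ocr(), "ocr"


# A page with embedded text never triggers the OCR callback.
print(extract_tokens_best("Date of referral", lambda: ["unused"]))
# A scanned page with no embedded text falls back to OCR.
print(extract_tokens_best("", lambda: ["Outpatient", "Psychiatric", "Clinic"]))
```

Passing the OCR step as a callback keeps the expensive path lazy, which matches the stated goal of using OCR only when the embedded text is missing.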
A token is a space-delimited piece of text, whether it's one character or multiple characters. We're not going to box every single character by itself, which may differ from how you're familiar with OCR. Once we find all those boxes on the page, we send them to what people think of as OCR, which is recognizing what that token is. So let's put some on the screen. You have those choices, and you can play around with them. I actually have fun with that sometimes, trying different engines and seeing the differences here and there. What it will look like is this. We have lots of boxes. We're trying to read everything on the page, and we see what the value is. That first-pass algorithm found all these boxes. Even though this looks really clean, and it probably started as a PDF, I actually sent it as an image. I've supplied all of your documents as images. So we do have to use character recognition here. It's a pretty clean document, so most of this is going to be fairly perfect, but we actually had to invoke the OCR to read it. Plus, I explicitly called Tesseract anyway, but even if I hadn't, OCR would have been needed.

So this is what we get from the OCR. For practical purposes, there isn't much to do here except observe how well the OCR works. This can be part of checking out a new use case, and it's something I do a lot: fire some documents in and judge the OCR quality. You can get very specific. If you press the spacebar, you can see all the text at once. It's a little cluttered on the screen, but you can glance through and see whether it's getting things mostly right. These seem pretty sane; they look very good. From here, you can hunt around and see what it is. Your mileage may vary by use case. Do you need the OCR to be really close to perfect? If so, evaluate that and be really careful looking at this. Or can you handle a few errors here and there, or perhaps even a lot? It's surprising sometimes. If we have handwriting and a blurry document, we do a good job, but nobody's perfect on that. You can benchmark against the state of the art on these; you just have to go in understanding that it's not going to be perfect coming out. So that's the main thing to do here as a preliminary step when you're evaluating a use case as a data scientist: look through and understand whether this looks pretty good. I can also update things. I don't really have anything to update, but if I thought anything was wrong, I could go in, change it, and save it. This is a live annotator, so I can do whatever I want. If I thought this box didn't need to exist, I could delete it. Sometimes it's practical to try to help the model a little bit more if there's a high mistake rate. But you have to keep in mind that when you deploy the pipeline, these errors are indicative of what you'll see.
So you have to deal with those and look at it as one bundle: what comes out at the end, how accurate is that, is that accurate enough for my use case, are we saving time? That's generally what you're thinking about. So, really quickly: I've given you these annotations, so you have them yourself. I didn't talk about this much, but again, there are two ways to do just about everything with the annotations. Now I'm going to click on it and choose Edit in Page View, which takes you to the interface I'm showing here. This is our annotation tool; it's also just the way to display things. Like I said, it's live, so everything's moving around, but that's fine too. I can click on this box, I can extend it, I can do all the things you might expect from a live annotation engine. This is a good one. Our team played around with several, and this is a really good one for documents, because we find that little characteristics matter. At H2O, we're also looking at the annotation side with general tools like Label Genie, which you may have heard of. But this one is customized for Document AI and the task we're always doing: we're zooming into documents a lot, in really fine detail, with small boxes, and we're often carrying multiple pieces of information. Not in this one; here I've just got my label. But a lot of times we'll pick up multiple things about a document at the same time, so eventually we'll have the text and the label. Here I've switched: last time, when I clicked on something, you'd see the OCR text. With these annotations I've skipped that. I'm not trying to label every single box; I'm trying to label just what I want, and I pick a label here. This has already been set up. For time, I'm going to go a little quickly on this one, I apologize. Again, our documentation will take you through this. Let me glance at it real quick; I've got it up.
You can follow along with it. I find that the beginner tutorial is the best place to start; it's aimed at exactly that, and it uses the same documents we've been looking at. So let's flip back here. We've got the annotations. That was easy, because I've given them to you, so you don't have to create them from scratch. But I'm going to show you how that can be done really fast, by going back to where we started, when I imported the documents. At that point we pre-set up sort of a dummy annotation set, you can think of it that way, with all the documents. It's useful for just looking at the documents: seeing what was even in the zip file, looking through them, understanding them, thinking about how you want to strategize the use case. We've created this one and it's empty. I'm not going to use it, because I've already imported some annotations, but if you hadn't, you would start with something that's empty. A note on the attributes: text is the direct result of the OCR. Label and class are two different things. Class supports page classification; you'll have one class per page. Label is what we've been looking at almost all the time. That's token classification, so multiple boxes on the page. You can have just one if you want, but structurally it's a little different. So if I were to start from scratch, what would I do? I'd go to Edit in Page View, I'd see the same documents, and I would start down here in the Attributes section. File attributes can be used for class. I typed it in; I can type anything I want, but we're working on mistake-proofing this. "class", all lowercase, and "label", all lowercase, are the right names to drive the model; the model is expecting "class" for page classification. I can add it here and change what I want. I usually use a drop-down; a lot of people like radio buttons. So, "doc type one", and I can put in a description.

"Doc type two" is here. I've been using the one where we just label boxes; that's a region attribute, where we click on boxes all over the page. I purposely set this up here to show how quick that is; the setup is already ready to go. The way to label an entire page in a single shot is to hit the spacebar and go to the file annotations. You'll see these two things I just typed in; as you change this list, the options become available immediately. So you can just set these and move through document after document. This is one, you can see; I move on to document two by pressing the right button here, and so forth. You can quickly annotate a lot of documents this way, if you're just trying to say what each page is, almost like classification in a normal image model. So we've set this up, and we've given you these labels. To train a model, which is usually what we want to see, we do one task first: we apply the labels. We've been working separately, if you will, and it's usually faster to do it that way. Let me get this going first; Apply Labels shouldn't take too long. We've got the OCR sitting in this annotation set, and a different annotation set stores our labels. What we're going to do is essentially merge them together: we run a union of where all those label boxes were on the page with all the tokens that are underneath them. You don't have to work that way; you could label straight from the OCR results. But doing it this way is usually faster, because we only need a little bit of the page. There are several different ways to do this, but that's the way I prefer, and most of our labelers prefer it too. What it means is that you're trusting that the OCR tokens will be underneath the label boxes and will overlap them.
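The merge described here, label boxes combined with the OCR tokens underneath them, can be sketched as a box-overlap join. This is an illustrative assumption about how such a merge could work, not the product's code: the `Box` type, the `apply_labels` function, the 0.5 overlap threshold, the coordinates, and the label name `referral_clinic` are all hypothetical. Tokens that fall under no label box get "O", and the output is one row per token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float
    value: str  # OCR text for a token box, class name for a label box


def overlap_fraction(token: Box, region: Box) -> float:
    """Fraction of the token's area that lies inside the region."""
    w = min(token.x1, region.x1) - max(token.x0, region.x0)
    h = min(token.y1, region.y1) - max(token.y0, region.y0)
    if w <= 0 or h <= 0:
        return 0.0
    area = (token.x1 - token.x0) * (token.y1 - token.y0)
    return (w * h) / area if area > 0 else 0.0


def apply_labels(tokens, regions, threshold=0.5):
    """Give each OCR token the label of the region covering it best, else "O"."""
    rows = []
    for tok in tokens:
        label, best = "O", threshold
        for reg in regions:
            frac = overlap_fraction(tok, reg)
            if frac >= best:
                label, best = reg.value, frac
        rows.append({"text": tok.value, "label": label,
                     "box": (tok.x0, tok.y0, tok.x1, tok.y1)})
    return rows


# Made-up coordinates: one hand-drawn label box over three OCR tokens.
tokens = [
    Box(100, 50, 180, 62, "Outpatient"),
    Box(185, 50, 270, 62, "Psychiatric"),
    Box(275, 50, 320, 62, "Clinic"),
    Box(100, 20, 125, 32, "Fax"),
]
regions = [Box(95, 45, 325, 67, "referral_clinic")]

for row in apply_labels(tokens, regions):
    print(row["text"], row["label"])
```

The sketch also shows why OCR quality matters for labeling: if the OCR misses a token or places its box outside the label region, that token silently ends up as "O".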
So that's another thing to look out for when you review the OCR results. Here is the fusion of both of those. This is the other way you can get to the page view. Now I'm carrying all the elements: I've got the label, I've got the text, and some internal fields that we have. If I hit the up button, I can see my classes here; if I hit down, it changes color. It was already on the one that makes the most sense. Both of those are good clues to what you're doing. B takes the boxes off, L takes the labels off, and you kind of need to do all of those at certain points, so it's nice to be flexible and change things dynamically and quickly. We can see from these views that this is the output we want. This is the outpatient psychiatric clinic, labeled as the referral clinic: three different tokens, all classified the same way. I imagine that if I went back and looked, it was all contained in one box to start with, but now we've merged it with the OCR tokens, because that's what the model expects. The model wants to see all these tokens individually; you can think of it as almost a row per token. That's how the model sees it. What it has in there is this label, and also the coordinates of the box. We're not showing the coordinates, but you can implicitly see what the coordinates of these boxes are. That's what the model uses. It's a layout language model, which I talked about in part one. This model is working from all those context clues. It has been pre-trained on 11 million documents, so it understands the structure of documents. It's a natural language model, and it has been pre-trained in a BERT style, for those of you familiar with that. It's a replica of the BERT algorithm applied to documents.
It's going to arbitrarily hide tokens and learn the structure by predicting what the hidden token is. That's the way this has been pre-trained. We fine-tune it to do the tasks we want, which here is to learn that these three words are the referral clinic. Or, if we did page classification, it would look at all the tokens and classify what this page is: an actual referral versus a cover page. Clinical notes and all sorts of other things can be bundled into a common PDF as different page types, so some people want to divide those out as well. You don't have to. This model can also be trained to see other types of pages and avoid them. That's a question we often get, and you can do either one. It may be more efficient sometimes to route pages to separate models; if you need multiple models inside one document, you can route them. But actually, a single model will know to avoid the cover page and the lab notes if it has enough data. So here again, we're going to try to train this. I'm going to kick this off so I'm not squeezed for time.

With Train Model, we've got a token labeling model type. This automatically filters to what we have available. "ABC" is the only one that's capable: it looks through these four datasets, and even if I hadn't clicked on it, it would know that this is the only one available. Then there's the prediction annotation set. This is sort of asking, do we want to validate the model on data? I can split this. There are only five documents here. You split ahead of time; we're working on fusing these steps together to make it a little easier to work with. But right now, if I wanted an 80/20 validation split, I can do so by going through Split. Here I would have two splits, by percentage; I'm not going to split it evenly. These all work. So I'm going to have four documents and one. This goes quickly, and then I can train on one split and validate on the other, the typical way you want to run machine learning models: check that it works on data the model has not seen. Training on four and validating on one is not going to work terribly well, but you get a model. In fact, let me skip straight to it and show you what we got. We're going to view the model results. Here's the OCR and split two, and there's an accuracy tab. I'm going through it quickly; you can pause the video, and the documentation walks through this as well. Here, at the end, I'm going to switch this to top two. We see only some of these, on four documents, even though the model has been pre-trained. The other examples we have use internal data that I didn't want to share, so here is something we can share: we filled these out ourselves, on forms that are online. As for how many documents we've typically worked with, a common use case starts around 20 to 50.
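The 80/20 split mentioned above, which on five documents gives four for training and one for validation, can be sketched in a few lines. This is a generic illustration rather than the tool's own split logic; the function name, the seed, and the document IDs are hypothetical. The one design point worth keeping is that the split is by whole document, so tokens from the validation document are never seen during training.

```python
import random


def split_documents(doc_ids, train_fraction=0.8, seed=0):
    """Shuffle whole documents and split them into (train, validation)."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    # Keep at least one document on each side when there are two or more.
    n_train = max(1, min(len(ids) - 1, round(len(ids) * train_fraction)))
    return ids[:n_train], ids[n_train:]


train, validation = split_documents(
    ["referral_1", "referral_2", "referral_3", "referral_4", "referral_5"]
)
print(len(train), len(validation))  # 4 1
```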
The more data you have, the better the model works, but the more uniform your data, the quicker it will work, and the more it may follow typical patterns that our model already knows. Sometimes we do have to hit the hundreds before we get accuracy that's fit for a use case, if it needs really high accuracy. This is more about how you value the use case. What is the tolerance for errors? How good does the model need to be before I can deploy it, or even iteratively get something out there and test it, to see if it's faster, in a shadow environment or something like that? Maybe the bar is a little lower than you think. The highest bar we've had was one that needed to be 95%, and it took us a long time to go after that one. Collecting data and retraining the model was not what made it take a long time; it was struggling through, trying all the methods, and eventually getting there. In that use case we had about 1,400 documents. For the UCSF referrals, they had 150,000. So there's no right number for these; we've just found that the more you have, the better off you are, and the more uniform they are, the easier it is. You can almost think of it in human terms: if you were to train somebody, how easy would it be to do this task? It's easier if it's all pretty uniform, if the dates are all in roughly the same place in a document, if you have one page versus lots of pages. Lots of "nothing to see here" can make it a little slower for the model to learn what is nothing compared to what you really want out of it. Here we can see this model; even though it hasn't learned much, looking at the top two it is starting to learn the referral reason and the referral clinic. We've got two different ways of looking at the metrics. This is a classification task, so we're looking at the F1 score, which is the harmonic mean of precision and recall. Support is important: that's the number of tokens. It's obviously correlated with how much data you put in there.
It also depends on how many targets it was supposed to find, and on how many tokens are in each target. A date can be three: December 15, 2021 will be three tokens. So this is the number of tokens, and they are down here as well. We're validating on one document, so there are three for the date of birth, just like I said, and only one for the patient gender. We see F1 scores, and there's the confusion matrix too, if I want to see where things are being learned. It's got the referral date and referral reason right when I switch to top two, but it's still missing over here. The O is what we're mainly looking at: in that column, it didn't predict anything, so it's not sure what the token is. That's because in most of these models, we're not labeling everything.
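The metrics discussed here can be written out as a small token-level sketch. It is not the product's scoring code; the function name and the toy predictions are hypothetical. Each token is one row, "O" is the unlabeled class, support is the number of true tokens per class, and F1 is the harmonic mean of precision and recall.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Token-level precision, recall, F1, and support for each class."""
    tp, fp, fn, support = Counter(), Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        support[t] += 1
        if t == p:
            tp[t] += 1
        else:
            fn[t] += 1  # the true class was missed
            fp[p] += 1  # the predicted class was wrong
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[label] = {"precision": precision, "recall": recall,
                          "f1": f1, "support": support[label]}
    return metrics


# Toy validation page: a three-token date of birth, a one-token gender, two "O" tokens.
y_true = ["date_of_birth"] * 3 + ["patient_gender", "O", "O"]
y_pred = ["date_of_birth", "date_of_birth", "O", "patient_gender", "O", "O"]

for label, m in per_class_metrics(y_true, y_pred).items():
    print(label, m)
```

In this toy case the model finds two of the three date-of-birth tokens and predicts "O" for the third, so that class gets perfect precision, recall of two thirds, and a support of 3.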

We're only labeling the things we want, so most tokens don't get labeled. That's a pretty important factor in what the model is learning from. The more data you give it, the more it gets off of that "I'm not sure about anything" line and starts to learn things. You can track that with some of the metrics we have in here, even if it's really early, as it is here. Generally, we would say you probably want to put in about an hour or two of annotating before trying one of the models. It's a funny unit of measure, I know, but it's not about a certain number of documents; it's about getting the hang of it, because a lot of times during annotation you may change the way you want to label things. That's usually a good starting spot: use the annotation set, then try a model and just see. With as few labels as we have here, it takes almost no time at all once you get the hang of using the labeling tool. So getting that up to 20, 25, or 50 documents would not be hard for this use case with the targets we have. That's what I would suggest, and the model will keep learning while you watch the validation metrics.

In the judgment use case I talked about, we had to look at the very first model just that way. They had about 50, maybe 100 documents in, and it was at zero to start; it was a very difficult problem to get started. But they put more documents in, and they could see it learning when we looked at the top two, just like I did on screen, and that gave them the faith to put more documents in. It hit the 90s, as I already said: precision and recall were both in the 90s range, one in the mid-90s and one at about 90. For them, that was really strong. It really helped their ROI: it's about four times faster than the incumbent process, where they had a few people doing it. So a lot of use cases will move off earlier. I imagine I'll use what we've just started for this Make session to create datasets.
You'll see it on Google Drive here as these two documents, but for public datasets we'll start to use that and add some more so that you can learn faster. We'll get the 100 in, and the annotations for them, so you can see models that get stronger than that.

I've gone over my time budget quite a bit, so I'm going to stop sharing and take a look at the chat. If you're able to hang on, I'll stay on for a little longer; we do have a little buffer on this meeting. If you want to leave something in the Q&A, feel free to do that. Thank you all for hanging along.

There's a question already in the chat: do all the documents have the same format, the medical referral form? Very good question. Let me share to show you. No, they don't. That was kind of an important one, and I didn't show it enough. Let me share my screen. Okay, here we go. If we go back to our annotation sets here, I'm going to just show you the training ones real quick; that'll be fast enough to show. Do they have the same format? They do not. So here are these guys. Here's a form, here's a form, here's another, so they don't have the same form. You'll see with the 100 that it's clusters of those, just because that's what we got and filled out. But in the real use case, they're all over the place. There are three different forms just in this little training set, and the validation one, I'm pretty sure, is a different one. That's important to know across most of these. Let me do the same thing there. Sorry about that. Here we go: this one. So you do have different ones. It's a great question. If they were all the same, the model is going to fit that document pretty well.
So you just have to think about it like every machine learning problem: I want my test set or validation set, ideally both, to be indicative when I'm judging the accuracy of this model for my use case. If I deploy it, and I've given it 500 of these Medical Center of Aurora ones and 500 of maybe those Blue Cross ones, then if I'm going to get one of those two, I'm in good shape. If I'm not, if I don't know where that next one is going to come from, I'd better be pretty careful, and I should really throw as much variety in there as possible. That's really important. For some of our use cases, and we've seen this with BSA bank statements, it was really easy for us to get data en masse the first time. We got the big four banks, and a few of them have different formats, so we had 1,000 of them almost immediately. Especially if people have been working template-style, that's where you're going to get your labels: they started with templates, and they just got tired of maintaining them. That's really the UCSF story. They've been maintaining them for two years. Now, at maybe 100 or so templates, they recognize that those still didn't cover 40% of their data. And the trouble with the templates is that they had no clue about that other 40%. What we're training here is a model that generalizes even if it has never seen that document. As we grow this model, it will understand the clues about how the referral date appears, which is going to be different for each of these. Let me go back into the classes so you can see a little faster. So here we have the referral date.

Over here is the referral date that we gave you. So there's my date of birth; the referral date is not on all of these, and that's okay too. Sorry, I haven't really focused on this; there are a lot of little intricacies we still haven't gotten to in here. This is very common. With the UCSF one, they're going after over 100 things, but on a given document they usually only have 30 or 40. They've just set it up for the different ways, shapes, and forms in which different providers send them. So it's okay to not have this. It's pretty common to not have those on every page, or even for a whole document to not contain a certain piece of information. So yeah, that's a key aspect of these.

You can use templates, and Document AI will learn those quickly. We've had a use case like that, where OCR was really the challenge. So we put our own OCR in, which we had to innovate to get. The format, at the end of the day, was the same every single time. With some of those air waybills, it's a common template; every carrier does it slightly differently, but it's a similar form. So the OCR is the piece that unlocks it; it's a very difficult task to do in that one. So it can be templates, but really where we're situated, H2O Document AI especially has been built to handle cases where you have high variety: you don't see the same documents, and you might not even know the format you're going to get once this thing is in production. The key thing to think about is that while you're creating these validation splits, you do it so that you have the kind of redundancy you expect to get. If you do expect to get the same ones, have the same ones in there, probably in proportion to what you see, so that when you evaluate the accuracy, it's about what you expect in the wild.
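To make the point about validation splits concrete, here is a minimal Python sketch of two ways to split documents by form type. The `form` field, the form names, and both helper functions are hypothetical, invented only for this illustration; nothing here is part of the H2O Document AI tool. The first split keeps each form in the validation set in proportion to how often it occurs; the second holds out one form entirely, to estimate accuracy on a format the model has never seen.

```python
import random
from collections import Counter, defaultdict

def proportional_split(docs, key, valid_fraction=0.2, seed=0):
    """Put roughly valid_fraction of every form type into validation."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc)
    train, valid = [], []
    for _, members in sorted(groups.items()):
        rng.shuffle(members)
        n_valid = round(len(members) * valid_fraction)
        valid.extend(members[:n_valid])
        train.extend(members[n_valid:])
    return train, valid

def holdout_form_split(docs, key, held_out):
    """Keep one form type out of training to mimic a never-seen format."""
    train = [d for d in docs if d[key] != held_out]
    valid = [d for d in docs if d[key] == held_out]
    return train, valid

# Hypothetical collection: three referral form layouts in a 50/30/20 mix.
docs = (
    [{"id": f"a{i}", "form": "form_a"} for i in range(50)]
    + [{"id": f"b{i}", "form": "form_b"} for i in range(30)]
    + [{"id": f"c{i}", "form": "form_c"} for i in range(20)]
)

train, valid = proportional_split(docs, key="form")
valid_mix = Counter(d["form"] for d in valid)

unseen_train, unseen_valid = holdout_form_split(docs, key="form", held_out="form_c")
print(valid_mix, len(unseen_train), len(unseen_valid))
```

Which split to trust depends on production: if the same senders recur, the proportional split is closer to what you will see in the wild; if the next document could come from anywhere, the held-out split is the more honest estimate.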
So sometimes, as in the referrals, they are going to see the same vendors over and over again; they're going to have those 100 templates covering the common 60%. You wouldn't want to go completely with vendors you've never seen in the training set before, because you'll be understating what the model can really do. As with every machine learning problem, try to get an indicative validation set and test set. So, good question.

One of the things I didn't show is the process of publishing a pipeline; this is in the first video as well. Now you can actually test the pipeline: when you have a pipeline up and running, you can drop a file in here and see the result. That would be almost your test set, a little bit. We got to be doing this so much that we didn't want to do it through curl requests and other REST API ways of interacting with the pipelines, so we got that into the tool. We'll strengthen this too, so people can see the end-of-the-line predictions. With a pipeline here, I've wired up my document ingestion, my OCR method (or embedded text, if you use the dynamic option), the model that I've trained, and then post-processing, which here is going to be a vanilla post-processor; again, that's in the previous demo. When you publish, you can use other post-processors we have, and we're also making that easier to use. We use the same two almost every time, actually; it's just common things you have to do. So post-processing is not too difficult. When it gets customer-specific, the REST API output can usually work: it's familiar JSON to work with. We have some customers who use some of our post-processing inside the pipeline and also have post-processing that happens after it, to produce certain formats and things like that, which they carry out for their own use case. So a mix of both is pretty common.
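As a sketch of the kind of customer-side post-processing mentioned above, the Python below merges consecutive tokens that share a label into one field value. The JSON layout used here (a flat list of tokens with `id`, `text`, and `label` keys) is an assumption made purely for illustration; it is not the actual response schema of an H2O Document AI pipeline, and `merge_tokens` is a hypothetical helper.

```python
import json

def merge_tokens(tokens):
    """Join consecutive tokens that share a non-"O" label into one field."""
    fields = []
    current = None
    for tok in tokens:
        label = tok["label"]
        if label == "O":          # "O" means the model predicted nothing here
            current = None
            continue
        if current is not None and current["label"] == label:
            current["text"] += " " + tok["text"]
        else:
            current = {"label": label, "text": tok["text"]}
            fields.append(current)
    return fields

# Hypothetical pipeline output, standing in for the JSON a REST call returns.
raw = """
[
  {"id": 0, "text": "Referral", "label": "O"},
  {"id": 1, "text": "date:",    "label": "O"},
  {"id": 2, "text": "December", "label": "referral_date"},
  {"id": 3, "text": "15,",      "label": "referral_date"},
  {"id": 4, "text": "2021",     "label": "referral_date"},
  {"id": 5, "text": "Gender:",  "label": "O"},
  {"id": 6, "text": "F",        "label": "patient_gender"}
]
"""

fields = merge_tokens(json.loads(raw))
for field in fields:
    print(field["label"], "->", field["text"])
```

A step like this would run after the pipeline returns, alongside whatever post-processing already happens inside it.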
But you can test here, and you'll get JSON back on the screen, so that you can see how this is working on new documents. Outside of here, you don't need to do that; the annotation sets will also do it. I've got my predictions here. Sorry, these are the predictions that we have, and it's all text, as you can see. So that was my indicator of the zero line. If we got to 100 documents, the predictions will look like new predictions. Here we go. They'll look the same as everything else; they'll just be there. So this is what the actual model predicted, one that had trained on only 11 documents, 20 pages. This one has trained on different datasets that I used a little more last time. Here we've got a table going on as well, so we've got a lot of different labels here. I'm hitting the up key, by the way. Up cycles through the things that you can see here: the label, the text, and just this ID number. So we've got the text, the ID number, and the label, and you can rotate through those. My coloring is already set up to show what those are. Okay, all the yellows are for the O's. It looks like it has pre-chosen blue for the common thing that we see here. There are zoom controls and things like that; there's a button over here. So play with it, do a free trial, try it out. Let us know if there's something I still haven't covered in the two sessions. Go through the documentation, but do ask us questions if there are nuances that I haven't covered here.
