Invoice 2 Vec: Creating AI to Read Documents
This talk was recorded in London on October 30th, 2018.
Slides from the talk can be viewed here: https://www.slideshare.net/0xdata/inv...
Bio: Mark Landry is a competition data scientist and product manager at H2O. He enjoys testing ideas in Kaggle competitions, where he is ranked in the top 100 in the world (top 0.03%) and well-trained in getting quick solutions to iterate over. Most at home in SQL, he found H2O through hacking in R. Interests are multi-model architectures and helping the world make fewer models that perform worse than the mean.
- Our Problem: Extracting Specific Information from Business Documents
- Problem Challenge: Is this a solved problem?
- Deep Learning - Object Detection
- Deep Learning: Experimented with 3 frameworks thus far
- Deep Learning: Fast-RCNN
- Deep Learning: RetinaNet
- Training Data Engineering: Document reuse & labels formatted for deep learning
- Driverless AI: Our use of Driverless AI
- Driverless AI: Add Invoice2Vec recipe
Mark Landry, Product manager and Data Scientist, H2O
Branden Murray, Data Scientist, H2O
Read the Full Transcript
We are going to talk about one of the products that we've been working on just for a couple of months. It's titled Invoice 2 Vec, there's a few things we're going to talk about and what that means. But our path as H2O is helping a client create AI to read documents, to extract document information; so we'll be talking about that.
Our team–there are four of us–Branden and I have been working with a particular client: PWC–who spoke earlier today. I've been working with them for about three years, Branden for two; for this project we've stepped up a little bit. We've doubled our team, we have a couple other data scientists helping us out for this one.
And then for all the products we build, of course it's really not just the data science team. We have software engineers helping us out to build the front ends–that you actually saw on the screens; our stuff was behind them early in the day. And the customer team too; so there's a lot of people. In fact, one of them is practically on our team as well. It's a fairly big group for this one, I suppose, especially in H2O terms for making this product. We've been working on it for almost three months now. So we'll show you kind of where we're at–we are not done–we'll show you where we're headed, what the ideas are, what we struggled with, that kind of stuff. So we'll quickly look at our problem, and then we'll look at it a little deeper later.
Our Problem: Extracting Specific Information from Business Documents
We are extracting specific information from business documents; that's what we're doing. The business documents vary; the information we're pulling out of them changes, and so do the types of documents. But that's the general idea. So we have an example of one here: you can see an invoice on the left, and in this one we've kind of circled a few of the things we're looking at. Again, even for a specific document, we can look at it multiple different ways. We are going to talk about that a little bit. But here on the right is where we're really headed. Just standard old CSV; we're looking at the formatted information that we need to pull out of it. It's somewhat of a standard task; you're probably familiar with other people doing it. So what we'll talk about here: we'll dig a little deeper into the problem and the challenges in it–those that we've run into and some that are just native to this problem from when we started. We'll jump right into the high-end ones; there's a lot of algorithms going on here–we're using many different things to do many of the different tasks–but deep learning is the most exciting. So Branden's going to talk about some of the image recognition, the bounding box identification type algorithms that we've been using for this task. Time permitting, we should be able to cover a little bit about what we've learned and how we've prepared the data for this task. It's not a standard one when you're teaching these image recognition models. It's not cat, dog, mouse either, so it's a little different.
We'll walk through what we've done for that, where we succeeded, where we failed already in these three months. And then last, we'll try to finish up with how this connects into Driverless AI, the product you've been hearing about kind of all day. We are using it for this product, and we'll talk about that. Also, the real goal here is that the result of our work can be fed back into the product, that we can actually have some recipes out of this. So we'll take a look at where we're headed there–that'd be quite a while away–but that's where we're headed.
Problem Challenge: Is this a solved problem?
So the problem challenge; again, pulling information out of these documents. When I tell this to a lot of people, they go back to, "Isn't this a solved problem? Why aren't you using OCR? Why aren't you using XYZ Library? Why aren't you just using YOLO? Why is this a thing you're working on?"
And it's true; many off-the-shelf solutions do a really good job at some of this task. For example: on the top, we have a document on the left–a PDF that we would typically see; it's not an invoice. And on the right, a free tool on the internet–one you find when you first Google search–will give you something that looks really clean, really nice. It's not just OCR, where you're getting the text back. It's preserved all the structure, it's put a lot of high-end features in there, it looks quite a lot like that document on the left, and that's for free. So the state of the art for pulling some of this information is pretty good. There are a lot of off-the-shelf tools, but if you take a closer look, on the bottom we zoom into what for us is not just "a" mistake, but a critical mistake. This has used an English language OCR, and the error is in the balance column. A lot of our data's tabular–that's a little different than typical OCR and typical NLP–so we're looking at orientation; that's important to our problem as well. So here in the balance column we see–it's supposed to be euros, but EUR isn't a word, and neither is USD to be honest–that it decided the probability that it meant "Eta" was high enough that it just switched it. So it used Eta, even though it probably read the EUR very well.
That's part of what you get with some of the OCR tools. They have deep learning underneath them. They're language specific, you can pick which one you want. It's not that we chose the wrong language–this is in English–but the tool has some pros and cons; and for us, that'd be a critical con. So we can't just OCR it, although the OCR is really good on almost everything else in this document. And it's not just one tool either; we could run this a couple different ways and resolve this problem–and we will actually–but there's other things too, and there's a lot of judgment that goes on. I'll skip to the third bullet point. In some cases, we are interested in the way that this document would be accounted for.
Like Gary spoke about, how in the 1400s we got double-entry accounting, so we want to see how this document would be accounted for. And so dates are really interesting to us, but not every date on the page. We might have seven, we might have a hundred dates, and only one of them is the correct date. So OCR will give us all those dates, it'll give us the text and we can read it, but that's not the entire challenge. We have to not just extract what that is–in a nice format to read the words–we actually have to figure out which is the right one. That's our task. And that's why this is more in the machine learning realm at that point. So now you're talking about a classification model, trained on data, and that's exactly what we're doing here. So there's other people doing that too, even specifically, but we have yet to see something that would do what we want to do right out of the box–where we would actually use it. Technologies we anticipate being in our stack, which we are experimenting with now–certainly in many different ways and in series and loops–are OCR, natural language processing (NLP), and image recognition, of several different types too. So with that, I think it jumps over to Branden.
Deep Learning - Object Detection
All right. So I'm going to talk a little bit about how we've been applying deep learning models and doing computer vision on the documents. The reasons we need to do that–as Mark kind of alluded to–are that although a lot of these documents have embedded text in them, a lot of that text is not that good. Sometimes there will just be a bunch of gibberish and sometimes there will be slight misspellings, but we need things to be exact so we can match the exact words in the document to the accounting documents that we need to match them to. And then another reason is that the embedded text is usually just one giant long string of text, so you kinda lose all the structure. Kind of like Mark said, you might have a bunch of different invoice dates in a document, but we need the structure to know things like: the invoice date is usually going to be somewhere at the top of the document, maybe on the right side, or right in the middle, something like that.
What we do is we're starting to use object detection models to identify bounding boxes of where the targets are in the document.
Deep Learning: Experimented with 3 frameworks thus far
So far we've experimented with three different frameworks. We've done Fast-RCNN, and RetinaNet, which have both shown some promise so far. We've also tried YOLOv3, which hasn't been that great yet but we'll keep trying. I'll talk briefly about Fast-RCNN and RetinaNet, since those two have worked kind of the best so far and they have slightly different takes on how to find objects.
Deep Learning: Fast-RCNN
Fast-RCNN is a two-stage model. So basically it takes an image, we apply a convolutional network to it, which outputs a feature map, and from that feature map, we apply a region proposal network–which basically finds all of the possible boxes in the image that the foreground might be in; in our case, the foreground is going to be text. So it will generate probably 1,000 or 2,000 different possibilities of where a bounding box might be around specific text.
So once you have all these thousands of possibilities, a second stage applies a classification network to it, and that classification network will identify something and say, "this is an invoice date with a 90% chance" or, "this is a stock code with a 4% chance" or something like that. And then once we have all of those, we can pretty much say everything above 80% we're going to keep that. And anything lower than that, we're going to throw it away because it's not useful, it's probably wrong.
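The keep-or-discard step described above can be sketched in a few lines. This is only an illustration, not the team's code: the `Detection` class, the label names, and the box coordinates are hypothetical, and the 0.80 cutoff is taken from the "everything above 80%" remark.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    label: str                       # predicted class, e.g. "invoice_date" (hypothetical name)
    score: float                     # second-stage classifier confidence in [0, 1]
    box: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels


def keep_confident(detections: List[Detection], threshold: float = 0.80) -> List[Detection]:
    """Keep only detections whose score is strictly above the threshold."""
    return [d for d in detections if d.score > threshold]


# Made-up candidates standing in for the classifier output on proposed boxes.
candidates = [
    Detection("invoice_date", 0.90, (410, 52, 520, 70)),
    Detection("stock_code", 0.04, (60, 300, 140, 318)),
    Detection("invoice_number", 0.85, (410, 30, 520, 48)),
]

kept = keep_confident(candidates)
print([d.label for d in kept])
```

A single global threshold is the simplest choice; per-class thresholds would be a natural variation if some targets turn out to be harder than others.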
Deep Learning: RetinaNet
Then the second one that has shown promise is RetinaNet. This doesn't have two stages like Fast-RCNN does; this is a one-stage detector. The thing that makes this one unique is that they introduce a new loss function that hasn't really been used before, and that helps the one-stage detectors. One benefit of the one-stage detectors is that they're a lot faster than two-stage detectors.
So this loss function; what the loss function does is punish the hard examples a lot harder than it does the easy examples, so it's going to force the model to focus on all the hard examples; maybe if it's struggling with finding an invoice date, that's going to force the model to start focusing on finding that a lot more. So this is an example of RetinaNet results that we have so far. I think we started using RetinaNet about three weeks ago, so this is the very early stages. As you can see in the upper left there, we found the invoice number–oh sorry, all the green boxes are the actuals and all the other colors are guesses for different things–we can see in the upper left that it got the invoice number pretty much exactly correct, but right down there right next to it, it missed the invoice date and it missed the company name. It didn't guess anything there.
And then we can see down in the bottom, there are a few false positives. And one is just a box of emptiness, which is kind of surprising, but that should be pretty easy to fix. I should also say that these models were trained using–I don't know–14 or 15 different invoice formats, which in terms of invoices is like nothing. There are thousands and thousands of different formats. So considering we have used so few thus far, I think that shows pretty good promise. And I think Mark's going to talk about the training set.
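The RetinaNet loss function discussed in this section is known as the focal loss. A minimal sketch for a single binary prediction follows; the defaults gamma=2.0 and alpha=0.25 are the commonly cited settings and are an assumption here, not values stated in the talk. The factor (1 - p_t)^gamma shrinks the loss on well-classified examples, so the poorly classified ones dominate training.

```python
import math


def cross_entropy(p: float, y: int) -> float:
    """Standard binary cross-entropy for one prediction (p = P(class 1))."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p                 # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha     # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Compare how much of the plain cross-entropy survives for an easy and a hard positive.
for name, p in [("easy", 0.95), ("hard", 0.10)]:
    ratio = focal_loss(p, 1, alpha=1.0) / cross_entropy(p, 1)
    print(name, round(ratio, 4))
```

With gamma set to 0 and alpha set to 1 the function reduces to ordinary cross-entropy, which is a quick sanity check on any implementation.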
Training Data Engineering: Document reuse & labels formatted for deep learning
Yeah. So what he mentioned is one of the things we've learned. For us, looking at this from the outset, training data is going to be critical because for these models, we don't readily have answers. You can think the natural process of figuring this out would say, "Well, let's get a hold of some documents and the way they were accounted for." That's the natural way to grab data without trying to go do it specifically for your process. But there are a couple of problems with that. One is security–clients say, "Hey, that's actually kind of hard to find that pool." We're not the only ones in that boat. If you look, there's some other people solving this problem, and they've reported similar issues: confidentiality and security are important, so you can't just go raiding documents all the time.
But the real problem is that it doesn't actually help us for deep learning. What that would do–let's say we had that, so we have an invoice, we have an invoice date here. Again, like I mentioned, there might be 100 invoice dates and 50 of them might all be the right one. For us, we kind of want to learn the why; that's what deep learning is. We really want to teach it why one is the right one, not what the answer is–the actual date, January 3rd, doesn't matter at all. We want it to learn the flexibility of the thing that follows "invoice date" in many different languages, many different representations; sometimes it doesn't say that at all. That's the thing that we want to learn. So we've gone after these bounding box models; you can look in the bottom right–this is a spreadsheet view of it–but this is the data we've had to collect, which is bounding boxes. For those that are familiar with this world: annotations; obtaining these labels goes by a lot of different names.
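To make the spreadsheet view of bounding-box labels concrete, here is a small sketch of what one labeled row might look like and how a list of them could be written out as CSV. The column names, file name, and coordinates are all hypothetical; the talk does not specify the actual schema.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class BoxLabel:
    document: str   # source file name (hypothetical)
    page: int       # 1-based page number
    label: str      # target class, e.g. "invoice_date"
    x_min: int      # box corners in pixels
    y_min: int
    x_max: int
    y_max: int


def to_csv(labels: List[BoxLabel]) -> str:
    """Serialize labels to CSV text, one bounding box per row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=[f.name for f in fields(BoxLabel)], lineterminator="\n"
    )
    writer.writeheader()
    for lab in labels:
        writer.writerow(asdict(lab))
    return buf.getvalue()


labels = [
    BoxLabel("invoice_001.pdf", 1, "invoice_date", 410, 52, 520, 70),
    BoxLabel("invoice_001.pdf", 1, "invoice_number", 410, 30, 520, 48),
]
print(to_csv(labels))
```

Note that the row records where the target is and what class it is, not the value itself, which matches the point that the actual date does not matter for training.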
This is how you do cat, dog, mouse problems. The world doesn't often come labeled for images, so we're not the only ones trying to do this. There are services out there that will do that for you. But then again, security comes into play, and some other kinds of questions. So we've gone at it ourselves, going simple; the original plan was to create templates, so we would take care to get a few documents really correct–and you can still get through these almost as quickly as you can with an online sort of thing where you go at high speed. But if we capture it in a certain way, we could actually get real examples of that recorded data, get that structured data, and load a lot of different examples, so that it's not learning that the item of interest is a chair–anything can pop into that chair blank–or, for the dates, we can rotate dates, change the formats, move them around a little bit.
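The idea of rotating dates and changing their formats inside a template can be sketched as follows. The format list, the label variants, and the date range are illustrative assumptions, not the team's actual settings; the sketch only varies content and leaves page layout untouched.

```python
import random
from datetime import date, timedelta

# Hypothetical pools of surface forms to draw from when filling a template.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%b %d, %Y"]
LABEL_VARIANTS = ["Invoice Date", "Date of Invoice", "Rechnungsdatum", "Date"]


def random_invoice_date(rng, start=date(2017, 1, 1), span_days=730):
    """Draw a date uniformly from [start, start + span_days)."""
    return start + timedelta(days=rng.randrange(span_days))


def render_date_field(rng):
    """Return (label text, formatted date text, underlying date) for one template fill."""
    d = random_invoice_date(rng)
    text = d.strftime(rng.choice(DATE_FORMATS))
    label = rng.choice(LABEL_VARIANTS)
    return label, text, d


rng = random.Random(0)  # seeded so the generated examples are reproducible
for _ in range(3):
    print(render_date_field(rng))
```

Keeping the underlying date next to the rendered text means the structured answer is known for every generated example.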
But that's been the struggle: we can't easily move things around in a formatted PDF–or we haven't yet. We're experimenting with that; so far, we have not done a great job. What that's left us with is putting a lot of energy into getting a few templates. We could get a thousand of these out, but with a thousand different versions of one document–even if we switch up the characters and all that kind of stuff–we overfit massively to the structure of these. It learns that structure, and if you put a new structure in, it stopped too early, it can't generalize to solve the new template. So that's kind of where we are at; the more templates we add, the better off it is, but the batch size is down to 100, and really it could be as low as 1.
That has been one of the big lessons for me. I like the power of having these templates where I can doctor them and work with them in a way that's not typical with images. You know, the PDFs are structured a lot more than regular images. I want to use that control and try to essentially randomize these–perturb these just enough–so that the deep learning doesn't overfit, but that "just enough", we haven't gotten there yet. A simpler method is kind of what I alluded to earlier; online annotation is a simpler way. There are people that have this as a solved problem and their tools are just a little faster than ours, but when you're doing a lot of these, a little faster is helpful. And so we have thrown 1,000 of these invoices into a tool where we're just kind of drawing boxes over things–which is an interesting process too.
As I have kind of said a few times, we want different label types; we have different models doing different things. And so in the one here, I've got giant boxes of tables highlighted in blue, because in one of the styles of models we want to look at, we don't want to worry about every single line item–we want to have a model that is really doing kind of a simple first pass. So, things you can do at a header level–if you think of header and detail–whether we had 1 line, 100 lines, 1 page, 10 pages, show us where the table of the data is. That's just one model we want to use in our arsenal, and so that's a different labeling state. Now, if we wanted to use this same thing to get something else, we're going to have to click again and draw these boxes.
So it's been interesting–labor intensive is definitely something I think we knew we'd be up against because we had to go from zero to something. Are there Mechanical Turks and other sorts of ways of doing this? Yes, there are. Our process is not automated well enough–I think we would make good use of that–but we will look further, because with deep learning, the more data you have, the better this will be. And it needs to be of more variety; we're missing variety and we need to get there. But that's where we're at. Again, we're three months in, but we are paying a lot of attention to this. Our fourth member, Shanshan Wang, is a data scientist, and she's aimed at this task.
She's our training data owner, really. And she is experimenting with lots of ways of adjusting the PDFs so we can get these deep learning models to work really well when we need them most–which is when we kind of have crummy scans and all of our other methods aren't returning very good information.
Driverless AI: Our use of Driverless AI
The last bit is how this plugs into Driverless AI. The part where we're currently using Driverless AI, I think, is a good story because it shows the power of Driverless AI. Early on when we were creating some of these templated data sets, we had a lot of data that would fool an NLP algorithm, but not so much that would fool the image recognition models–the ones that need structure. We have a hard time manipulating structure, but the content we can manipulate a lot. So those are more successful and those are the ones we're paying attention to in a different kind of track, if you will. So Branden and Shaikat tend to be doing the deep learning models. I've been doing some of these other ones. And so one of the first ones to come alive was this: we started with almost first principles, like, "Just pick the first date you see, just as a benchmark." How well does that do? Not very well, but at least we have something. So the next one was to use Driverless AI. The realization was that a really simple way of getting what I'm looking for is available when we do have that embedded text, or when we can use OCR to obtain it–which is the large majority of the documents, in fact. We can obtain reasonably good text with that.
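The "pick the first date you see" benchmark mentioned above can be sketched with a regular expression over the extracted text. The pattern below is an assumption that covers only three common layouts (ISO, slash-separated, and day-month-year with a spelled-out month); a real benchmark would need more formats and languages.

```python
import re
from typing import Optional

# Hypothetical pattern: 2018-10-30, 30/10/2018, or 30 October 2018.
DATE_PATTERN = re.compile(
    r"\b(\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4}|\d{1,2} [A-Z][a-z]+ \d{4})\b"
)


def first_date(text: str) -> Optional[str]:
    """Return the first date-like string in reading order, or None if there is none."""
    match = DATE_PATTERN.search(text)
    return match.group(0) if match else None


sample = "Order placed 12/09/2018\nInvoice Date: 30 October 2018\nDue 2018-11-30"
print(first_date(sample))
```

In this made-up sample the first date is the order date rather than the invoice date, which illustrates why such a benchmark does not do very well.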
And that's kind of on the right. If we look at every single line just in simple form, we can use our existing labels to figure out whether these targets were in there. Once we had the bounding boxes, we had another way of going after them in simple terms. And so the rightmost column is actually going to say the target class, and this is just a standard machine learning problem at that point, with text, with NLP; and we have that in Driverless. Once I created this data set, I fed it into Driverless–again, we're still struggling with the variance even for the NLP, and we've gotten a lot better since I tried this model–but it was at 100%, 99.9 AUC, almost immediately, like within five minutes of running a Driverless model.
The work was to get the data to go into it; the model itself was really simplistic because it's actually pretty easy to figure out all the terms we fooled it with. We have kind of a list of terms we put in there to jostle it a little bit, and it figured those out pretty quickly–enough of them. Some of the structure, a little bit of the structure; it hasn't figured out too much from vertical position on this one, but it does a really good job of figuring out what we were looking for. Using Driverless was actually pretty easy; it's more about thinking of how you get the data into Driverless, and I think that's kind of a data munging problem. Since we're so familiar with our training process here, I think we will see this time and time again: just in the way that we're going to have multiple different image models going after different variations of the targets, I anticipate we will have multiple different Driverless models out there, all doing different things. So right now, the Driverless model will only take you halfway, just like an image model will. If it shows you the bounding box, that's great; we still need a formatted answer and structured data. That's not the hard part necessarily–or the hardest part. But we need it from here too, because I've labeled the entire string. It's a pretty stupid parser–simplistic parser, I should say–every new line is a record, and we just encoded as a target whether that string was present or not. So we then have to extract the data out of this one too. But that's fairly simple actually with what we're looking for with 2 of these 3 targets–trivial, in fact. So this was nice to see.
I didn't have to think much; that's the idea of Driverless. I spent all my time thinking of everything else about this problem; the model itself, I didn't actually code at all. I didn't have to do any of the typical NLP transforms; those are done–some of them are done–and we're making that better. But our target encoding and our TextCNN were used for this one to do the job, so that was pretty good.
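The line-per-row data set described in this section can be sketched as below: split the document text on newlines and mark each line with the target class whose labeled string appears in it. The function name, the "none" class, and the sample document are hypothetical; the output is the kind of table one would then hand to a text classifier.

```python
from typing import Dict, List


def build_line_dataset(document_text: str, targets: Dict[str, str]) -> List[dict]:
    """One row per non-empty line; `targets` maps target class -> labeled string value."""
    rows = []
    for line_no, line in enumerate(document_text.splitlines()):
        line = line.strip()
        if not line:
            continue  # skip blank lines but keep original line numbering
        target_class = "none"
        for cls, value in targets.items():
            if value in line:  # simple presence check, as in the simplistic parser described
                target_class = cls
                break
        rows.append({"line_no": line_no, "text": line, "target": target_class})
    return rows


doc = "ACME Ltd\nInvoice No: INV-1042\nInvoice Date: 30 October 2018\n\nTotal EUR 1,250.00"
targets = {
    "invoice_number": "INV-1042",
    "invoice_date": "30 October 2018",
    "total": "1,250.00",
}
rows = build_line_dataset(doc, targets)
for row in rows:
    print(row)
```

Because the whole line is labeled, a second step is still needed to pull the formatted value out of a line the model flags.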
Driverless AI: Add Invoice2Vec recipe
The real goal of this project–and this is kind of exciting–is this: a lot of the work we've done previously, we haven't been able to feed back into the tool other than as reports. It's helpful to have us with the customers; since I've been working directly with a customer, it feels like you find something new every time you deal with a customer data set.
And that's true. Before I worked at H2O, everything was just Kaggle; I was amazed at the diminishing returns–the slope is a little different than normal. It felt like you can keep doing competition after competition and every data set is going to bring out something different. And here we have the opportunity to push this back into Driverless AI, and that's the connection to you. At some point our goal is that you can pull down the results of our work. You know, sometimes that's maybe too narrow for a lot of people, so a specific version would be this Invoice2Vec, something like that. We'll load a document–maybe it's not an invoice, maybe it's something else–and we have the terms we've looked at, very much like a deep learning pre-trained model in that world, and a vector of what we think the probabilities are for the known classes that we've gone after in the past, trained on ours. That's a possibility.
More generally–either in addition, or if that proves too hard, though I don't think it will. That's really our goal: to get that in there in some way, shape, or form, because we're going to use it ourselves. That's the best way of building these products; we used to build H2O-3 that way: when a customer didn't have what they needed, we'd work on that. So that's the way we're kind of trying to do this one. But in general, there's a couple other ways we can do that as well, just taking the tips and tricks; that's how Driverless was built.
It's taking Grandmasters who've done hundreds and hundreds of data sets; the learnings from that are how Driverless came up–specifically Dimitri–but it's been added onto. That's what we could do too, as we are experimenting with the image domain, that we're not paying too much attention to. Otherwise, at H2O maybe we are allowed to–or we are able to–advise on how that comes into the tool. Or maybe we just extend the NLP models because we have that kind of need too. We're already using the NLP models. We can already see we need a variant of that. We'll probably see maybe all 3 of these classes improve. And that's it, we're out of time. Thanks for listening and hopefully you'll see the results of our work soon.