In this Technical Track session at H2O World Sydney 2022, SimplyAI’s Chief Data Scientist Matthew Foster explains his journey with machine learning and how applying the H2O framework resulted in significant success on and off the race track.
I’m Matthew Foster, the Chief Data Scientist for SimplyAI. So, I’m going to take you through a little journey of mine, how I got to this stage, and what I’m using H2O for. And any questions throughout, please feel free to ask as we go along. It’s very informal.
So about myself, I swapped the wind and rain of London for the cool breeze of Sydney about 15 years ago. And since that time, I’ve been working on data analytics and real-time startups, all with the theme of data science and real-time visualizations, and working with businesses in Australia and worldwide to get their real-time analytics and data sorted out.
So during that time, I’ve been through lots of different technologies and I’ve evaluated lots of things and it’s been a little journey for me. And so this leads me now to the next person or animal. This is a five year old stallion from France. Now, can anyone guess the name of this horse? Gold Trip.
And Gold Trip was the winner of the Melbourne Cup last week, if any of you knew that. And what I did was build a Machine Learning model, which predicted the winner. And so out of all of the startups and exits I’ve done and code I have written over the years, I think it was one of my proudest moments to predict the winner of the Melbourne Cup. It was a really good achievement and my friends can vouch for this. They all placed their bets and we probably spent most of the winnings on the day. You have a question?
How much did you bet?
I had $200 and it paid $21. And some of my colleagues had win and place bets on the same horse. So they did much better than me. So it was a really interesting one.
But the journey to get to this stage started many, many years ago. And it was when I first came to Australia, one of my first bosses actually said, “Matthew, can you do something to predict the horses?” I said, “it’s not that easy,” right? And so I started off with the rules engine, got the data in, and it just failed and I forgot about it for several years.
And then in 2015 I started looking at different solutions for this. And then Machine Learning started to evolve. And the CPUs on the laptops went from one core to two cores, then to four cores.
It started to become more interesting. So I looked all the way through those years at these different technologies. I looked at PyTorch, I looked at TensorFlow, I was looking at the GPU training, all good stuff. I looked at Spark clusters, and then I stumbled across H2O. And so in my house, I had several laptops there, and I linked them all together in an H2O cluster. And I started then preparing, learning, getting the data in, and training the models for the horse racing. But it takes time. And so over the next few years, we were preparing the data. So in a typical horse race, you might have 10 or 12 horses. We’ve ended up with a data set where each horse has around a thousand features per race.
And it goes down to all these levels. Things like the number of races it’s run, the trainer’s accuracy, the jockey. There’s all sorts of things going in here. And even for horses with no form, we can bring in the form of their parents. So the horse’s parents are a great indicator. So there’s all these tricks and it’s taken years to get to this stage. And eventually, this was introduced around two years ago. And we’ve been using it since, gradually increasing the bets. And it finally came to fruition last week in the Melbourne Cup. So I’m really pleased with that.
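As an editorial illustration of the parent-form trick described above, here is a minimal sketch in Python. It is not the speaker's code: the `Horse` class, its fields, the choice of win rate as the feature, and the `0.0` default are all hypothetical, chosen only to show how a feature for a horse with no form might fall back to its parents' form.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Horse:
    """Hypothetical record; the real data set has around a thousand features."""
    name: str
    starts: int = 0  # number of races the horse has run
    wins: int = 0
    sire: Optional["Horse"] = None
    dam: Optional["Horse"] = None


def win_rate(horse: Horse) -> Optional[float]:
    """Return wins / starts, or None when the horse has no form."""
    if horse.starts == 0:
        return None
    return horse.wins / horse.starts


def win_rate_feature(horse: Horse) -> float:
    """Use the horse's own win rate, else the mean of its parents' win rates."""
    own = win_rate(horse)
    if own is not None:
        return own
    parents = [p for p in (horse.sire, horse.dam) if p is not None]
    parent_rates = [r for r in (win_rate(p) for p in parents) if r is not None]
    if not parent_rates:
        return 0.0  # assumed default when nothing is known about the pedigree
    return sum(parent_rates) / len(parent_rates)


sire = Horse("Sire", starts=10, wins=4)
dam = Horse("Dam", starts=20, wins=2)
debutant = Horse("Debutant", sire=sire, dam=dam)
print(win_rate_feature(debutant))
```

A real pipeline would apply the same fallback idea across many features rather than a single win rate.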
So that’s where I studied and learned about H2O. It started as a small bedroom project, and I knew it was working because it was delivering good results.
We’re getting about 75% placed bet accuracy on the model which is pretty cool.
So then what happened? So my good friend Darren who was in the audience said, “Matthew, how can we apply that horse racing to business?” I’m like, “okay, sure. Let’s start something.”
So SimplyAI basically, Darren called me over a few years ago to set up the AI Machine Learning division, which is on this slide here. So SimplyAI is a partner for several businesses. We’ve got about 50 or 60 businesses that are using SimplyAI’s consultants and technology and expertise. We’re focused on automation. So intelligent automation, which is robotic process automation. So, all of the data, unstructured and structured data, can be processed automatically by a robot, replacing manual effort.
Those result in savings. We’re looking at 10 million automation jobs so far this year that have been run by companies using SimplyAI’s expertise, saving around $8 million.
So we’ve got basically the groundwork for this. We’ve got the data architecture experts, the analysis, the development implementation, all of these consultants, we basically provide the foundation for Machine Learning. Without data, there’s no Machine Learning. It’s data, data, data. And so we’ve been partnered with H2O for around five years helping companies implement this technology.
We’ve also got some other cool stuff going on as well. We’re designing vertical applications for insurance, for business, for government, for healthcare. And so instead of just selling general horizontal frameworks, we can actually sell point vertical solutions.
So let me just talk a bit about what we’ve done recently. We helped a big insurance company. They were trying to use Azure and it basically failed. Now I’ve tried all of the different Machine Learning offerings from Azure, from Amazon. And Amazon, if you look at SageMaker, after a few hours looking at it, you’re just tearing your hair out. It’s slow. It’s not well designed.
With H2O, you can get something up and running in hours. Get the model trained, get the predictions running. So in terms of getting something actually into production, H2O is the simplest way to do it. And I like simple things. It’s easy, it’s reliable, it’s robust.
And so that’s what we’ve done for this customer. They’ve got their first model in production now doing claims predictions. Probably at an estimate of around $2 million in savings per year, and it could be much higher than that. But that’s a conservative estimate.
And so in terms of the H2O product. As I said, it’s simple to use, it’s easy to learn, and those are the key things for getting something in production running reliably.
So how did we do it? So this customer used Snowflake for the central repository. And with the data from Snowflake, there’s a lot of engineering going on here. So as with the horses, where we have a thousand features per horse, with a claim there are many dozens and hundreds of variables that need to be added onto that claim when it comes in. So that’s where the preprocessing comes in. The preprocessing is done in Snowflake, and the model is then scored in Snowflake as well. So all of that data then is in Snowflake.
So the model from H2O is uploaded into Snowflake, the MOJO, and it’s stored there. And then the scoring function, it’s the UDF function in Snowflake, that’s all scored inside Snowflake as well. Then what happens is the results then come out, we have a post-processing script which stores the results, and so it can be analyzed and that’s critical for the data drift and for the model accuracy dashboards. And then eventually the results then go to the internal business application.
So H2O needs this integration, so you have to get the data into H2O first, get the model back into Snowflake, then the scoring occurs, and then out come the predictions at the end.
So there are challenges with this. One of the challenges with having the scoring function inside Snowflake is that it uses Java internally. So each function call actually starts a Java runtime. So there’s some challenges there that we need to look at.
The better way to do it, which is what we’re moving to next, is actually using the MLOps endpoint, and then it can be a faster and more real-time scoring system. But that’s the way we set this up. We had a limited time to do this and we’ve found the best way to do it was to do it inside Snowflake.
So some of the things that we did with this customer. I always like to look at these three foundations. They’re simplify, automate, and monitor. I love these three things.
To simplify, what we did was we looked at the preprocessing code and we found that about 40% of it could be removed. And so if you can take 40% of the code out of something and get the same results, you’re going to have a more maintainable solution in the end.
Automation, what we did here was we looked at the test cases that were required for the model accuracy, and we actually automatically generated the test cases as well as scheduling them. So when the model was going between environments, all of the test cases can be rerun and the results can be then automated onto a dashboard and alerting for the system.
And then monitoring as well, which is critical for models in production. You have to monitor the performance. Is the data that your model is running in production the same as what it was trained on? And that’s where we’re able to look at the model drift and data drift.
And so just the same with horse racing. If the horse’s behavior changes, then it’s going to change your results. But what we’ve found with that, and it’s with most things, is that once the model is running, you can then get a really good angle on the results. And as long as you’re monitoring it every day with alerting then you can really get good performance from the system.
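As an editorial illustration of the data drift check described above, here is a minimal Python sketch. The talk does not say which drift metric was used; the Population Stability Index (PSI) below is a swapped-in, commonly used choice, and the function name, bin count, and sample data are assumptions.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index of `actual` against the `expected` baseline.

    Bins are equal-width over the range of the training (expected) values.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        eps = 1e-6  # floor so that empty bins do not produce log(0)
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = [i / 100 for i in range(100)]         # feature values seen in training
production = [0.5 + i / 200 for i in range(100)]  # shifted production values

print(psi(training, training))    # identical distributions give zero
print(psi(training, production))  # a large value signals drift
```

A frequently cited rule of thumb treats a PSI above roughly 0.2 as a shift worth alerting on, but any threshold should be tuned per feature.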
So the other use case we looked at was with NLP Hydrogen Torch. And this is a much more complicated use case. The Driverless AI is great at doing classification and regression, and to be honest with you, it’s much easier than NLP. NLP is difficult.
And so what we were trying to do here was, we were looking at policy documents and within the policy documents they had around 20 to 30 actually labeled. So we labeled the paragraphs in those policies manually. And then we started to see, okay, can we use that model to then predict the labels in the other 50,000 policy documents? So these policy documents, each one’s a different format. There’s like 50,000 of them, they’re all completely different. And we used Hydrogen Torch for this and the model that it builds from this can be built in maybe 10 or 20 minutes of training.
And then what we did, to generate the scoring, we have to then take a candidate policy document and we generate all possibilities of paragraphs, so from one to 50 sentences in length, and then we score those paragraphs against the trained model and it’s able to then predict where the paragraph is in the document that we’re actually wanting to label. And so we’re able to then look at this and look at the risk of each of these 50,000 policies that we had and then start looking at the trauma definitions, risk definitions, and it’s pulling them out, allowing the user to see what’s inside this document.
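As an editorial illustration of the candidate generation described above, here is a minimal Python sketch that enumerates every contiguous run of 1 to 50 sentences and keeps the highest-scoring one. The sentence splitter is deliberately naive, and `toy_score` is a hypothetical stand-in for the trained Hydrogen Torch model, whose real interface is not shown in the talk.

```python
import re
from typing import Callable, Iterator, List, Tuple

Candidate = Tuple[int, int, str]  # (start sentence, end sentence exclusive, text)


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def candidate_paragraphs(sentences: List[str], max_len: int = 50) -> Iterator[Candidate]:
    """Yield every contiguous run of 1..max_len sentences."""
    n = len(sentences)
    for start in range(n):
        for end in range(start + 1, min(start + max_len, n) + 1):
            yield start, end, " ".join(sentences[start:end])


def best_candidate(sentences: List[str], score: Callable[[str], float],
                   max_len: int = 50) -> Candidate:
    """Return the candidate paragraph that the scoring function rates highest."""
    return max(candidate_paragraphs(sentences, max_len), key=lambda c: score(c[2]))


def toy_score(text: str) -> float:
    """Hypothetical scorer: reward the keyword, lightly penalise length."""
    return (1.0 if "Trauma" in text else 0.0) - 0.01 * len(text)


doc = "Intro text here. Trauma means injury. Cover ends there."
sentences = split_sentences(doc)
print(best_candidate(sentences, toy_score))
```

The number of candidates grows as roughly the sentence count times `max_len`, which is consistent with scoring a document taking tens of seconds rather than being instant.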
And so what we did was we wrote a Wave app. And Wave apps are useful for the technical side of things. If it was a Power BI dashboard connecting to Snowflake or a database, that’s fine, it’s easy, but for this, it needed that complicated backend to actually split out this PDF file into all of the different candidates and then do the scoring internally.
So this little Wave dashboard allows the user to upload a PDF file and then it takes around maybe 30 to 60 seconds and it comes back with all of the labels of the paragraphs inside the document. It then stores the results of that and then the user can then search for a definition they’re interested in. It will tell them all of the policy documents that contain that definition.
They can also do a free text search as well. So full-text search is something that I’ve been doing for a very long time and we’re able to just do a little extension of that in the Wave app. So the Wave app itself can have a trained full-text search library inside, allowing you to do full-text search across thousands of documents instantly, all within the Wave app. And the user can then click on a policy document and then actually see the PDF file that they were searching for. So that’s basically what we’ve been doing for the last couple of months with this customer.
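As an editorial illustration of in-app full-text search, here is a minimal Python sketch of an inverted index. The talk does not name the search library embedded in the Wave app, so the `InvertedIndex` class, its tokenizer, and the sample documents are assumptions that only show the general technique.

```python
import re
from collections import defaultdict
from typing import DefaultDict, List, Set


class InvertedIndex:
    """Maps each token to the set of document ids that contain it."""

    def __init__(self) -> None:
        self._postings: DefaultDict[str, Set[str]] = defaultdict(set)

    @staticmethod
    def _tokens(text: str) -> List[str]:
        # Lowercase and keep runs of letters and digits only.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id: str, text: str) -> None:
        """Index one document under the given id."""
        for token in self._tokens(text):
            self._postings[token].add(doc_id)

    def search(self, query: str) -> Set[str]:
        """Return ids of documents containing every term in the query."""
        terms = self._tokens(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


index = InvertedIndex()
index.add("policy_a.pdf", "Trauma definition: a serious injury or illness.")
index.add("policy_b.pdf", "Risk definition: exposure to loss.")
print(sorted(index.search("definition")))
print(sorted(index.search("trauma definition")))
```

Lookups touch only the postings for the query terms, which is why this style of search stays fast across thousands of documents.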
And so in terms of H2O, I’ve shown here how we can use it for horse racing. We can use it for claims predictions, and for policy definition lookups as well. So it’s a whole range of different use cases.