What's New in H2O Driverless AI at H2O World Sydney 2022
H2O Driverless AI empowers data scientists to work on projects faster and more efficiently by using automation to accomplish key machine learning tasks in just minutes or hours, not months.
Speakers:
Dmitry Larko, Data Scientist, H2O.ai
Dmitry Larko:
I'm going to tell you about what's new in Driverless AI. Driverless AI is a pretty well-known tool for automated machine learning. In today's talk, I'm going to speak about the highlighted topics. I'm going to tell you what we have in Driverless AI and what it has to do with a low-code, open source tool we have at H2O.
Evolution of ML and AI
Well, first of all, over the last 50 or so years machine learning has come a long way, from quite simple models to quite complex ones. That has been driven not just by the revolution in the algorithms themselves but also by the data we have on our hands. People are actually able to store more data. For example, the very first algorithm in machine learning we have was k-nearest neighbors, which was a very good approach for small data sets. Not just small in size, but also small in the number of columns you can actually use. As soon as you have more columns, k-nearest neighbors tends to break. After additional research, and with more requirements from business and from real life, the algorithms started to evolve into more complex ones. For example, we have decision trees, and decision trees can handle some nonlinearities in your dataset. They also try to mimic a human way of thinking about problems: if, then, else, then what?
After that, we switch gears and go to the neural net revolution. We also have different algorithms like gradient boosting machines and random forest, which are able to handle tabular data, where the columns have a different nature, fairly well compared to deep learning, for example. Later down the road, H2O comes into play. H2O as a company started from a promise of building fast, scalable, reliable algorithms, like you saw in the previous presentations about H2O open source. It was all about allowing business users, allowing users in general, to not care about scalability and how to handle huge amounts of data. Just give us the data, and the cluster will take care of you.
Immediately we recognized a problem: the algorithms were offered, but what about hyperparameter tuning? We need to tune parameters for, let's say, a gradient boosting machine. For example, how many trees to train? What learning rate to apply, what kind of regularization to apply? That brought us to the second iteration, which was H2O open source AutoML. Now you are not just able to train the model. You're also able to put your data in and ask the algorithm to build the model for you, and it will do not just the model building but also find the best hyperparameter settings for you as well. Not just that, it's also able to ensemble different models together to get even better results.
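To make the tuning problem concrete, here is a minimal sketch of an exhaustive grid search over the kinds of parameters mentioned above. It is an illustration only, not how H2O AutoML searches: the parameter names (`n_trees`, `learning_rate`, `reg_lambda`) and the scoring function are hypothetical stand-ins for "train a gradient boosting machine and score it on validation data".

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid; return (best_params, best_score).

    score_fn takes a dict of parameters and returns a validation score,
    where higher is better.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    """Hypothetical stand-in for training a model and scoring it.

    It simply peaks at n_trees=200, learning_rate=0.1, reg_lambda=1.0.
    """
    return (
        -((params["n_trees"] - 200) ** 2) / 1e4
        - (params["learning_rate"] - 0.1) ** 2
        - (params["reg_lambda"] - 1.0) ** 2
    )


grid = {
    "n_trees": [50, 200, 500],
    "learning_rate": [0.01, 0.1, 0.3],
    "reg_lambda": [0.0, 1.0, 10.0],
}
best_params, best_score = grid_search(grid, toy_validation_score)
print(best_params)
```

Even this small grid means 27 training runs, which is why an automated search, and good starting presets, matter in practice.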
While we were building H2O open source, we were right in the middle of the Python revolution in machine learning. A lot of libraries were available in Python, and Python was not slow, it was quite fast. It was getting attention and recognition as one of the key choices of language for machine learning and data science. Also, Java had some limitations from our perspective. It didn't allow us to iterate quickly. We decided to build our next product in Python, which enables us to use all the available open source libraries and iterate quickly enough to get things done. That's how Driverless AI was born.
Also, there is one more important thing that remains almost untouched by H2O open source, and I'm talking about feature engineering. H2O open source does have some feature engineering capabilities, but it's not feature engineering per se. It's more like feature representation. It's able to present a feature, let's say a categorical feature, in a one-hot encoding format, but it doesn't search for the best feature representation automatically for you, which Driverless AI does. After we built Driverless AI, we basically opened the door to a new family of products, like Hydrogen Torch, which you have seen, and Document AI.
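As a rough illustration of what "feature representation" means here, the sketch below encodes one categorical column in two different ways: one-hot encoding and frequency encoding. The function names are hypothetical and this is not code from H2O or Driverless AI. Feature engineering in the sense described above would try several such representations and keep the ones that help the model most.

```python
from collections import Counter


def one_hot_encode(values):
    """Return (categories, rows): one 0/1 indicator column per category."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows


def frequency_encode(values):
    """Replace each category with the fraction of rows that share it."""
    counts = Counter(values)
    total = len(values)
    return [counts[value] / total for value in values]


colors = ["red", "green", "red"]
categories, one_hot_rows = one_hot_encode(colors)
frequencies = frequency_encode(colors)
print(categories, one_hot_rows, frequencies)
```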
Capabilities of H2O Driverless AI AutoML
What are the capabilities of Driverless AI in terms of AutoML? First of all, we try to put the experience of Kaggle grandmasters into the product: common pitfalls to avoid, the best approaches for handling the data and the missing values in the data, and the best presets of hyperparameters to start from, because we still have a cold start problem, and we need to find the best hyperparameters as fast as possible. We have presets of the best hyperparameters for XGBoost and LightGBM inside the product, which allows you to build fairly good models immediately after the start and then slowly make them better and better. It's fast and scalable. We have been pushed constantly by the CEO of the company to build an approach in Driverless AI that is not just very reliable but also fast enough to be useful to the business. The less time you spend building a model, the more time you can spend thinking about what kind of data to put into the model, into the product, into the tool, and about the business problems you can actually solve using it.
It's also multimodal. It can handle image data and textual information for you, which sometimes can be quite handy. One of the biggest features of Driverless AI is that it's fairly easy to deploy into production. As a result of your experiment, as a result of a Driverless AI run, you can have a deployable Python wheel package, if you choose. Or it can be a deployable Java object, which is fairly easy to use.
Easy Deployment of Complex Pipelines
Also, the pipeline you deploy can be quite complex, but that complexity is hidden from you. I mean, inside this package, inside this module, inside this Java object, for example, there can be the full set of feature transformations needed for the model to run, and the model itself. It's completely ready to use. You don't need anything except this package to run it.
Common Usage Patterns
What does the standard process of a Driverless AI run look like? This is the common usage pattern. Basically, you upload the data, you check the data, you click that you want to predict on the data, and on the second screen you identify the target, and you might identify a test set, maybe a validation set if you have one. You hit run, you launch the experiment, and basically, you're done. The remaining thing you can do is check the warning log for any warnings identified for you. But let's say Driverless AI does identify something suspicious in your data. The only way you can react to that is to stop the experiment, check your data, change something in your data, maybe change something in the Driverless AI settings, and start from scratch. That means you still need some data science skills and some knowledge about the data set you have before you run. It implies Driverless AI still expects more or less clean data from you. Also, some default parameters might not be applicable in your particular case. For example, by default Driverless AI might think you need a Java deployable object as a result of the experiment. That also limits the capabilities of Driverless AI in some sense.
For example, we don't have a BERT model in Java yet. We are working on that, but it's still not in Java. You have to have a Python package to be able to run the scoring pipeline in that case. But if you select the module, if you select Java, that means the BERT model is going to be disabled by default, and that's something you have to know. You can learn it from the documentation, but you know, it's strange that the model doesn't tell us beforehand. Basically, what I'm trying to explain here is that there is some lack of interactivity in the model. You have to think beforehand about what you can do and what you cannot, and what you'd like to get from Driverless AI, and in Driverless AI you can actually set up quite a lot of things inside. It's quite flexible in terms of what you can do and how you can design your experiment. It is also quite demanding. You have to know a lot of stuff.
H2O Nitro
That's why we decided to start building a different version in Driverless AI. That's the moment Nitro came into play. Nitro is a low-code solution. It's open source and built by our team. It basically started from Wave, and then Nitro was released. The whole idea, the whole promise of Nitro, is that you can build rich web applications in Python. You don't have to learn anything other than Python. By writing Python code, you can build a rich application in a matter of minutes. We are going to use this tool inside Driverless AI as well. Basically, we're trying to use our own libraries, because that's definitely a good idea. Nitro is going to be used to build the wizards inside Driverless AI, which helps us to iterate fairly quickly and build a lot of different wizards with hardly any limitations.
Wizard Enables Interactive Exploration
For example, this kind of wizard will allow us to run an interactive experiment and add interactivity to Driverless AI. Before you run the experiment, you can click the wizard, and it will guide you through a series of questions and a series of checks on your data. For example, we're going to check the data for leakage. By leakage here, we mean features which are suspiciously good at predicting the target. Driverless AI can immediately identify that and say, "Hey, you know what? Using this feature, we can have an AUC of 0.99. That's great, but maybe it's too high, actually. There might be something wrong with this feature. Would you like to remove this feature from the data set?" It basically asks the user for input.
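A leakage check of this kind can be sketched as follows: score each feature on its own by how well it ranks positives above negatives, and flag any feature whose single-feature AUC is suspiciously high. This is a minimal sketch under stated assumptions, not the Driverless AI implementation. The function names, the toy columns, and the use of 0.99 as the cut-off (borrowed from the example in the talk) are all hypothetical.

```python
def auc(scores, labels):
    """AUC as the chance a random positive outranks a random negative.

    Ties count as half a win. Labels must be 0 or 1.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def flag_leaky_features(columns, target, threshold=0.99):
    """Return {feature_name: strength} for features that alone reach threshold."""
    flagged = {}
    for name, values in columns.items():
        a = auc(values, target)
        strength = max(a, 1.0 - a)  # direction of the relationship does not matter
        if strength >= threshold:
            flagged[name] = strength
    return flagged


# Toy data, invented for illustration: "refund_issued" mirrors the target.
target = [0, 0, 1, 1, 0, 1]
columns = {
    "age": [30, 45, 41, 52, 38, 29],
    "refund_issued": [0, 0, 1, 1, 0, 1],
}
suspicious = flag_leaky_features(columns, target)
print(suspicious)
```

An interactive wizard would then ask the user whether each flagged feature should be dropped, rather than deciding silently.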
Also, as I mentioned before, it can guide you through the expert settings. Let's say it can ask whether you actually need a Java object for your production code. Are you actually building the model to put it into production? At the end, for example, it also shows you Python code for how to run this experiment with these particular settings from Python, not from the user interface.
Future: More Wizards
In the future, that's going to be the first wizard, the one that guides you through setting up an experiment for building a prediction model. In the next release, there are going to be various different wizards available. For example, you can join data sets inside Driverless AI without leaving Driverless AI at all. You can also calculate different business impacts inside Driverless AI as well.
Under The Hood Improvements
Basically, for the last three or so months, most of the changes have been made under the hood. We spent a lot of time building a new baseline for Driverless AI and started constantly checking Driverless AI performance against this baseline and against other AutoML tools on this baseline. What is good about this baseline? I specifically made this table unreadable, the numbers are very small, because the baseline is still under construction, so we're still building it and identifying the datasets we can add. For example, right now, we have only 40-plus datasets, for classification. That means we need additional data sets in the baseline for regression problems. It was a great baseline because we started and, dare I say, performed really, really badly on it. We were the best on one data set out of 40, and we were the worst on 17 datasets, like 50% of the baseline data sets, compared to the rest of the open source AutoML tools. That was a cold shower in some sense. We spent the next three months optimizing Driverless AI in different ways, making it more stable and more robust to different kinds of noise in the data. At this very moment, our tool is producing the best results on 23 of the data sets from this baseline and is the worst on just one data set out of 40, compared to the rest of the open source libraries.
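The "best on N, worst on M" summary quoted above can be computed from a per-dataset score table roughly as follows. This is a sketch with hypothetical names, and the three-dataset table contains made-up placeholder numbers purely to show the mechanics. They are not results from the baseline described in the talk.

```python
def tally_best_and_worst(results, tool):
    """Count datasets where `tool` has the top score and the bottom score.

    results maps dataset name -> {tool name: score}, higher is better.
    A tie for first (or last) place counts for every tool involved.
    """
    best = worst = 0
    for scores in results.values():
        if scores[tool] == max(scores.values()):
            best += 1
        if scores[tool] == min(scores.values()):
            worst += 1
    return best, worst


# Placeholder scores for illustration only, not real benchmark results.
results = {
    "dataset_1": {"tool_a": 0.90, "tool_b": 0.85, "tool_c": 0.80},
    "dataset_2": {"tool_a": 0.70, "tool_b": 0.75, "tool_c": 0.72},
    "dataset_3": {"tool_a": 0.81, "tool_b": 0.80, "tool_c": 0.83},
}
best_count, worst_count = tally_best_and_worst(results, "tool_a")
print(best_count, worst_count)
```

Re-running a tally like this after every change is one simple way to track progress against a fixed baseline.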
That's basically all I have for you today. Thank you for listening. I'm ready to take questions. I specifically made it a very short presentation, to give you more time before the next one. So, no questions? That's wonderful. In that case, thank you.