Arno Candel, H2O.ai - A Look Under the Hood of H2O Driverless AI - #H2OWorld
This session was recorded in NYC on October 22nd, 2019.
Driverless AI is H2O.ai’s latest flagship product for automatic machine learning for the enterprise. It fully automates some of the most challenging and productive tasks in data science, such as feature engineering, model tuning, model ensembling, model interpretation, report generation, and production deployment. Across industries and verticals, Driverless AI takes datasets and creates grand-master-level machine learning pipelines with minimal human input required. It also produces standalone scoring pipelines for Java, Python, R, and C++ for low-latency inference in production without any approximations. Driverless AI is designed to avoid common mistakes such as under- or overfitting, data leakage, or improper model validation, which are some of the hardest challenges in data science.
With bring your own recipe (BYOR), domain experts and advanced data scientists can write their own recipes in Python and seamlessly extend the Driverless AI platform with their favorite tools from the rich ecosystem of open source data science and machine learning libraries. Other industry-leading capabilities include automatic data visualization, machine learning interpretability, automatic report generation, and enterprise features such as security, authentication, data connectors, and model management.
Arno Candel is the Chief Technology Officer at H2O.ai. He is the main committer of H2O-3 and Driverless AI and has been designing and implementing high-performance machine-learning algorithms since 2012. Previously, he spent a decade in supercomputing at ETH and SLAC and collaborated with CERN on next-generation particle accelerators.
Arno holds a PhD and Masters summa cum laude in Physics from ETH Zurich, Switzerland. He was named “2014 Big Data All-Star” by Fortune Magazine and featured by ETH GLOBE in 2015. Follow him on Twitter: @ArnoCandel.
Read the Full Transcript
Thank you. Today, we’re going to talk about Driverless AI, and David [Biding] will give you an awesome hands-on experience, where you will actually run experiments in Driverless AI in just a little bit, and I’ll give you a brief introduction. What is Driverless AI? What makes it different from other AutoML solutions, and what is the key takeaway for you? Before we do that, go to aquarium.h2o.ai and create an account if you don’t have one already. All you have to do is provide your email address, and then in that email you get a password that you can use to log in. So, aquarium.h2o.ai. It should be self-explanatory. If you have questions, please raise your hand and someone will help you.
Now, Driverless AI starts with a data set. You provide a data set and we do the rest, if you want. We fit a model on it, but the model doesn’t just get the columns of your data set. It gets transformed columns of the data set. That’s called feature engineering, and it’s done with the grandmaster-level precision of the best data scientists in the world. You saw the grandmasters earlier on stage. Those people have provided ideas, and what has been crowdsourced over the last decade, let’s say, of data science evolution has made it into this product. Any new idea can be put into the product as a recipe, and if somebody has a better idea than us, that recipe will win and we’ll put it into the product right away. So it’s a plug-and-play architecture, and there is a bunch of preset pieces in there that help you make better models.
And all you need to do is provide the data and say, “Go,” but you can configure it, of course, and you can see how it evolves over time. You can see what’s winning, what’s losing, is Random Forest better than XGBoost, what’s going on? So the point really is that it will help you make models faster and avoid mistakes like overfitting or leakage. And it will basically do the best it can to fit on a given data set, and not just with the raw features: it will do binning, target encoding, weight of evidence, all the techniques that are known in the financial industries, but also other techniques like clustering and dimensionality reduction. You can do autoencoders inside, you can do NLP models like TensorFlow or BERT, all these new advances. You can take a text blob and turn it into a vector, all automatically.
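To make one of those techniques concrete, here is a minimal sketch of out-of-fold target encoding in plain Python. It is an illustration only, not the Driverless AI implementation: the function name, the round-robin fold assignment, and the `prior_weight` smoothing parameter are all assumptions made for this example.

```python
from collections import defaultdict

def out_of_fold_target_encode(categories, targets, n_folds=3, prior_weight=1.0):
    """Replace each category with a smoothed mean of the target, computed
    only from rows in the other folds, to limit target leakage."""
    n = len(categories)
    global_mean = sum(targets) / n
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        # Accumulate target statistics from rows outside the current fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        # Encode rows inside the current fold; unseen categories fall back
        # to the global mean because their count is zero.
        for i in range(n):
            if i % n_folds == fold:
                c = categories[i]
                encoded[i] = (sums[c] + prior_weight * global_mean) / (counts[c] + prior_weight)
    return encoded

cats = ["a", "b", "a", "b", "a", "b"]
ys = [1, 0, 1, 0, 1, 1]
enc = out_of_fold_target_encode(cats, ys)
```

Because each row is encoded from statistics that exclude its own fold, the encoded value never contains that row's own outcome, which is the leakage the talk warns about.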
So this is done by Driverless AI on GPUs or CPUs, for time series or stationary problems, for both cases, all out of the box. And what comes out at the end is not just a model that’s in the GUI; the thing that comes out is a production pipeline. The production pipeline is standalone code, Java or C++, with connectors from R or Python that are standalone. No pip installs, no nothing. Just one file, a Protobuf. The runtime loads this Protobuf in either Java or C++ and just scores. So it’s low-latency scoring and it includes the entire pipeline. Thousands of transformations, clustering, target encoding, binning, lookup tables, all that stuff is in Java or C++, and all the models, even ensembles of models, are in this one file.
So that’s the beauty of Driverless AI. It takes away all the burden, let’s say, of fitting, wondering what depth of the tree to use, or is GLM better than XGBoost or not, what’s going on? Do I need to do [inaudible] encoding by hand? Do I need to do transformations of the categorical variables by hand? What do I do with text? All these worries are gone. If you want to worry about those things, you can still make your custom recipes and say, “I know better how to turn text to vectors,” and give us those recipes. But by default, you don’t need to worry about any of that. You just worry about the data going in, and maybe how you want to provide a [inaudible] column or a fold column to create the custom splits that are then used for fitting. So you can configure the system a little bit, but you shouldn’t have to worry too much. Obviously it’s usable in every industry where there is a need for machine learning, and we have lots of customers showing that that’s possible in pretty much all aspects of life.
This is the roadmap. All of it has been done. The last two things are about to be done. We’re talking about operationalizing the models in the way that Tom and Olivia were showing earlier on stage, basically with the whole life cycle management: whether you should do the A/B testing, the [inaudible] challenge and all that stuff, which model do you need to replace? That’s coming. Then also the multi-node, multi-user part. So if you have a hundred people and they all want to share one data set and one experiment, you can do that in the next version.
Right now it’s more like a single user experience where you can still have several people log into the same box, but by definition, it’s this one box that runs this one data set, this one problem. And you can have hundreds of data sets and hundreds of problems, but when you run them all at the same time, there’s going to be some collision of resources. It will still work. The GPUs, for example, are not stepped on by two people at the same time, we’re smart about that, but still the CPUs and the memory might suffer a little bit. So right now it’s mostly for one or two or three people at a time, but later it will be totally scalable to a cluster of machines.
And needless to say, we have connectors to every possible data source, and we place very well on Kaggle out of the box. On most problems we should do quite well, at times even compared to months of human effort. The data connectors are here. All the clouds, all the databases, flat files, KDB, MinIO, you name it. You can do queries to Google, big data backends. You can even do your own custom recipe. In the next version, which comes out in about a week or so, you can write Python code that creates artificial data from random numbers, or you can take a data set from the internet, join it with another data set, group it, aggregate it, slice it and return five splits, all in Python. Anything’s possible. Your complete Kaggle solution in a box, if you want. This is going to be really fun.
This is what it looks like when you do one of these custom data transformations. In the GUI, once you have imported the data set, you can now change the data set and add new columns. You can do arbitrary Python expressions: pandas, NumPy, datatable, it doesn’t matter. You can make REST calls to Google Cloud, anything in Python, and you have a live preview of what the data looks like after that transformation. String splits, augmentation, you name it.
Still, as always, there is automatic visualization by Leland Wilkinson, the guy who wrote the book The Grammar of Graphics, which is behind [inaudible 00:00:06:59], ggplot, all these libraries. [Bocquet 00:07:04], that’s his invention, if you want, The Grammar of Graphics, and that’s built into Driverless AI as well. So you get an idea of what’s going on with these different data sets that you’re putting in. The target encoding, the grouping, on this side here, the left side, that’s for the traditional stationary data, where we do all kinds of massaging on the data. For time series, on the right side, we tend to use more lags, which are causality-preserving: looking back, not looking into the future. So you can’t just take everybody’s mean of the outcome if you are only allowed to look back. You have to be a little bit smart about the splits, and all that is of course done.
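As a rough illustration of causality-preserving lags, the sketch below builds lag columns per group using only earlier time steps. The row layout `(group, time, value)` and the function name `lag_features` are assumptions for this example, not part of Driverless AI.

```python
from collections import defaultdict

def lag_features(rows, lags):
    """rows: iterable of (group, time, value). For each row, emit the value
    observed `k` steps earlier in the same group, or None if there is none."""
    by_group = defaultdict(list)
    for group, t, value in rows:
        by_group[group].append((t, value))
    out = []
    for group, series in by_group.items():
        series.sort()  # order each group's series by time
        for i, (t, value) in enumerate(series):
            # Only indices below i are used, so no future value leaks in.
            feats = {f"lag_{k}": (series[i - k][1] if i - k >= 0 else None) for k in lags}
            out.append({"group": group, "time": t, "target": value, **feats})
    return out

rows = [("s1", 1, 10), ("s1", 2, 12), ("s1", 3, 15), ("s2", 1, 7), ("s2", 2, 9)]
table = lag_features(rows, lags=[1, 2])
```

Each group gets its own history, so a lag for one store never reads from another store's series.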
Driverless AI doesn’t just tune models, it tunes pipelines. And the pipeline is an arbitrary set of columnar transformations, in columns [inaudible 0:00:07:55] columns out. So you can take 12 columns and say, “I want to compute their difference, [inaudible] difference.” And you end up with thousands of output columns or whatever, depending on how many other interactions you want. You can do that as a custom recipe, for example, but we do many things: two columns go in, five come out. One goes in, one goes out. It’s like a spaghetti factory. You put more in, something comes out, and sometimes nothing comes out. It says, “I didn’t find anything.” The transformers really are doing whatever they need to do to take data and spit out more useful data. It’s always the same number of rows, but the transformation can be anything. You can do clustering, binning, whatever it is.
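The “columns in, columns out” idea can be sketched as follows. This is a hypothetical transformer written for illustration; the class name and its `transform` method are assumptions and do not reproduce the Driverless AI recipe interface.

```python
from itertools import combinations

class PairwiseDifferenceTransformer:
    """Hypothetical transformer: k numeric columns in, k*(k-1)/2 columns out."""

    def transform(self, columns):
        # `columns` maps a column name to a list of numbers; all lists share one length.
        out = {}
        for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
            out[f"{name_a}_minus_{name_b}"] = [a - b for a, b in zip(col_a, col_b)]
        return out

frame = {"price": [10.0, 12.0, 9.0], "cost": [7.0, 8.0, 8.5], "tax": [1.0, 1.2, 0.9]}
engineered = PairwiseDifferenceTransformer().transform(frame)
```

Three columns go in and three difference columns come out, with the row count unchanged, which matches the constraint described in the talk.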
And all these transformations then get fed into a model as input data. So the XGBoosts, the TensorFlows of the world, they’re going to see that new data, the new view of the data that is cleaned up: missing values removed, binning, no more noise, autoencoder, smoothing, whatever it is. It’s all massaged to be good for data science, and that new view of the data is presented to the algorithm. The algorithm will say, “Okay, I’m an XGBoost of depth five.” And then later it will say, “No, try seven,” or “Try nine and see if it’s better,” and whatever wins in the end, in this evolutionary strategy, that’s what we will give to you. We might even do some blending and stacking at the end and have multiple models at once.
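A toy version of that “try depth five, then seven, keep whatever wins” loop might look like the sketch below. It is a simple mutate-and-keep-the-best search, not the actual Driverless AI evolutionary algorithm, and `toy_score` is a made-up stand-in for a validation score.

```python
import random

def evolve(score, start, mutate, generations=20, seed=0):
    """Mutate the current best candidate and keep the mutation only if it scores higher."""
    rng = random.Random(seed)
    best, best_score = start, score(start)
    for _ in range(generations):
        candidate = mutate(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

def toy_score(params):
    # Assumed stand-in for a validation score; it peaks at max_depth == 7.
    return -abs(params["max_depth"] - 7)

def mutate_depth(params, rng):
    # Nudge the tree depth up or down, never below 1.
    depth = max(1, params["max_depth"] + rng.choice([-2, -1, 1, 2]))
    return {**params, "max_depth": depth}

best, best_score = evolve(toy_score, {"max_depth": 5}, mutate_depth, generations=50)
```

In a real setting the score would come from fitting the whole pipeline on validation splits, and the mutation would also change which transformers are used, not only a single hyperparameter.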
And we do that in parallel over many GPUs, many iterations, hundreds and hundreds of models, thousands of features, and the best overall pipeline wins. That best overall pipeline comes out at the end as a standalone deployment artifact that you can take outside of Driverless AI to any Linux box, R, Python, Java, or any Windows laptop. Java will just work and score that model. Low latency, REST APIs, cloud APIs, everywhere. So it really is made easy for you: if you’re a data scientist or a data engineer or the IT person, this should solve a lot of problems.
We do deep learning and statistical learning, everything under the sun. We have full control for expert settings. So if you want to configure something, you say, “I don’t like TensorFlow, turn that off,” or, “I want to only use TensorFlow,” you can do that. You can specify the amount of feature engineering. You can select every single transformer on or off for your custom recipes as well. They’ll show much more of that in the coming presentations today.
For time series, as I mentioned, we automatically figure out the gap between training and testing. We figure out the splits, we figure out everything so that the validation scheme is good. I still want to implement a custom validation scheme so that you can say, “I want you to do these five splits: this month training, this month testing; then two months training, a gap, two months testing; three months training, one month gap, five weeks testing.” Whatever you want. Those you should be able to provide to us, and we take those splits. Right now we say, “We are going to figure it out automatically.” So that’s going to come in the future: we automatically present to you what we want to do, and then you can customize that, or you can provide Python code to tell us the exact splits.
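A train/gap/test scheme of the kind described here can be written down in a few lines. The helper below is a sketch under assumed names and parameters (`rolling_splits`, `train_len`, `gap`, `test_len`, `step`); it is not an existing Driverless AI function.

```python
def rolling_splits(n_periods, train_len, gap, test_len, step):
    """Return (train_indices, test_indices) pairs over periods 0..n_periods-1,
    leaving `gap` unused periods between the end of training and the start of testing."""
    splits = []
    start = 0
    while start + train_len + gap + test_len <= n_periods:
        train = list(range(start, start + train_len))
        test_start = start + train_len + gap
        test = list(range(test_start, test_start + test_len))
        splits.append((train, test))
        start += step
    return splits

# Twelve months of data: four months training, one month gap, two months testing.
splits = rolling_splits(n_periods=12, train_len=4, gap=1, test_len=2, step=2)
```

Every test window starts strictly after its training window plus the gap, so validation never looks into the future relative to training.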
For NLP, we have a lot more coming later. [inaudible 00:11:17], one of the grandmasters and experts in NLP, will present this slide to you. So we do convolutional neural nets, BERT, PyTorch, TensorFlow, character-level encodings, word-level encodings, all languages in the world, from scratch or with pre-trained embeddings. Everything is possible and shipped out of the box. Interpretation: Patrick Hall has been talking about it, and he’s going to present more about it soon. Interpretation is big, and we can do all kinds of stuff. We can do disparate impact analysis and see if somebody is discriminated against. We can see what happens if this was a population that had six times more cars: would we think they’re more fraudulent or not? What happens if you have NLP? Which words mattered for this positive outcome? Is it because they used the word good or bad in it, or is it something else maybe that triggered it?
There is a project workspace that we use for leaderboards. So you can build 12 models, have them on one page and then sort by accuracy. You can also compare these models and have a side-by-side visual presentation of those charts, for example, to see which one is the best. So this is typically what we would recommend. If you have no clue about your data set, what’s going on, you build a bunch of easy models, just linear models, just Random Forest, and then maybe some XGBoosts, then some TensorFlows, and then compare all of them.
And of course, each one itself is fully deployable with automatic report generation, with the full pipeline productionized on the full data set. We have Python as our client API. So if you say, “I don’t want to click, I want this to be a batch job overnight,” you can customize the code to just run in Python. You provide a five liner, say, “Load this data set, run this experiment, download the production pipeline, upload it to S3.” That’s a 10 liner script in Python, you run it every so often. The same from R.
And then when the model is done, you get these low-latency scoring runtimes, either Java, Python or R. And we have a command line version for the C++ one, which basically tells you, “Give me a data set, score it with this model, done.” And the model is then just a Protobuf state file, a binary dump of all the decisions that need to be made from the raw data that you gave us all the way to the probabilities at the end, or the regression. We also do multi-class by domain. So you can have 12 different output classes and all of those will have a probability. No dependencies; it can be embedded anywhere, especially the Java one.
The scoring pipeline can be visualized so you can see what it looks like and what the pieces inside are. Imagine it’s like a graph; TensorFlow has come up with all these graphs recently. Every one of our pipelines is such a big graph, usually a hundred times more complex than this one, with many more transformations in parallel and many more models blended together. But if you want the compliance model, it’s easy: you’ll have just one [inaudible] encoding and a linear model, nothing else. So you can have it as hard as you want or as easy as you want for the machine learning process. And in the end, it can be deployed with the push of a button if you like that. You can push directly to Amazon Lambda or deploy inside Driverless AI as a REST server. So the instance that runs experiments can also be your server.
For prototypes, you can make a little Shiny app and just say, “Here, curl, send this request over,” and back comes the answer. Q will do this for you on steroids. What you saw earlier, this app-building AI magic land where everything’s possible, Q will take this to the next level. But until we have Q out, you will just get a REST endpoint that you can score your data on. The open source recipes are going to be covered in the next session at three o’clock; they’ll do a hands-on on all these recipes. You’ll build your own scorers, transformers, models and even make [inaudible] data.
Of course, anything in Python land goes. We have a lot of recipes; you can see all the models here on the left. If you say, “Oh, H2O doesn’t have SVMs,” wrong. We do have SVMs. Everything’s possible. There’s literally no end in sight for what’s possible. The question is just, what’s better? So you’d have to compare for yourself: “Oh, TensorFlow or not, TF-IDF only, maybe a sentiment package downloaded.” There are sentiment packages on the internet; you pip install something and immediately have a sentiment model for text. You can just plug it in as a recipe and see if it helps. In this case, actually, it did help. It’s better than the TensorFlow one and better than the statistical one on the left side.
You can do time series, Prophet, ARIMA. Facebook’s Prophet is famous, and you can use it. And we automatically do this time series per group. So if you have stores and departments, each one has its own time series. All of it will be done at once, in one shot; for every group, we fit a different model. And that’s for your custom recipes too. And we have templates for those custom recipes. We’ll cover that more in later stages.
Of course, if you have a custom recipe like Prophet, it’s a magic black box. [inaudible] will say, “This is how much it mattered,” but there’s only one column coming out of this magic black box, which is the prediction. And we have to say, “Well, that’s the only thing that matters.” It’s not going to be as explainable, because it made a number and that was a good number. Our version with lags, saying the number from 52 weeks ago mattered, is more intuitive and more explainable. So that’s why we chose to do lags: you can actually see which number from the past has impacted the predictions for today, versus just some black box that made the number.
And as I mentioned earlier, every single experiment has a full automatic report generated for you, about 30 pages right now: distributions, leaderboards, tuning tables, assumptions that were made, feature engineering that was used, models that were tried. Everything is in there. Partial dependence plots, too. You can see the support for each of those: the bottom bars show you where the data was. So nobody had a negative payment amount or something. And then there’s the MLOps that’s coming up next.
So I think that roughly covered the introduction, and here is a quick question for you. Which of those statements is true? What can it solve? All of them, yes. It can do fraud, churn, pricing, forecasting, time series, NLP. Video is a good one. Who wants to do a video recipe? We actually haven’t done that yet, but in theory you just have a path to a video file as the feature, and then the custom model will say, “Okay, let me load it. Let me cut up the video into a sequence of images. Let me run some transformer on it as a batch job, let’s say [inaudible] or PyTorch. And then let me fit something.” You have to have an outcome for each file. So this movie says, “I have a cat inside, yes.” This movie says, “I don’t have a cat inside,” for example. And then you can have video classification: is there a cat inside?
And I live in Los Gatos, which means “The Cats”. There are literally mountain lions in our backyard. So I’m literally interested in doing a video [inaudible] that tells me, “Was there a mountain lion outside at night?” We have a security camera, but I don’t think it tells me there’s a mountain lion. It will just say something moved, but it does that every day, so it’s kind of annoying to get all these alerts. I want to know, is it actually a mountain lion? So now all we have to do is make data that has labels. We need to go to the internet, find a bunch of mountain lion movies, then label them and all that. So it’s still a lot of work, and that’s basically the only work that’s left for data scientists: to deal with the data and think about what data they want the thing to fit. And we’ll take care of the fitting part and of the deployment part and of the [inaudible] part, and hopefully also of the teaching part: what worked and what didn’t work.
So with Q, we’ll have much more interactivity. We will say, “Look, this is the distribution before we did [inaudible 00:19:28], feature engineering. And here’s the distribution after feature engineering.” Between train and test, the data must look much closer in the engineered space, because our goal is to remove drift, for example, so that the model is more generalizable between training and testing. Even though the train and test distributions were different before we transformed the data, now they’re the same, because we did some groupings and we did the averages and so on, and we smoothed the landscape so that the model is more relevant to you. This is the kind of insight we want to bring out. So it’s clear to us that a black box is not going to do the job. We want you to be part of this, we want your open source recipes to be part of this, and open source recipes from anybody in the world should come into the platform to make it a better experience for everybody. So thanks for your attention, and let’s get David Biding up on stage so we can start with the hands-on tutorial. Thank you.
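One way to put a number on “train and test look closer in the engineered space” is a two-sample Kolmogorov-Smirnov statistic, sketched below in plain Python. The choice of this statistic, the made-up feature values, and the first-difference transformation are all assumptions for illustration; the talk does not say how Driverless AI or Q measure drift.

```python
def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b (0 = identical, 1 = disjoint)."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    for v in sorted(set(a) | set(b)):
        # Advance each pointer past all values <= v to get the empirical CDF at v.
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

def first_differences(xs):
    # Illustrative engineered feature: change from the previous value.
    return [later - earlier for earlier, later in zip(xs, xs[1:])]

raw_train, raw_test = [10, 11, 12, 13], [20, 21, 22, 23]
drift_raw = ks_statistic(raw_train, raw_test)
drift_engineered = ks_statistic(first_differences(raw_train), first_differences(raw_test))
```

In this toy case the raw levels do not overlap at all between train and test, while the engineered differences are identically distributed, which is the kind of before-and-after comparison described above.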