Ask Me Anything with Arno Candel: The
Future of Automatic Machine Learning
and H2O Driverless AI
This video was recorded on May 5, 2020.
H2O Driverless AI employs the techniques of expert data scientists in an easy-to-use application that helps scale your data science efforts. Driverless AI empowers data scientists to work on projects faster, using automation and state-of-the-art computing power from GPUs to accomplish in minutes tasks that used to take months. With the latest release of Driverless AI now available for download, Arno Candel, Chief Technology Officer at H2O.ai, was live to answer any and all questions.
In this video, we answer live audience questions and share what's available in Driverless AI today and what's to come in future releases.
Patrick Moran:
Okay. Let's go ahead and get started. Hello and welcome, everyone. Thank you for joining us today for our Ask Me Anything session with Arno Candel on the future of automatic machine learning and H2O Driverless AI. My name is Patrick Moran. I'm on the marketing team here at H2O.ai, and I'd love to start off by introducing our speaker. Arno Candel is the Chief Technology Officer at H2O.ai. He is the main committer of H2O-3 and Driverless AI and has been designing and implementing high-performance machine learning algorithms since 2012. Previously, he spent a decade in supercomputing at ETH and SLAC and collaborated with CERN on next-generation particle accelerators.
Arno holds a PhD and a Master's degree summa cum laude in physics from ETH Zurich, Switzerland. He was named the 2014 Big Data All-Star by Fortune Magazine and was featured by ETH Globe in 2015. You can follow him on Twitter @ArnoCandel. Now, before I hand it over to Arno, I'd like to go over the following housekeeping items. We'll be using Slido.com to display your questions throughout the Ask Me Anything session. Please go to Slido.com and enter the event code AskArno. That's one word: A-S-K-A-R-N-O. Once again, AskArno, A-S-K-A-R-N-O. We'll be displaying all of your questions on Slido, and you are encouraged to upvote the questions you'd like to hear answered first. The most popular questions will rise to the top of the display board, as you can see on the screen right now. Once again, Slido.com, AskArno. This session is being recorded, and we will provide you with a link to a copy of the session recording after the presentation is over. Now, without further ado, I'd like to hand it over to Arno.
Arno Candel:
Thanks, Patrick. Thanks, everyone, for joining today. I'm very blessed to be speaking for this great team at H2O. We have a tremendous team of people, not just the grandmasters that are so popular, but also the software engineers. The entire team is really working together. The sales and business units, they're all coming together. The products that we've made in the past have been really novel and innovative, and Driverless AI is continuing this tradition: it's something new that you haven't seen before. And in this demonstration today, if I get to do that, that would be my privilege, but I really would love to answer some of the questions that you have. And of course, the times are challenging these days. What can we do about that? What's the role of data scientists? What is machine learning going to do? And so on.
Driverless AI is addressing some of those issues already in its latest incarnations, and I will show you what's new in Driverless AI to start with, and I'll give a very quick introduction to Driverless AI so that everybody is on the same page. Let me share my slides here. As you know, Driverless AI is a platform for building machine learning models. That's the gist of it. But the machine learning models aren't just any machine learning models. They're actually rich pipelines full of feature engineering, and the feature engineering is the magic sauce inside this platform: it takes the data and changes the shape of it slightly so that the juice can be extracted in a better way. The signal-to-noise ratio is very high. We can get the best models in the world, and not just for tabular data sets, but also for text, or time series, or now images.
They're all things basically meant to help data scientists get to production faster. What does "production" mean? It means you get a model out at the end that's usable, that's good, that's validated, that's documented, that has a production pipeline that is ready for production. And the production pipeline has to run in Java; not in Python, necessarily. It has to run maybe in C++ somewhere, and all of this is automatically generated by the platform, and it can train fast on GPUs and so on. So, it's well designed to be the best friend of a data scientist. In the last two or three years, we've been coming up with a lot of features, things driven by our customer support and by all of us; we've been looking at this whole thing as visionaries. We wanted to make sure that the product can do stuff that nobody would think was possible two or three years ago, and now it's reality. It's just one more milestone along this roadmap.
We started with the grandmaster recipes, and then it became GPUs, and then it became time series. It became machine learning interpretability, automatic documentation, automatic visualization, connectors to all the clouds and all the data sources. It became suddenly deployable in Java and C++ to take the whole pipeline to production as it is, with all the feature engineering. This was a miracle, almost. And then we also shipped it to all the clouds and on-prem, anywhere, so it's really just a software package. We don't care where it runs. Also, it has Python and R client APIs. Now, last year we introduced Bring Your Own Recipe, which means you can make your own custom recipes inside of Driverless. You can make your own algorithm, your own data, your own feature engineering, your own scorers, KPIs, and your own data pre-processing, and all of that is Python code.
You can really customize anything. You can add your own PyTorch-Transformers if you want. You can add anything. You can download, join, group by, augment. You can make your own marketing campaign cost scorers. Whatever you can think of in Python land, you can do in Driverless. That's the rule. And that's pretty amazing. I would like to say that in the next generation, version 2.0, which will also be long-term support just like 1.8 is right now, we'll have another milestone, which is multi-node, multi-tenancy, multi-user, model import/export and all that. Once you have the models in the model deployment platform, you can manage them. You can say, "Hey, is this model better than the previous model?" A/B testing or challenger models, models that are drifting, automatically retrained, alerts raised, different kinds of alerts. You can even do custom recipes for those alerts.
We want to empower the data scientists to infiltrate the whole organization. Back to data improvements: we'll tell the data scientists through Driverless AI what to improve in the data, and once they know what to improve, they get better data from the upstream data engineering, and then they can make insights with Driverless, and then pass on those insights downstream to the business units that consume those applications. We want to make sure that the data scientists are not just given a data set that they play with, but that they actually can impact the business outcome by going both to the business side and to the data engineering side to ask for improved data. Also, the next generation platform will have custom problem types, so it will have image segmentation, for example. Multi-task, multi-target learning. Whatever you want to do, you can do in the Driverless platform. That's our vision for the future.
This is just a short list of the new features we introduced in our current generally available offering, 1.8: the custom recipes, the project workspaces, a bunch of smaller, simpler models, and also more diagnostics, like more validation scores and so on. I'll show some of those things later, but as you can imagine, we connect to pretty much any data source in the cloud or on-prem. Hadoop, Oracle databases, Snowflake, you name it, we have it. Custom data recipes allow you to quickly prototype your pipelines, so you can make up new features, or compute aggregation sets, or convert transactional data to i.i.d. data, whatever you want to do, as a recipe, as a Python script. You can look at the data quickly. In one click you get a visualization of the relevant information that matters. The feature engineering is our strength, if you want. We make features that are very difficult to engineer and hard to come up with if you're just starting out, let's say, and they can carry a lot of information, such as interactions between numeric and categorical columns, and so on.
And then there is an evolution of the pipeline where not just the features and the models are selected, but all the parameters are tuned, both for the feature engineering and for the pipeline parts that are models. We have millions and millions of parameters that are being tuned and refined, and in the end the strongest will survive and that will be the final production pipeline. We will make sure that the numbers that we report are valid on future data sets if they're from the same distribution. So, holdout scores, validation scores that are trustworthy. And of course there is a lot going on: statistics, deep learning, and not only deep learning, also gradient boosting. All kinds of techniques to reduce dimensionality, to do aggregations of certain statistics across groups. For example, for every city, every zip code, we get some mean number, but only out of fold, so it's all a fair estimate of the behavior of populations and so on, and all that is done automatically for you.
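For illustration, here is a minimal sketch of that out-of-fold aggregation idea, written with pandas and scikit-learn rather than Driverless AI's internal code; the column names (zip_code, target) are made up:

```python
import pandas as pd
from sklearn.model_selection import KFold

def out_of_fold_mean_encode(df, group_col, target_col, n_splits=5, seed=42):
    """Encode a high-cardinality column by the target mean computed only on the
    other folds, so no row ever sees its own label (a fair, leak-free estimate)."""
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target_col].mean()
    for fit_idx, apply_idx in KFold(n_splits, shuffle=True, random_state=seed).split(df):
        fold_means = df.iloc[fit_idx].groupby(group_col)[target_col].mean()
        encoded.iloc[apply_idx] = (
            df.iloc[apply_idx][group_col].map(fold_means).fillna(global_mean).values
        )
    return encoded

# Hypothetical usage: df has a "zip_code" column and a numeric "target" label.
# df["zip_code_te"] = out_of_fold_mean_encode(df, "zip_code", "target")
```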
If you're an expert, you can control it all with these expert settings. There are several pages of these where you can control everything. For time series, there is, of course, causality that you have to worry about. You can't know about COVID until it's happened; you couldn't have had knowledge about COVID last November or so. This is exactly why these validation splits are so important, especially for time series. But sometimes you also have to do validation splits for non-time-series problems that are specific, like with a fold column, only split by city or something, instead of just shuffling the data. Driverless can help with that as well. For natural language processing, there are many different approaches. With the custom recipes you can do anything you want, but in 1.8 you have TensorFlow convolutional and recurrent models, in addition to the linear models and the text features that are statistical, and in version 1.9 you'll have BERT models, PyTorch BERT models, which can improve the accuracy even further.
Machine learning interpretability has been a key strength of Driverless AI for the last few years, and we have further enriched it with disparate impact analysis, where you can see how fair a model is across different pieces of the population and different metrics, so you can see: is it singling out somebody or some specific feature of the data set? It doesn't have to be just people; it can also be just in general. Is your model somehow biased? You want to know that. And then there is a stress-testing capability, a "what if?" analysis, a sensitivity analysis, where you can say: if everybody was a little bit older in this zip code, what would happen? Or globally. Or if everybody had more money, what would the model have done? It's basically an exploration of the meaning of the model's behavior. There is a new project workspace where you can compare models that are organized by you, so you can say, "Here's the data set and here are my models for this data set," and then you can sort them by, let's say, validation score, or by the time it took to train them, and so on. And you can then look at them visually and see the models, and so on. It's a way to quickly compartmentalize your experiments.
This whole block will then later be exportable in 1.9, so you can share this whole experiment project page with other people. You can share entire experiments based on their project membership. In 1.8, we also added a scores tab, which means in the GUI you can list statistics about the different models. For example, on the leaderboard you can see at a glance that the BERT model is still better than statistical models or better than TensorFlow models. You quickly see a leaderboard, [inaudible] feature evolution from the model tuning part, where you can quickly see what's happening right now. You can also see the final pipeline: how is it behaving, which model is contributing how much to the final pipeline, how many cross-validation folds did it have, and if it had cross-validation folds, what was the score in each of these folds? The out-of-fold predictions: how did they do on each fold? Is it stable across folds or is there some noise in the data? And it helps you to see all the metrics.
As I mentioned earlier, we have Python and R clients with which you can remote-control Driverless, so if you don't like the GUI you can do everything from the command line or from the programming APIs, and everything is automatically generated, so you will always have the latest version of all these APIs for every version that we ship. And then when the model is done, you can score it in either Java or R or Python or C++, which is what R and Python are using underneath, or you can, of course, call the GUI and say, "Please score this data set for me." The GUI can always score using the Python version. There are two different Python versions: one is using the C++ low-latency scoring path, and one is using the full pickled state that was saved during training time. So, there are a bunch of ways to score, and it's quite fascinating to see how people are using those in production.
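As a rough sketch of what that remote control looks like from Python: the snippet below follows my recollection of the driverlessai client package, so treat the exact method names, arguments, and connection details as assumptions that may differ between client versions.

```python
import driverlessai

# Connect to a running Driverless AI instance (address and credentials are placeholders).
dai = driverlessai.Client(address="http://localhost:12345",
                          username="user", password="password")

# Upload a dataset, launch an experiment, and pull predictions back, all remotely.
train = dai.datasets.create(data="train.csv", data_source="upload")
experiment = dai.experiments.create(train_dataset=train,
                                    target_column="churn",   # hypothetical column
                                    task="classification")
predictions = experiment.predict(train).to_pandas()
```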
For example, you can train a model with GPUs, XGBoost, and TensorFlow. Then you get a Java representation of the whole pipeline. You stick that into Spark and you score terabytes in parallel at once, and you have no loss of accuracy. This is what such a pipeline looks like, so you can see all the pieces in the pipeline from the left side: your original features coming in, getting transformed into datetime features or lags, and then aggregations between the lags, and then going into LightGBM to make a prediction. A decision tree, for example, would be visualized like this, so you can actually look at the split decisions and see what the outcome is for each case. So, there's a lot of transparency in our pipelines. Everything is stored as a protobuf, so it's a binary representation of these states, and you can inspect them and see what you get.
This is absolutely the full pipeline that we are visualizing, which is what gets productionized in the end. So, no shortcuts. Every piece that we ship out of the box has a deployment version of it for Java and C++. And if it's not yet in Java, then it will be soon, because it can call C++. TensorFlow and PyTorch, for example, work in C++, and once you have Java, you can make a JNI call. We're working on giving that to you out of the box as well. So, the rounding out of all these extra cool AI features in the production pipeline is happening right now. This is how it looks in Python: you load this protobuf state and you say, "Score," and you immediately get the response in Python, and there is no dependency on any pip installs of TensorFlow or whatever. It's all built into that protobuf state, and it's serialized into a blob of bytes, so you don't need to worry about any dependencies.
If you don't like the MOJO size, sometimes they can be three gigabytes or bigger for a normal 100-megabyte dataset, because this is a grandmaster-level, Kaggle-winning pipeline. In this case, it's 10th place out of the 2,000 or so in Kaggle. This is out of the box with one push of a button, right? But sometimes you say, "I don't want 15 models. I don't want 700 features. It's too complicated." There's a new button there that says, "Reduce the size of the MOJO," and you click it. The whole thing becomes 16 megabytes, and it's faster to train, faster to score. The accuracy is still pretty good. So, maybe you're set with that; see how it behaves.
But, as I mentioned, these are just some presets that we apply. You can also change those presets and make your own custom size of the overall pipeline. You can always reduce it to not be an ensemble, or not have more than so many features, or not have high-order interactions and so on in the feature engineering. So, everything is basically controllable by you. Once you're done with the whole modeling, you can deploy it to Amazon or to the built-in REST server or all these other places. You can go with Java and C++, as I mentioned before. These are just two demo things that we built into the product. And then the open source recipes, the custom recipes: you can do whatever you can do in Python. That's very powerful, right?
Anything that you can imagine. Your own target encoding, your own speech-to-text featurization, your own scorers based on some quantile of the population, or your own scorer that's fair across different adversely impacted populations, or your own data preparation where you take three data sets, join them, and split them exactly the right way, and then return all five at once in one script. All this is doable today. And you can make your custom models. You can bring CatBoost, for example. You can bring MXNet. You write your own algo, you bring it in. That's how easy it is. We have over 140, now I think 150, open recipes where you can see how it's done, and you can customize them or do whatever you want. It stays your IP, and you can have your insights be part of the platform.
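To give a flavor of what such a custom model recipe looks like, here is a trimmed-down sketch following the pattern of the public h2oai/driverlessai-recipes repository; the base-class attributes and helper methods are written from memory of that repository, so treat them as assumptions that may not match the current version exactly.

```python
from h2oaicore.models import CustomModel

class ExtraTreesSketchModel(CustomModel):
    """Minimal custom-model recipe sketch wrapping scikit-learn's ExtraTrees."""
    _regression = True
    _binary = True
    _multiclass = True
    _display_name = "ExtraTreesSketch"

    def set_default_params(self, accuracy=None, time_tolerance=None,
                           interpretability=None, **kwargs):
        self.params = dict(n_estimators=100, n_jobs=-1, random_state=1)

    def fit(self, X, y, sample_weight=None, eval_set=None,
            sample_weight_eval_set=None, **kwargs):
        from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor
        cls = ExtraTreesRegressor if self.num_classes == 1 else ExtraTreesClassifier
        model = cls(**self.params)
        X = X.to_pandas()  # Driverless hands the recipe a datatable Frame
        model.fit(X, y, sample_weight=sample_weight)
        self.set_model_properties(model=model, features=list(X.columns),
                                  importances=model.feature_importances_.tolist(),
                                  iterations=self.params["n_estimators"])

    def predict(self, X, **kwargs):
        model, _, _, _ = self.get_model_properties()
        X = X.to_pandas()
        return model.predict(X) if self.num_classes == 1 else model.predict_proba(X)
```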
This is just how it looks. You can bring all the H2O open source models, for example, into the platform. There's more: there's ExtraTrees from scikit-learn and so on. In the end, you can just compare them, and you see that BERT has a lower log loss, in this case 0.38, versus the built-in ones that only get to 0.5 if you enable TensorFlow. If you don't enable TensorFlow, then you only get a 0.6 log loss. You can see it's quite a milestone here that you can get to that lower loss for the sentiment analysis. This is a bunch of Tweets and you want to predict the sentiment. It's a three-class problem: positive, negative, or neutral.
You can imagine that having the knowledge of the language, like the BERT model does, means a lot, right? If you understand what someone is saying, by looking both forward and backward in the sentence using these smart neural nets, then of course it's better than just counting the number of "ands" and "ors" and "hellos" in a sentence. That's what it is. The same is true for time series. If you have a Facebook Prophet model or an ARIMA model that you like, you can bring it in as a recipe and then you'll have a different score than what we have out of the box, and sometimes it's better, sometimes it's worse, but the interpretability might suffer a little bit because you only have your black box that now makes predictions, and that's all you will see in the listing of the feature importances.
But in our case, you would at least still see that this lag of this feature matters this much to our model, right? Once you bring your own black box models, you need to think about how you can give the variable importance back, and if you can do that, then we will show your variable importance. If you make your own random forest, then we will show that, of course. The point here is that whatever you do, you still need to be in charge of it, and that's why we try to do as much as we can out of the box, but sometimes, like I said, there is a trade-off between interpretability and predictive accuracy, and you can see it here in the case of Prophet. In the end, it's not better; it is just one black box oracle model versus a bunch of lags going into XGBoost, for example.
That's the last thing that I want to mention from the introduction slide section: the automatic documentation. You get a 20- or 30-page Word document for every experiment, filled with distribution plots and leaderboards and the parameters that were used, and it shows you a lot about what's going on. I will say this is very useful, especially for regulated industries or people who want to keep track of what happened. Then we can dive into the sneak preview of what's coming next, but I'll keep that for a minute and I'll address maybe one question in the meantime. Let me look at the Slido page.
Yes. "What's the value of data scientists and machine learning specialists these days?" I would say more than ever. Imagine the time when there were no computers. Everybody said, "Oh, wow. These robots are going to take over my job. Are they going to replace people who type in stuff or who make decisions?" Yeah, of course they will, but you will still be working with them. Everybody now has a laptop or an iPhone in their face all day long, so you're going to work with those computers, and if you don't, someone else will. There is definitely a need to do that in our competitive world, and you will benefit from these tools, or platforms, whatever you call them, or you won't. If you're a good data scientist in 2030 and you don't use these tools, then you better have something else up your sleeve.
There is no way that people who want to be competitive don't use these tools. That's another way of seeing it. They will add value in many places. As I mentioned, you want to be involved with the data preparation step. For example, in Driverless, when you have an experiment, the next version will tell you that in Minnesota there wasn't enough information to make conclusive predictions, so get better data for Minnesota. That's all possible by the model itself, figuring out where it's strong and where it's weak. That can be something you bring back to your data engineering team and say, "Why don't we have good data in Minnesota?" Or you can bring it to your management team on the business side and say, "That's the reason we don't make money." And then they will say, "Well, maybe we shouldn't do business there," or something.
These are all ideas, roughly speaking, where you as a data scientist can be more involved in the overall business, and you want to be entrenched as a business strategist. You don't want to just be given a dataset and then you play with it for five weeks and then you say, "Oh, well, I have a good AUC." These days are over, I would say. It's more important to know what it means and to know what it exactly does right now, and if that assumption is no longer true, then you should change the overall pipeline and the place where it fits. Tools like ours will help you figure out what piece of the signal is actually strong and what piece of the signal is weak and what's driving what.
We're getting more into the causal space, let's say. No longer just, "Oh, yeah, I have so much AUC." AUC alone, or whatever it is, is not a good number. It's one number and it's just, "Okay, fine, I can sort by it, but then why, and what, and what happens if this doesn't hold? What happens if everybody gets different, the distribution changes? How robust is my model?" We will do the best we can to make all the models we build as robust as possible, for example. You will have a knob that tells you the robustness; how much of a robustness price are you willing to pay? And then there will be simulations that show you the trade-off.
The next question is, "What is my preferred stack?" I use PyCharm a lot for Python development, and we use just Linux servers. I use Vi plugins in PyCharm, and I usually have multiple servers with GPUs in them, and some without GPUs, and a lot of RAM. They all have NVMe solid state drives. They all have either i9 or Xeon chips. I will say a lot of memory is nice to have, but sometimes it's actually better to have not so much, so 64 GB is our low-end memory configuration, so to speak. Making everything work with less memory is also a good thing. You want to be efficient. I would say get two GPUs at least in your dev boxes. That way you can parallelize a lot of the model building, not just TensorFlow and PyTorch models; also XGBoost and so on, LightGBM. If you can train two of these models in parallel, then per time unit you get roughly five to ten times the throughput compared to a CPU only. So, we benefit from GPUs.
And NVMe drives, these are solid state drives, they do three gigabytes a second and they only cost you a few hundred bucks. Definitely a good investment, because you can read and write files super fast, especially since we're using datatable, which is the fastest CSV reader and writer. You can write 50 GB to a CSV in less than a minute. Yeah, I think that was a minute to do the join and the write, and the read actually, from CSV back to CSV. So, the whole thing was done in one minute for 50 GB, and that's all because it's limited by the I/O of the disk, so you want solid state drives.
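A tiny illustration of that datatable I/O (the file and column names here are made up, and the join assumes the orders file has a customer_id column):

```python
import datatable as dt
from datatable import join

orders = dt.fread("orders.csv")            # multi-threaded CSV reader
customers = dt.fread("customers.csv")
customers.key = "customer_id"              # key the lookup table so it can be joined
joined = orders[:, :, join(customers)]     # left join on the keyed column
joined.to_csv("orders_joined.csv")         # multi-threaded CSV writer
```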
Okay. What is the next question? Oh, yeah. Perfect. Exactly. That's a good one. There are sometimes problems where you have the same person show up multiple times. I remember the State Farm distracted driver challenge in Kaggle. They had a bunch of rows, and every row belonged to a person, but the same person showed up several times in the same data set. You would have one person being photographed holding the cell phone like this, and then maybe looking out the window, looking back in the car, looking down, holding the radio knob, and so on, and you had to tell what they are doing: which of the 10 things that distract drivers is this driver doing, or maybe they were driving normally. If you just fit a model on all these images, then of course it will remember that the person with the blue shirt is the one that is maybe distracted or something. It's not actually learning that it's the arm that matters; it's the shirt color that matters, because you already saw that somewhere else, and so on.
So, you don't want the same event photographed multiple times, like me holding my cell phone in five different photos, split across training and validation. If I see three of them in training, I can remember, "Oh, if there's a white arm here, then it means he's holding the phone," but it shouldn't have been remembering my arm color. It should've been remembering the arm posture or something. So, it's very important that the other two pictures not end up in the validation set; they should be either all in training or all in validation. For that, you would use the fold column. When you hit these cardinality limits, that usually tells you that it's something that's approaching an ID column. That can happen if you have too many uniques. It doesn't really tell you that much anymore because every row is different, so what's the point?
Now, if it's a categorical, that could be a problem; then it's just an ID. If it's a numeric column, like the number of molecules in your blood right now, every one of us here will have a different number, right? And then it's okay to be different. So, it's kind of tricky to handle all these cases, but we're trying to do our best job with the technical types and the data science types. One type is the data type stored in the file. Another type is the type of the data: how it should be used when you're actually going into modeling. Certain cardinality thresholds can be used to decide whether a feature is numeric or categorical. If it has too many uniques and it's a number, then it shouldn't be treated as a categorical anymore. Then it should be just a number, as I mentioned.
But I will say the fold column is something that's often overlooked. You can also make your own fold column. That is the column that tells you, "For this value, you're either left or right: training or validation." If that fold column is, for example, the month of the year, then you can say, "Train on January, February, and March, and predict on April." But then it will also train on, say, April, March, and January, and predict on February. If you can do that, maybe that is a good thing, depending on your data. That just means your model is more robust over time and doesn't really need to know everything about the past to predict the future. Maybe it's okay to do that sometimes, but you have to be an expert. That's why the fold column is a user-defined choice: which column should I use for that?
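A minimal sketch of the fold-column idea using scikit-learn's GroupKFold (the arrays here are synthetic; in Driverless you would instead point the fold column setting at the person or month column):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.random((100, 5))                   # hypothetical features
y = rng.integers(0, 2, size=100)           # hypothetical labels
driver_id = rng.integers(0, 20, size=100)  # the "fold column": one id per person

for train_idx, valid_idx in GroupKFold(n_splits=5).split(X, y, groups=driver_id):
    # No driver appears on both sides of the split, so the model cannot just
    # memorize a person's shirt color and still look good in validation.
    assert set(driver_id[train_idx]).isdisjoint(driver_id[valid_idx])
```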
Okay. Let's go to the next question. This is about explainable AI, how we can avoid having problems with it, and how we empower it. I would say that's a good point to go back to the slides. The sneak preview of 1.9, which is what we're working on right now: the team has been relentless at building new things in the last few months, not just COVID modeling as you might've seen in the news or in our blogs, but also making multi-node work, collaboration between people, model management, deployment, monitoring. This is called ModelOps. Then MLI has gotten a full, fresh face, and also custom recipes. I'll talk about that some more later. Then there's the leaderboard project that I mentioned earlier: with the press of one button you'll actually get a full leaderboard of relevant models.
In the future, in 1.9, if you have a dataset you can not just run an experiment, but start a leaderboard, and that will make you 10 or more models at once. Some models are simple models. Some models are complicated. Some have complicated features. Some have easy features. This trade-off between the feature engineering and the modeling goes right into this explainability, right? I can make a one-feature model that's very good, because the feature is itself a deep learning model using BERT and embeddings. And then you can say, "Oh, wow. I have a GLM model with one coefficient and I get the sentiment right every time." Well, yeah, because the model got the transformed data output, which is the feature engineering pipeline. It's not the model that mattered; it was the pipeline that mattered. We are free to change the data along the way before it goes to the GLM, right? That's called feature engineering.
One such feature could be the BERT embedding extractor, and once you do the BERT embedding extraction, suddenly your feature is much more useful than the raw text, and then suddenly your one-coefficient GLM is amazing. When you say you want an explainable model, does it mean you want a GLM, or does it mean you want no more than 10 kilobytes for your overall pipeline, or does it mean you want to know what's in the pipeline? These are all good questions. Another question could be: maybe you don't really have to explain the model. You just have to show that it's robust. Maybe in this day and age it's futile to try to explain a BERT model down to every last number, because it contains hundreds of megabytes of state and you're never going to explain it, so you might as well just look at how strong it is in terms of stability.
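Going back to the embedding-extractor example for a moment, here is a rough sketch of that "rich feature, simple model" pattern using the Hugging Face transformers library; this is not Driverless AI's internal recipe, and the texts and labels are made up.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(texts):
    """Mean-pooled last-hidden-state embeddings: this is the feature engineering."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (batch, tokens, dims)
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

texts = ["love this product", "worst purchase ever", "it is fine", "never again"]
labels = [1, 0, 1, 0]
clf = LogisticRegression().fit(embed(texts), labels)         # the "simple" model on rich features
print(clf.predict(embed(["really happy with it"])))
```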
What happens if I do adversarial attacks and robustness testing and show you that my model is super stable? Well, you'd be happy with that as an explainability replacement, if you want. Is it good enough to just do model debugging and show you that it's super robust? No matter what you attack it with, it's always giving you the best answer. And maybe it does need this extra complexity to be that good, right? If you have a simple model, maybe it's not that good at being strong under adversarial impact. If I attack you with slightly modified data, maybe the simple model will falter, but maybe my super complicated black box model will not. Maybe you're actually better off getting that model instead of the simple one.
It's a very fast-evolving space. The explainability is one aspect, but the definition of what is explainable is also not easy, right? Not everybody will say, "Oh, I just need to understand what happens." Well, then you can study model X and then you know what happens. Is that enough? "No, no. It has to be a little bit simpler. It has to be explainable." Well, does it need to have one explanation or can it have a million different explanations for the same outcome? Because often you have a lot of local minima that are all the same in terms of outcome, but they're all different. Each one is different. You can say, "Oh, this is what happened," and then the next guy will say, "Oh, this is what happened," and it's the same number that comes out, and neither one is right because it's just one of many possible explanations for the same outcome, and then it becomes useless, right?
Our goal is to find something that's unique and still simple, and if that's not possible, then it's probably better to be complex but actually robust. There might be a phase transition from "show me that the model is simple" to "show me that the model is stable." The next thing I would like to mention is the prediction intervals and, in general, residual analysis. We'll have another slide later, but imagine you knew what the model did. Okay, fine, we have a model that predicts. Now, imagine you knew where the model was weak. Every time you make a prediction for someone in Minnesota, it's bad: a much bigger error, as measured on the holdout predictions. Well then, you can try to say, "Hey, I need better data for Minnesota," or look into what's going on, right? That's what I mentioned earlier.
Maybe you can make a model that's better for those. Maybe you can use that as a metric to guide your optimization process: do not have any subgroup that's worse than the other subgroups by this much; make them all the same. And then your overall leaderboard that automatically gets filled could be scored by this robustness metric, and then you will have that as the guiding post, and not just the accuracy, for example. It could be accuracy divided by some other number that is based on the fairness or the stability, and that could be something very useful, and that's all doable today, because the scorers, as you can see at the very bottom, the metrics that compute the number, they're not just looking at actual versus predicted, like cat/dog/mouse and the three probabilities, and then you say, "What's my log loss?"
No. In the new version you can actually pass the whole dataset. You can say, "Oh, I'm from Minnesota and I have this and this condition, and hence I know exactly what I'm going to do with my scorer to look at the fairness." I can look at the subpopulation and so on and weigh it differently, and I could, in theory, come up with a scorer that looks at the holistic view and makes it fair. So, there are a lot of ideas, and we're adding these recipes in piece by piece. Other recipes to be added were the zero-inflated models, where you have a lot of zeroes, for insurance. They don't pay out most of the time; your claim amount is zero. But once you have a car accident, it's $6,000. So, the distribution starts at zero. Most people are at zero, so there's a peak there. And then some people have some distribution where they have their claim, right?
Now, the question is how do you make a model that predicts all these zeroes and then all these other bumps at the same time? That is a custom recipe that does better than any other model before, because it builds two models at once. One is saying, "Classify whether I am bigger than zero or not, and once I'm bigger than zero, then how big is my loss?" And so, I can have two models, one classifier and one regressor, working together, and then they will make a better predictive model overall. And that's just one of many custom recipes that we are shipping out of the box now, and it automatically detects the case where it needs to be applied.
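A minimal sketch of that zero-inflated (hurdle) idea, built from two scikit-learn models rather than the actual Driverless recipe; the data below is synthetic, just to show the shape of the approach.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

class ZeroInflatedRegressorSketch:
    """One classifier decides whether a claim happens at all,
    one regressor sizes the claim given that it happened."""
    def fit(self, X, y):
        nonzero = y > 0
        self.clf = GradientBoostingClassifier().fit(X, nonzero)
        self.reg = GradientBoostingRegressor().fit(X[nonzero], y[nonzero])
        return self

    def predict(self, X):
        # expected claim = P(claim > 0) * E[claim amount | claim > 0]
        return self.clf.predict_proba(X)[:, 1] * self.reg.predict(X)

# Synthetic mostly-zero target: roughly 80% of rows have no claim at all.
rng = np.random.default_rng(0)
X = rng.random((500, 4))
y = np.where(rng.random(500) < 0.8, 0.0, rng.lognormal(8, 1, 500))
model = ZeroInflatedRegressorSketch().fit(X, y)
```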
Then the BERT models, the image models, as I mentioned, those are not very explainable, but they're really good, right? Maybe that's not a bad thing. And then we also work on the epidemic models, and I'll show you that in a second. But as you can imagine, there's a lot of effort going into this explainability, because everybody wants to know what's happening. At the same time, everybody wants the highest accuracy. Usually, some of the people want both, some want only one, but there's a trend towards, "Give me the best and explain it," and that's, of course, not easy. We also have even more control over feature engineering. So, how do you train models in the absence of previous data? If you don't have data, how do you create a model? In the case of COVID right now, right?
What would you do if you had no COVID in the past? Well, you would look at other things. You would look at maybe uncorrelated data that's not the same old, same old. You would find maybe what are people saying in social media, what are people doing, how are they moving? Mobility, maybe. Suddenly, you want to know if a town is being inundated by people from New York that are running away from New York going to Minnesota, let's say. Suddenly, there's all these people that now want their special shampoo that they always bought in New York and you suddenly don't have it in Minnesota, and then every store runs out of that shampoo.
If you knew that, you could probably deliver that same kind of supply to Minnesota and then sell more of your goods, right? This whole supply chain optimization based on where people are moving. Suddenly, now you need to look at the data that says who's moving where, where in the past that was not relevant because you were in the same place all the time. So, think outside the box, basically, and see which data sources can be used to augment your experiences. And then I know that you still have no labels, so yes, it's a problem, but at least you have… Even in two weeks of data, let's say, you can still get some early signals, right? You can see, okay, we are sensing the demand, so we are building demand-sensing models with our grandmasters to foresee from these early green [inaudible] where people are saying, "It's okay. We're getting out of this." Suddenly, the demand will rise again. Can we smell that somehow?
You would not just look at the sales numbers over the last 12 months and 18 months and 24 months. Yes, that's right. You would have to look at new data, short-term data, social media, behavioral data. Maybe you need more of that satellite imaging data where people are looking at which trucks are on the road where, how much oil do they have, who's driving where, which parking lots are full or not, all this kind of modern approach that the hedge funds are taking with data science. That is going to be more important. So, you need to pay somebody to get that data and then to build models in the last two weeks and see if there's any signal. How do you know there's signal? Well, you need tools like Driverless to tell you which datasets are useful.
It's a good challenge, but it only shows that data scientists are more important than ever. You don't want to be just sifting through some dataset that you were given six months ago. It's much more interesting to actually deal with the reality of the world. Let me go back to what's new real quick to make sure that we cover that. You can export models in 1.9. You can export the model, and once you export it, it shows up over here in ModelOps, and then you have a model where it shows you when it was trained and where it is going in production. You can actually set here: deploy to production or to dev, and then you can look at the model as it scores, and you basically get an endpoint. You can query it. You can look at the distribution of incoming data, the distribution of predictions going out, the distribution of latency, the distribution of whatever you want.
You can make your own custom recipes here, and this is all Pythonic, so you'll be able to have alerts. Like, "If this happens, fall back to model number two. If this and this happens, fall back to a GLM. If this and this happens, call me no matter what time," and so on. All these things are what people are interested in. But another important aspect of this sharing that I mentioned is that if you export a model, someone else can import it, and you can tell who you want to be able to import it. I can make a model for my entire group and then everybody can get it, so we are no longer stuck on one instance, but we are able to share Driverless models with everybody else in the organization if we choose to. It's more fine-grained control of who does what.
And then the ModelOps storage place here where all the models are stored, that's kind of the central place where these models will sit. Then the MLI UI, as I mentioned, got improved, so you'll have more of these real-time popups. Every time something is done you can look at it. You don't have to wait until it's all done. You also can do custom recipes, so you can build your own MLI pieces where you can say, for example, "I want to build this and this Python recipe. Run me these models and then spit out these files." And then you can grab those files out of the GUI, basically. We are still figuring out the best way for you to completely customize it all, but there is basically a precondition and a postcondition, if you want: you give us something, and we give you something, and that's the contract of that recipe, so you can, in the end, customize a lot of the stuff to your needs.
This is the automatic project, as I mentioned. When you push that create leaderboard button, you get all these models at once. And as you can see, two of them are completed, two are still running right now, and the rest are in the queue. There's going to be a queueing system, a scheduling system, that will make sure that the machine isn't overloaded when you submit 12 jobs at once. And if you had a multi-node installation of Driverless, then you would see more than two models running at the same time. You would run eight at the same time on a four-node system. Basically, you get more throughput, and you get control by not having them all run at once and run out of memory, as is the case today.
Today, you would have to be more careful. Later, you can just submit it and then come back when it's done. This is the residual analysis I mentioned. Also in 1.9, you can have these prediction intervals enabled, and it will look at the distribution of the errors on holdout predictions, and then based on those errors it will say, "You're in Minnesota. The error is pretty high. Here is your error band," given the 90% confidence band. It does this in a heuristic way. It's not guaranteed. I can never say, "This will be what happens tomorrow." I can only say, "Given what I saw in your model's performance on your holdout data, these are the empirical prediction intervals: 90% of points had this error or less."
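A rough sketch of those empirical prediction intervals, computed from holdout residual quantiles exactly as described (heuristic bands, not a guarantee; the numbers are made up):

```python
import numpy as np

def prediction_interval(holdout_actual, holdout_pred, new_pred, confidence=0.90):
    """Band = new prediction shifted by the central quantiles of the holdout errors."""
    residuals = holdout_actual - holdout_pred
    lo, hi = np.quantile(residuals, [(1 - confidence) / 2, 1 - (1 - confidence) / 2])
    return new_pred + lo, new_pred + hi

# Hypothetical usage with holdout actuals/predictions and a new prediction of 15.0:
actual = np.array([10.0, 12.0, 9.5, 11.0, 30.0])
pred = np.array([11.0, 11.5, 10.0, 10.5, 20.0])
lower, upper = prediction_interval(actual, pred, new_pred=np.array([15.0]))
```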
You can look at the residuals or the absolute residuals of your predictions and you can say, "The quantiles of this error are such that this is where we end up in most cases." And if you say only 60%, then you will get different bands, so you can control that. These are the BERT models. Also in 1.9, out of the box, you see at the bottom-left here all the different types of BERT models. Not just BERT, but DistilBERT, XLNet, RoBERTa, CamemBERT, or CamemBERT as we say in French, I guess, XLM-RoBERTa; all these different models are state-of-the-art models that are absolutely fantastic and they really shine. As you could see earlier, I showed you these numbers here from the validation set. They translate to the test set as well. So, a log loss of 0.4 instead of 0.6 is a huge improvement, 50% here.
BERT is the best machine learning model for NLP and will be in there out of the box, with MOJO support even, so you can productionize it. The same for computer vision. You can upload a ZIP file with a bunch of folders, with images, JPGs, PNGs, whatever. You just pull it in, it will automatically make a dataset out of it. It will automatically assign labels based on those folders and say, "Okay, these are all the mouse pictures, these are all the cat pictures, all the dog pictures," and then you'll have an image recognition problem solved out of the box with these pre-trained models, which also support fine-tuning now, so you can load these state-of-the-art image models and fine-tune them.
The thing I mentioned earlier was that there will even be a custom problem type, so not only will you do classification and regression, but you can do segmentation or sequence labeling or whatever you want to do. As long as you can define what the model does and what the score is, then you can do anything. So, you will not just have one column of predictions; a column could be a whole mask of segmentation shapes. Like, "The human is here. The car is here. The house is here. The pedestrian, the bicycle is here." And all of that could be pixel-based maps that are part of the prediction of the model. And then your score is based on all of that and the reality, and then you get some smart overlap measure, and so on. So, more complicated models will work as well. We're going toward the grandmaster-style AI platform vision and not just classification and regression.
It's a challenge to do it all in one platform, but I think if anybody can do it, then our custom recipe architecture should be able to do it, so that's our goal for this year. This is another hierarchical feature engineering thing, before I get back to questions. You'll be able to control the levels of feature engineering as you want. You can say: at the beginning, only do target encoding, and then do whatever else you want to do with feature engineering. So, you can control the first layer. That could be a bunch of transformers or it could be just one. You can say, "I want to just do the date transformation," where the dates get mapped into day of the year, day of the week, and so on, and then treat those as regular numeric features or categorical features.
You can do embedding with BERT into vector space and then run our feature engineering pipeline on top of that. Okay. This was the custom MLI. You've seen the same screen before, but you can do any explainers that you want with visualizations. This was the epidemic model. That's the last slide. We predicted reasonably well. Some of our teams were in second… Actually, they still are in first or second or third or fourth place in Kaggle on predicting the COVID spread, and the peak, and how long it will take, and so on, all using models. Some of them are Driverless. Some of them are outside, even simpler models, but this is in collaboration with actual hospitals. This is a SEIR model where we model the evolution of the disease, how it spreads through the different compartments… People who are sick infect others who are not sick, right? People who are sick get better over time. People who recover don't infect anybody, and so on.
Each of these equations down here reflects one of those realities, and then you integrate those equations forward from the starting point where only a few people are sick, and if you tell us what the total population is, then we'll tell you how it evolves, roughly speaking, if you also tell us what the rate of exposure is, how quickly it spreads, and so on. There are a bunch of values that you have to set for this model: free parameters, as we call them. If you set those free parameters right, then the overall evolution will be very meaningful. If not, then it's junk. This is an epidemiologist's dream, to set these parameters and then see what happens. The model that we fit is getting this fit of the epidemic model, but then it subtracts that from the signal, and what's left is the residuals, and that's being fitted by the rest of the Driverless pipeline.
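For illustration, a minimal SEIR integration in Python with scipy; the free parameters named here (beta for exposure rate, sigma for incubation rate, gamma for recovery rate) and their values are illustrative assumptions, not the fitted parameters from the slide.

```python
import numpy as np
from scipy.integrate import solve_ivp

def seir(t, state, beta, sigma, gamma, N):
    S, E, I, R = state
    dS = -beta * S * I / N               # susceptible people get exposed by the infectious
    dE = beta * S * I / N - sigma * E    # exposed people incubate, then become infectious
    dI = sigma * E - gamma * I           # infectious people recover over time
    dR = gamma * I                       # recovered people no longer infect anyone
    return [dS, dE, dI, dR]

N = 1_000_000                            # total population
y0 = [N - 10, 0, 10, 0]                  # start with only a few people sick
sol = solve_ivp(seir, (0, 180), y0, args=(0.3, 1 / 5.2, 1 / 10, N),
                t_eval=np.arange(0, 180))
infected_curve = sol.y[2]                # residuals vs. observed data would go to the ML pipeline
```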
So, you'll get both the actual fit of the overall epidemic model, that's the global behavior, and then you get the per-country evolution of this epidemic, and together you get one combined model, all out of the box. This is also in 1.9. So, that's it. Now, let's get back to questions before I point you to the docs. Yes, AutoML. It's not just parameter tuning or just feature selection; it's actually helping with the validation scheme. That's a big one for me. It's, "Given this dataset, how do I make a model that's good on tomorrow's data? Should I split randomly? Should I split by month? Should I split by city? Should I split by time? What should I do?" If you don't do that right, then the rest almost doesn't matter, because if you just randomly shuffle, you suddenly know about these people that you shouldn't know about, as in the example earlier, or you know about the future that you shouldn't know about, and then you're asking yourself, "Did I do well on the past?" and all these weird things. By random shuffling you can hurt yourself.
It doesn't matter how well you tune the model after you randomly shuffle a time series. It's just not going to work. What you have to do is really, really understand your validation scheme, and only if that's done right, only then can you actually start digging deeper into the tuning. That's what the Kaggle grandmasters do. They spend one month on the validation scheme and then the remaining month on the tuning. So, AutoML should do that validation scheme step for you. And that's our biggest challenge right now. We have 14 grandmasters, and I think four out of the top six in the world, and some of them are now thinking about how to automate these validation steps. Okay.
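As a tiny illustration of that "validation scheme first" point for time series (column names and dates here are made up), split by time rather than by random shuffle:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=365, freq="D"),
    "sales": range(365),
})

cutoff = pd.Timestamp("2019-10-01")
train = df[df["date"] < cutoff]     # fit only on the past
valid = df[df["date"] >= cutoff]    # validate only on the "future"
# A random shuffle here would leak future information into training and produce
# validation scores that will never hold up in production.
```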
"What are the arguments, pro and con, for Driverless?" Well, the pro definitely is it can't hurt, right? Unless you don't have the money; well, if you are an academic user it's still free. If you are a commercial user, you have a 21-day trial. That's the current business model. At that point, you have to decide whether you want to keep using it or not. But I would say I haven't seen a single person who has seen Driverless who then said, "I don't really need it." You always want to use something like that to have a baseline. Within seconds, it makes you the best baseline that you can come up with, given that dataset. I believe it's the best automatic machine learning platform out there. The only con is that maybe you don't learn as much if you let the machine do it, so you should still learn about data science. You should still look at the results of Driverless and say, "Well, why did it do clustering and target encoding? What does it mean? What is XGBoost? What is LightGBM? What is TensorFlow?"
I'm not saying stop playing around with those tools, but don't waste your time writing your own for-loops to tune something. That's definitely not a good time investment, unless you are just learning still. "How long did it take you to release Driverless?" Well, the first slide was done in January or so, or December, around 2017, New Year's, let's say. The vision slide, actually, our CEO made, and all the pieces of that vision slide are now filled up. That started with custom recipes. "Bring your own ideas." And I was like, "Whoa, that's difficult." And then he's like, "Oh, and TensorFlow, and XGBoost, and open source, and everything." I was like, "Okay. Okay. That's also difficult, but let's see." And then try to connect it and have explainability in there and have accuracy in there and have these trade-offs, all these knobs. Make it a GUI. Make it also Python, productionized in Java. The whole pipeline.
I was like, "Oh, no. That's difficult. How do you do that? How do you write Python code, custom recipes, and then have the Python code automatically get translated to Java for production?" How do you do that? It's not easy, right? Everything takes effort. And then how do you run this in a Kubernetes environment, on the virtual clouds or private clouds or whatever? All of this has to work. And we got there. I'm pretty proud of this. Unsupervised models, yes, we are working on those. Some of those are already unsupervised, as you saw in the visualization, but I would say that the real key is to ask, "What's in my data?" There are supervised approaches and unsupervised approaches, and some of the unsupervised approaches that we're taking are in the pipeline of the model itself, right?
Some of it, for example, could be that you have to do clustering to figure out what are the right distances between clusters, and that's a good feature. But we don't currently say, "Give us a dataset and I'll tell you what the five clusters are." That recipe doesn't exist yet, but that could be one of the custom recipes in the future, where you can bring any problem type. Yes, that is this one. Driverless is not currently doing a very good job at unsupervised, but we have ideas. Basically, you need to give us more data and let us tell you what is good about that new data. There's no better tool than Driverless to quickly churn through all the different possibilities of feature engineering and tell you which ones are good or not, and it definitely will not replace data scientists. If anything, it makes data science more of what you want it to be these days.
Workflows. I'm not quite sure what you mean by, "When will workflows be supported?" You can script everything in Python today. You can manage a [inaudible] notebook that uploads 12 datasets, or make a custom recipe that does the upload one time but then does 12 splits and all that. So, it's [inaudible]. It's definitely not a problem. Also from R. And then we have an even bigger one with the MOJO, the production pipeline. We cut off the models, we take the pipeline with the feature engineering and we stick it into the Spark environment, and in Spark you will have scoring going on on a terabyte dataset: you convert the data with this feature engineering, and then the terabyte of data can be trained on a Sparkling Water cluster, on a GBM model or something, so you can combine the power of Driverless with a terabyte Sparkling Water installation in Spark, so everything will be scalable to big data. So, you can do anything in that land.
Patrick Moran:
Arno, I think we have time for one more question.
Arno Candel:
Yes. Yes, you can export MLI graphs as images. That's right. You do have that. I think we're working more and more on that. All of our pictures right now you can download, but I agree it's not yet perfect, so we are working on adding more of this benefit for you. Also, we have more insight tabs that show up as insights, and some of those are Vega-Lite plots, so these are very portable plots that you can port to Word, to any language, any other visualization platform. So, we're trying our best to make it consumable by you. Thank you very much for your attention today, and I hope you can benefit from Driverless AI in this day and age. Thanks for your attention. Take care. Bye-bye.
Patrick Moran:
Yes. Thank you, everybody. I just want to remind you that we will provide you with a link to the recording in about a day or so. We want to wish everybody a great rest of your day, and thank you, Arno, for the great AMA session.
Arno Candel:
Thanks. Bye-bye.
Patrick Moran:
Bye.