
Driverless AI - Introduction and a Look Under the Hood + Hands-on Lab - Arno Candel, CTO, H2O.ai






Arno Candel:


Hi everyone. Before we forget: our team is really amazing. Unfortunately not everybody can be here on stage, but they all deserve to be on stage. I've never worked with any team that was as excellent as ours, and everybody's so motivated. Please, hands up in the air, applause for our people. Now, we start with Tom and his instructions to get our Qwiklab account set up. Please go to the URL there, and I'll reach the microphone over to Tom, but you have to follow these instructions if you want to follow the hands-on lab, in which you can actually run Driverless AI today on your laptop. All you need is a laptop, nothing else.




Arno Candel:


Just follow what it says here. You'll basically go to the URL. Qwiklab is spelled a little funny: it's Q-W-I-K-L-A-B, and the h2oai subdomain is the prefix, so you go to h2oai.qwiklab. There you have to sign up. You say join, you give your information, and you get an email that says click this link. As soon as that's done, you'll get back to that page and you'll be able to log in. You'll see a catalog of choices, and there will only be one choice. It will say Driverless AI and one GPU. That's what we need. As soon as everybody is there, we can continue. If anybody has problems, please raise your hands. We have people here in attendance that can help with that.


One more time, follow these instructions. You go to Qwiklab, that's Q-W-I-K-L-A-B. You sign up by joining, you enter your information, you get an email, you click that confirmation link, and you'll be in the system in no time, where you will have access to this pre-built image that will run Driverless AI on a hosted machine that has a GPU and also, of course, a reasonably fast CPU. At that point, you'll just wait for our instructions to continue. So is anybody not yet there? Please raise your hand and we'll give you more time. The website that you have to go to is h2oai, not h-two-zero, but h2o like water, dot qwiklab.


So water molecules have two hydrogens and one oxygen, not twenty hydrogens. Okay, I think any browser will be fine. It shouldn't depend on the browser or anything. We like the Chrome browser. Who has it working by now? About half.


I think we have enough to at least have somebody nearby help you. It's basically just necessary for you to have access to this system to be able to run Driverless AI hands on. You can always run it later. You can go to our website and download a Docker image that you can install with the instructions that are on our webpage. Then you can also run Driverless AI even on your own computer, but it has to be a laptop with at least eight gigabytes of RAM. It should be a powerful enough machine; it can't be a small one, but we have seen it running on Macs, on Linux boxes, and on Windows, everywhere. I think we're good. As we all know, there are not enough data scientists, especially not expert data scientists. Should I show you how to start the lab? Okay, that's a good point, I will let you all start the lab. This is the lab. For those who haven't clicked on it yet, click on it. This will get you into this field. And here you say Start Lab at the top right, the green button. Okay.


That's the most important button. Sorry about that. If you press that button, you're good. Don't press it again, just one time; make it become red. Once you click it again, it'll stop, and then you have to restart it. Now you have 57 minutes left of running this instance, but it has to boot up first, and while it boots up, I will talk about something else. I need to be fast, that's why I was trying to get through it. We'll get back to this website and we'll copy-paste a URL that shows up on the left here. You see it here. This one you have to copy, then go to a browser, paste it in, and wait to see what happens.


Introduction Of The Product


At some point you'll see this, and if it's already there, then perfect, then we can continue soon. But let's not do that quite yet. Let me introduce the product first, because that was part of the agreement. We have a bunch of Kaggle Grandmasters and Masters in our company that help us devise these products. So basically we are making a tool that can avoid human error and automate some of the laborious tasks such as model tuning, ensembling, automatic cross-validation, feature engineering, and detection of time series, things that are done over and over by people today. Some of them are making mistakes and we want to avoid that. In this example here, Andrew Ng's team published a paper that had a mistake.


It was just recently. It's important to not just sample randomly if the individuals in the dataset are correlated to each other. In this lung image problem, you would have not memorized that there's a tumor, but you would've memorized the shape of somebody's lungs, and you would've said, oh, I know that this person has a tumor, because there are two ribs there in that shape. If you detect those same two ribs again, you'll say, oh, I remember, that's a tumor. But it's not because there's a tumor; you basically just memorized the rib structure. So avoid making mistakes where the data is not random but you pretend that it is random. They had to republish the paper; it took them about, I don't know, 10 days or so, and now it's corrected. These mistakes should not be made because it's unnecessary.


You could have just asked in the beginning: hey, are these rows independent, or do they belong to something that groups them? If so, provide me that column that tells me the grouping, and actually the product has that fold column in there to avoid these mistakes. InfoWorld gave it an Editor's Choice award, so that's good. There are only a handful of these a year, so we are very happy about that. You can read that in more detail. I ran it last night just to make sure that it still works, and it performed very well. Driverless AI got me to the 18th position out of 2,926. That's on a Kaggle challenge that has completed, that's no longer active, but still, all I had to do is click once and let it run. I had a solution that was as good as anyone else's, basically, other than some Kaggle Grandmasters, three of whom work for us.
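The idea of the fold column can be sketched in a few lines. This is a minimal, hypothetical illustration (the names are mine, not the Driverless AI internals): every row from the same group, say a patient, is assigned to the same cross-validation fold, so correlated rows never straddle a train/validation boundary.

```python
def grouped_folds(groups, n_folds=3):
    """Assign each group (e.g. a patient id) to exactly one fold, so that
    correlated rows from the same individual never end up split across a
    train/validation boundary."""
    fold_of_group = {g: i % n_folds for i, g in enumerate(sorted(set(groups)))}
    return [fold_of_group[g] for g in groups]

# Six lung images from three patients: each patient's rows share a fold,
# so a model validated on held-out folds cannot simply have memorized
# that patient's rib structure.
folds = grouped_folds(["p1", "p1", "p2", "p3", "p3", "p3"])
```

A library like scikit-learn offers the same idea as `GroupKFold`; the point is only that the grouping must be declared up front.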


Paribas Kaggle Competition


Those three pictures in there, if you can detect them, you get a free beer tonight. It took them basically weeks, right? Weeks and weeks to get there. We just automated that part. It's already a good product. That previous one was the Paribas challenge, BNP Paribas, the bank. This is the Amazon challenge; that's four years old. That was Marios's first challenge. He's now number two, he was previously number one at Kaggle, and he told us that he basically uses this over and over. When he joined us several months ago, he said, let me see that this works on Amazon. From time to time we test that it works, but we can't actually overfit to this data set because it's not our goal. Our goal is to make robust features, and we don't overfit every time we have a holdout prediction.


It's very good. It tells you: this is the accuracy that we think it will be on new data, and usually that holds up pretty well. Avoiding overfitting is an important piece. In this case, we are in the top 5% out of the box, but if you add a little bit of magic that Marios knows, you can be in the first percentile. The magic is just to do one-hot encoding, add a linear model, and stack the whole thing. It's not that complicated. Even that we will automate, of course, because it works for us. We'll put that into the next edition. It's always a little bit more work, and suddenly it's more accurate. Who would've thought? The product also has automatic visualization. We'll go into that later. But the point I want to make is that this is visualization of big data and you don't lose outliers.
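The "little bit of magic" above can be sketched very roughly. This is an illustrative toy, not the actual recipe: one-hot encode a categorical column (so a linear model can use it), then stack, here just blend, the linear model's predictions with the tree model's.

```python
def one_hot(values):
    """Expand a categorical column into 0/1 indicator columns,
    one per category, in sorted category order."""
    cats = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in cats] for v in values]

def stack(pred_a, pred_b, w=0.5):
    """A tiny stand-in for a stacker: a weighted blend of two
    models' out-of-fold predictions."""
    return [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]

X = one_hot(["red", "blue", "red", "green"])       # categories: blue, green, red
blended = stack([0.2, 0.8, 0.4, 0.6], [0.4, 0.6, 0.4, 0.8])
```

A real stacker would fit a meta-model on out-of-fold predictions rather than use a fixed weight, but the shape of the trick is the same.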


Basic Architecture


There's no sampling involved. There are no hex plots or something where you don't know exactly where the point went. The point is the point. If there is an outlier in a 10 billion row data set, it's still there in this plot. That's a very good technique that Leland Wilkinson developed and we implemented; he actually implemented most of it himself. This is MLI, machine learning interpretability. We'll get to it later in the hands-on session. Stay tuned. You want to know what your model is doing, otherwise you can't trust it, you can't deploy it. That would be a shame: you won't deploy a good model just because you can't interpret it. It's better to have a good model and an approximate interpretation than an approximate model where you know exactly what it's doing, at least in some cases. This should help you get there. Another important thing that our customers are telling us, especially the H2O customers that are used to having a Java production code story: every time you build a model, you have a Java deployment package that is pure Java, nothing else, it just runs anywhere. The same thing will be true for Driverless AI. All the munging, all the feature engineering, all the XGBoost models: pure Java deployment.


Scripts And Automation


This is code that we have, that works today. It's still in development. It will come out basically next month, in January. That's the architecture overall. As you can see in the middle, there's some kind of brain, that's the Driverless AI, but you can interact with it from many sides. You can tell it to do something either from the GUI, as we'll see, or from a Jupyter Notebook or Python script; that's the client API.


Or once it runs, you can look at it from your cell phone or whatever. I do that regularly, I just check it. It's basically just a client-server system. The client can be just a web browser, so that's the GUI, but you can control everything from that. You want to script it and automate it, so eventually you'll have scripts driving the system. That's why you want to have multiple boxes and multi-user management and all that stuff. We'll manage that; I'll tell you about that in the roadmap slide. Let's assume you have a model like the ones I made yesterday for this Kaggle challenge. Once you have that model, you can deploy it. What does that mean? Well, if a new row comes in of that data set, with some strings and missing values and numbers and text fields and whatever, that row will then get transformed into that improved, feature-engineered data set that we made, and given to the model to make predictions.


The Modeling Pipeline


The whole thing is called the modeling pipeline. We're not just talking about models; we are talking about the feature transformation and the model and maybe stacking of models, all in one, as a full pipeline. That pipeline is an individual artifact, one version of which is the Java one, as I mentioned. Others are Python versions. You can have a Java one or, as exists already, a Python pickle. That's what we have right now. If you were to download it from the product, you would get a wrapped pickle, which means it's a Python state that can be loaded back into this exact environment. I can throw this pickle to my IT guy, and that department can just set up a brand-new box with a Python environment, put the pickle in, and say go. It'll score any new CSV row or batch of rows, like a pandas frame.
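The pickle workflow looks roughly like this. The pipeline class below is a hypothetical stand-in, not the actual Driverless AI artifact: its only point is that the fitted state (here a training mean used for imputation) travels inside the pickle and scores new rows in a fresh process.

```python
import pickle

class TinyPipeline:
    """Illustrative stand-in for a fitted munging + model pipeline."""
    def __init__(self, mean):
        self.mean = mean  # "fitted" transformer state learned from training data
    def predict(self, rows):
        # impute missing values with the training mean, then apply a trivial model
        return [(x if x is not None else self.mean) * 2 for x in rows]

pipe = TinyPipeline(mean=5.0)
blob = pickle.dumps(pipe)        # the artifact you'd hand to the IT department
restored = pickle.loads(blob)    # loaded back in a matching Python environment
preds = restored.predict([1.0, None, 3.0])
```

Note that unpickling needs the same class definitions importable on the target box, which is exactly why the talk stresses "loaded back into this exact environment."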


There's no need for any GPU box or any runtime, anything; it's just Python at this point. But it should still be Linux, because there is an XGBoost inside that has to be compiled. We made it work for IBM, and we could have a native version that works on a Mac, but not many people will have enterprise deployments on a Mac. That's why you sometimes see us say, okay, it's only Linux. Let us know if you have requirements that are strict; it wouldn't be too hard to make it run everywhere. But especially with the Java one, it'll run everywhere. And if you have Java, you have Go, you have C, whatever you want. There's also a Thrift API. You can even use C# or whatever over the wire. If you have questions, please just meet me or the team afterwards.


Remote Control Scoring System


Basically you can remote-control the scoring system with a binary protocol anywhere. You can say, here I have a row, please score it. You send it over the wire. The other side gets it and deserializes it, says, oh, this is the row, okay, let me score it, and sends back the results: either the transformed data or the actual prediction of the model. That's all there. Imagine you have, let's say, a two gigabyte data set. You run all night on a brand-new GPU box, an amazing box, and build 500 models overnight. You get a great feature engineering pipeline. At the end of that, what do you do? You don't have just two gigabytes; you have 20 or a hundred gigabytes of data that you would like to score every night. We will have Java scoring logic ready in January that you can put into Spark, for example, and score all these rows.


Future Refitting Of The Transformer


What if you want to retrain on big data? What if you wanted to fit that same pipeline on the 20 or 200 gigabyte data set? The refitting of the transformer, that's on the roadmap for the end of next year. So roughly one year from now, maybe a little bit earlier, we should have Sparkling Water as a backend for Driverless AI, so that you can have big data transformations. But I don't think we'll recommend training only on Spark. You want to have the fast, rapid prototyping of figuring out what munging you need. Once you know that, then you run the fit one time on the Spark infrastructure with Sparkling Water, H2O GBMs, and so on.


Endless Stream Of Ideas


You see the recipes coming in from the bottom. It's like a car manufacturing plant, right? There's an endless stream of ideas. When you go to Kaggle, you see hundreds of recipes. People have ideas: try this. Have you tried reinforcement learning? Have you tried LSTMs? Have you tried GANs? Have you tried the extra-trees classifier? Have you tried this and that, percentile calibration of some things? You can do all these things as a recipe. We want the data scientists to be empowered to write these recipes not in distributed Java, but in Python, or whatever they like, or in R, and then be able to provide these scripts and we run them somehow. We have to define that API clearly so that it's productive and scalable, but we don't want to over-engineer the thing and make it a hundred percent generic. It should be just right, so that you can build your building blocks and we make sure that it's still not overfitting and it's still fast and it still runs on GPUs.
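What might such a recipe interface look like? The sketch below is purely hypothetical; it is not the actual Driverless AI recipe API, which had not been published at the time of this talk. It only illustrates the idea of a pluggable transformer written in plain Python with a small fit/transform contract.

```python
import math

class Recipe:
    """Hypothetical base class for a user-defined recipe (UDR)."""
    def fit_transform(self, X, y=None):
        # default: stateless recipes just transform
        return self.transform(X)
    def transform(self, X):
        raise NotImplementedError

class Log1pRecipe(Recipe):
    """Example recipe: compress a skewed numeric column with log(1 + x)."""
    def transform(self, X):
        return [math.log1p(v) for v in X]

out = Log1pRecipe().fit_transform([0.0, math.e - 1.0])
```

A stateful recipe (say, a target encoder) would override `fit_transform` to learn from `y` and then apply the learned state in `transform`; the framework's job is to run either kind safely on out-of-fold data.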


Driverless AI Roadmap


That's what we are roughly thinking in terms of complexity coming up. So this is the roadmap. You will see that now, well, it's already there: we have a lot of features. We have the recipe of being good basically at Kaggle-like data sets that are not images. We don't do images right now. This is not a deep learning framework. There is deep learning inside, not in the version that you have, but next month's version will have deep learning inside for embedding categorical variables. Certain features you can derive from it, and it even helps for multi-class problems: when you have a hundred different classes, it's easier to build one neural net than a hundred times as many trees for a gradient boosting method. But it's not an image classifier, okay? It handles text. It can really have strings in it with words and explanations, and it'll do its best to do text-based features, but it's not an LSTM yet.


That might be another recipe. We do prevent overfitting and leakage; we're very good about that, I would say. We also have other things like the visualization, the interpretation, and this whole Python API. Everything is Python right now, at least from a user's point of view. The Java MOJO will come very soon. You'll see the multi-GPU showing up more and more, where you train one model, one XGBoost model, on a whole DGX, let's say with 8 GPUs, where the data can be 50 gigs distributed across the 8 GPUs, and you can build one massive model where the data does not have to fit on each individual GPU. Because GPUs only have 16 gigs of RAM, we would say, oh, that's small data. But actually to train on 16 gigs is still a significant amount of work, and you will not be able to train a thousand models to get to the best feature engineering pipeline. So there's a compromise today: if you want to have many models evaluated, then you can't train them in a way that is too slow. You just can't have it both ways unless you have a hundred-DGX cluster, which is not realistic for most people. We recommend going to something like 5 million rows, or whatever it is for your problem, so that you can run overnight, get good results, and then draw conclusions from there.


Enterprise Features


Multi-user and data connectors, those are the enterprise features that we are working on: having Hadoop connectors, database connectors, having security. When you log in, the system knows who you are, and with a Kerberos token it knows what you have access to. You get only the certain data sets you're entitled to see. Your colleagues can use the same box but don't see each other's data, and so on. That's all coming next quarter.


Time Series


Then a little bit more on time series. We already have time series support. We do causality-conserving machine learning, if you want. We are not cheating by looking at the future and saying, oh yeah, now I know what to do tomorrow because I already saw it. No, we are only looking at the past. But we are not making all the lag features and all the cumulative sums and so on automatically yet; that's in the code base, just not enabled. So that's a little bit more work for us to do. The user-defined recipes, or UDRs as I guess we can call them now, would be very useful, right? You can just say, oh, what about random forest as a model, because I like random forest. That's the only thing you want to change, or you want to change your loss function, or you want to change the way you do cross-validation: it has to be this way and it can't be done any other way.
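The causality point, only look at the past, can be made concrete with lag features. This is a minimal illustrative sketch (the function name and layout are mine, not the Driverless AI internals): each time step sees only strictly earlier values, never the future.

```python
def lag_features(series, lags=(1, 2)):
    """For each time step t, emit series[t - k] for every lag k,
    or None when there is no past value yet. No peeking at the future."""
    return [[series[t - k] if t - k >= 0 else None for k in lags]
            for t in range(len(series))]

# A 4-step series with lag-1 and lag-2 columns; early rows have
# missing lags because the past simply doesn't exist yet.
lagged = lag_features([10, 20, 30, 40])
```

Cumulative sums and rolling means follow the same rule: every derived value at time t must be computable from data available before t.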


Sparkling Water


Well then Sparkling Water, as I mentioned earlier, will be the big one. So if you go to docs.h2o.ai you'll get all the information you'll ever need for anything. It will tell you how to install Driverless AI on your laptop. It'll tell you how to install it on all the cloud vendors. It'll tell you how to install it from the NVIDIA registry, right? If you have a DGX, you have it in the registry of NVIDIA. It will tell you all about the settings, the knobs, when you start Driverless AI; there's a bunch of settings and this will explain them. So I think we'll get to this right now. Who has it up and running already? Who cheated and clicked a little early? No, no, no. Alright, let's see.


Running H2O AI


Arno Candel:

Well, I did. Okay, so does anybody still have problems getting this set up and running? Only a few? Okay, perfect. So it actually looks like we are good. Only one person has a problem, which we call a statistical oddity. So let's scroll down. I agree with these terms. I'm sure you read it. Did you see the typo somewhere?


We'll fix that in the next release. So h2oai or h2o, it doesn't matter, it's just some password. That is to make sure that you know who you are; no one else knows who you are. Now you need to have a license key, and that you get from the instructions here. So you copy-paste this blob, every single character here. Copy-paste that over. Okay? So go back to the Qwiklab instructions and you copy-paste that piece. It's just one long string of garbled characters. That's your private key that goes into this product, right into the GUI. It says enter license at the top, at the very top. You paste it in. Don't paste it in the bottom; paste it at the top.


If you have an email with an actual license that you asked for earlier, then you can use your own license, because that one is tied to you as a user, which maybe gives you more permissions or something. In this case, I think the only benefit you get is that if you have problems, you can tell us who you are and we'll be able to identify you. But for now, this is fine; just use whichever license you want. If you go home, for example, you have to ask us for a license. You can ask for one, and you'll get one for free, right? 30 days or something. No problem. We want you to give us feedback. So you save this license key, and the prompt should go away. Now you're in a good mode where nothing is in there yet, but we are able to start. Who is ready at this point? Lots of people.


Add Data Set


Perfect. Let's do this. Let's start with adding a data set. Let's go to the data folder. Right now you can still go into every folder on that box and try to read plain text files that somebody wrote. We are going to make that more secure and not allow you to see everything. Right now we give you full access to your own box, but you could argue that someone else could do the same thing. It's not yet secure and we know that; it's just not finished yet. Go to data.


Go to Kaggle. Then go to credit card. You will see the path at the top. Get the training data. Say import at the bottom. The mouse has to go up and down a few times. If you don't know where to go, just go up or down, and then we'll add another one that's called test.


This is a good principle. You have two data sets, and if you give us both, we will never look at the test set, just to be frank upfront. We will not use it. Even if it's a Kaggle problem, we will not use the test set unless the recipe explicitly calls for that. Right now we don't; right now what we do is purely training on train. This 18th position earlier was training on train only: no test set, no distribution checker, no categorical agreement checker, no anything, no missing-value imputation based on the global this and that. Nothing, there are no secret tricks. When we get the test set, it's the same result as if you don't get the test set. But still, you have it here and we can actually pass it in. Then at the very end, when the model is done, it'll tell you what the score is on the test set, if it has a response in it.


If it doesn't have a response, you'll just get the predictions for it. You also get the munged test set. The munged test set is the test set that has the new features added to it and the bad features removed, if you want. We make the good data set for you out of the data that you have; that would be useful for further training if you wanted, or whatever you want to do with it. If you use this model, then we can do the scoring, but you can also refit outside. We'll give you a transformed training set as well, which has out-of-fold created, engineered features that are not overfitting but still valuable. The features in the training data set that got enriched are smarter than just, let's say, a lookup table or something like what would be done on a test set.


They're doing batch-style munging to give you an enriched data set that's better than a row-by-row transformation. You actually get a good data set out of this process. The training set that's enriched, you could actually throw into another machine learning tool like H2O-3 and train more. If you transform the training set as if it was a test set, row by row, you'll actually lose some information too, just to be clear about that. It's a small distinction, but the act of turning your training set into a better training set has to be done in a full-batch, all-at-once mode. It can't be done row by row. But the test set is always done row by row, because it's like a real-time streaming scoring system. There's a differentiation between transforming training and testing data. If you ever get confused, ask us and we'll write a blog about it, because we have not done that yet, I think. You click on train, and it'll show you visualize or predict. Let's do visualize real quick. This is now telling the server: hey, here's a data set. Find me the outliers, find me anything that's unusual, find me anything that I should know about, and show it to me.
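Why can't the training set be transformed row by row? Out-of-fold target encoding is the classic example. This sketch is illustrative (names and layout are mine, not the product's code): each training row's category is encoded using only the target values from the other folds, so no row's own label leaks into its feature.

```python
def out_of_fold_target_mean(cats, targets, folds):
    """Encode each row's category as the mean target over rows of the SAME
    category in OTHER folds. Needs the whole training batch at once; a
    row-by-row transform could not exclude the row's own fold."""
    encoded = []
    for c, f in zip(cats, folds):
        vals = [t for c2, t, f2 in zip(cats, targets, folds)
                if c2 == c and f2 != f]
        encoded.append(sum(vals) / len(vals) if vals else None)
    return encoded

enc = out_of_fold_target_mean(["a", "a", "a", "b"], [1, 0, 1, 1], [0, 1, 0, 0])
```

At test time the same feature is just a lookup of the mean learned over all training folds, which is why test rows can be scored one at a time.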


That's for the top part. The bottom part is always shown. It's basically some kind of overview of the data, but not just a row-by-row printing of the numbers; it's presented in some kind of meaningful way. You can look at only spiky histograms, for example. I'll show you something that's spiky. There's only one, because there is no other spiky histogram of anything against anything. Everything else is not spiky. That's good to know. Sometimes you'll see something unusual in here, and I think Leland is going to talk about this tomorrow. You'll go to his talk and you'll hear more about the things that he's written down here, in his own words.


Okay. If there are people who couldn't open the GPU lab because it ran out, we have too many people attending. There's another one now, a CPU version, which is basically the same. It doesn't matter how you run it; CPU or GPU is fine if you don't care about the speed for the small data sets we are doing. This is the automatic visualization, so that's nice. You can look at outliers; you'll only see things that are actual outliers. See, there are some points here that are interesting. The dot plots actually are fun because they're dropping the points exactly where they belong. It's not like a histogram where you have to make a same-width bar somewhere. No, you're putting the point where it belongs. This one is definitely some kind of outlier you should know about, and so on. It only shows you the relevant stuff. That's nice.


Building A Model Set


Okay, so let's build a model. Let's go to experiments and say new experiment. You can also do this from the data screen: instead of visualize, you can say predict. Then you pick the training data set as the training data set, and you pick the test data set as the test data set, just for later insight into how well we're doing. Now you have to select a target column. That's really all you have to do. I will pick the last one, which is the chance of me defaulting next month; that is what I have to predict, basically. This actual response has two levels, as you can see here; it has two uniques. There's a yes and a no, or a zero and a one. This says whether the person will default next month on their payment or not. That's what we are going to model now, a model that makes the best guess: given a person, will that person default next month? This is a credit card default prediction model, binary classification. We can leave the rest as defaults. There are no columns to drop. If you know that there are some columns in there that you cannot have, then you can drop them right there. The model will never see them, as if you never had them. Then there's the issue where the rows are not identically distributed, not from the same distribution, or they're also not independent.


They could be from the same distribution, but they could still be dependent on each other. Then you would have to say, wait a second, I want to split the cross-validation folds so that the same people are in the same fold, so I'm not cheating as in that example earlier, and that fold column you can provide. Here we have Mark Landry, by the way, glad to see him. He's the guy I mentioned earlier in the other talk, the person that basically brought data science to the company. So hands up for him.


If you have any questions about data science, he's very knowledgeable about all these little pitfalls. He has given many presentations about that and you should read his slides. They're very good. There's also a weight column, and they all have tooltips. If you don't know what's going on, look at the tooltips. The weight column tells you the observation weight: each event, each row, has a weight. If you want, you can provide a column that has that number. Now, I will say go, but first I'll take a different loss function for fun, like log loss. You can do whichever one you want. We also have a GPU here, so we can enable it or you can disable it. If you click on reproducible, it will just make the experiment be the same every time you run it.


That means if you want to know that it's going to be the same, because you have a presentation tomorrow and you want your boss to see it running live and looking good, you can try that with reproducible on. Then tomorrow it will be exactly the same. If you don't do it, then it'll just be different. You'll still get the seed that was used in the experiment, and in the end, you will know which seed it was. If you have a Python script, you can rerun the experiment with that seed and you'll get exactly the same results. But from the GUI, you don't want to type in seeds, so we just made it on or off. And this is a classification problem. I could say, no, I want to make a regression problem out of zero-one. It's like my age: I want to predict how old I am or something.
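The seed mechanic is the same as in any random-number library: reseed with the recorded value and the whole run replays identically. A minimal sketch with Python's standard `random` module (the seed value is just an example):

```python
import random

random.seed(2017)                       # the seed the tool would record
run_a = [random.random() for _ in range(5)]

random.seed(2017)                       # rerun tomorrow with the same seed...
run_b = [random.random() for _ in range(5)]
# ...and the "experiment" comes out exactly the same
```

Real experiments also need the same software versions and deterministic backends (GPU kernels are a common source of nondeterminism), which is part of what the "reproducible" toggle has to control.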


It's not really a yes-no if I know better that this is a real age, but it's kind of weird if it's zero-one, right? It's binary, so it's not a regression problem, it's classification. Everything was auto-detected just fine. I'll take the log loss and I will say go. Now this is running. As it's running, you'll see that it does something at the top. It says tuning backend. It did something already, and now it's at zero out of eight, two out of eight, parameter tuning models. It's doing some parameter tuning at the beginning. First it figured out the backend, which means it figures out whether the GPU works, or whether the GPU has a problem and it should run on the CPU. Once it knows how to run, it will do some preliminary tuning, which is basically figuring out how to transform my response.


If it's a regression problem, do I need to do a log transform or square root transform or something? If it's a classification problem, like in this case, it won't have to do that; it can just tune the XGBoost parameters. In this case it will tune different depths, different column sampling and row sampling rates, and so on, the standard stuff you would normally tune. It already did that. Now it says it's starting with the feature engineering. It has a good model, so it knows the data is reasonably well understood by this model. Now that we have a model, we will keep the model the same. The model parameters don't change except the number of trees; we do early stopping. Once it starts overfitting, we stop; we don't fix the number of trees. Even though everything else is fixed, we still let it converge until it's optimal, as judged by an out-of-fold holdout set.


It's all fair. We're basically doing early stopping on our tree building, and now we're building lots and lots of trees. Each model can have up to a thousand trees or so, and we're building hundreds of these models. That's why GPUs are useful. Now this is a very small data set with 20,000 rows. If you're saying, oh, my CPU is actually faster than the GPU: yes, that's because it has to go over the PCI Express bus back and forth just to put in 20,000 rows and then run the algorithm in 10 milliseconds or something. Don't judge this by small data, please. You can judge the accuracy, but even then, we don't have a small data recipe yet. If you have a hundred-row data set and you say, oh look, this statistically is like my toy case, we might not do that well on that. Although, as Lee pointed out the other day, he had two Gaussian blobs, and that data set is very difficult not to overfit, and we did not overfit. We do use the reusable holdout and other principles to make sure we're not overfitting. In this case, you'll see now it's running two minutes, 11 iterations down the line. Who sees something like this on their screen, just to make sure that we are all following here? Okay, great.
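The early-stopping rule described above, grow trees until the out-of-fold score stops improving, can be sketched in a few lines. This is an illustrative toy, not the product's implementation: given the validation score after each added tree, it returns how many trees to keep.

```python
def trees_to_keep(val_scores, patience=2):
    """Given the out-of-fold loss after each added tree (lower is better),
    stop once `patience` rounds pass without improvement, and keep only
    the trees up to the best score seen."""
    best, best_i = float("inf"), -1
    for i, s in enumerate(val_scores):
        if s < best:
            best, best_i = s, i          # new best: keep growing
        elif i - best_i >= patience:
            break                        # stalled: stop adding trees
    return best_i + 1                    # number of trees to keep

n = trees_to_keep([0.70, 0.55, 0.50, 0.52, 0.53, 0.54])
```

XGBoost exposes the same idea through its `early_stopping_rounds` option; the key point from the talk is that the score being watched is always an out-of-fold holdout score, never the training loss.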


More Detail About The Test Model


Now we can tell you a little bit more about what's going on. The top sections are just giving an insight into the progress. How many models so far? So, 41 models, and 764 features have been engineered and tested. The left side is the same as what you typed in the beginning. The right side is also what you set in the beginning, except that there are CPU, memory and GPU usage indications. On the right there's also a trace. You can click on the little trace button and you'll see basically what's happening. Each green thing is a GPU model. Each gray thing is some kind of method call. You can mouse over and see what's going on. It's basically a real-time tool that gives you an insight into what's running. The red stuff is feature engineering. Every so often it makes some new features, then it builds a bunch of models, and then it does new features, models, features, models, and so on. This is how it figures out which features are good.


In the bottom left you see the chart with the performance metrics as judged by the out-of-fold holdout. We are not showing the metric on the training data. We are not scoring the model on the training data and saying, look how low the loss is. We are only showing you validation scores. Every score you ever see will be an out-of-sample estimate of how it's doing on unseen data, but it could be done either with a third of the data held out or with fourfold cross-validation or something. It's not consistent how exactly the number is obtained; it's just the best estimate. Even if it's a two-thirds/one-third split, we're still doing many, many scoring runs on samples of those data sets to give this reusable holdout principle a good run for its money.
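The idea of many scoring runs rather than one validation split can be sketched like this. All names here are illustrative and the toy model just predicts the training mean; the point is averaging the validation score over several random holdout splits instead of trusting a single split.

```python
import random

# Sketch: repeated random holdout scoring. Average the out-of-sample
# score over several train/holdout splits for a more stable estimate.
def repeated_holdout_score(rows, score_fn, n_repeats=5, holdout_frac=1/3, seed=42):
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - holdout_frac))
        train, holdout = shuffled[:cut], shuffled[cut:]
        scores.append(score_fn(train, holdout))
    return sum(scores) / len(scores)

# Toy "model": predict the training mean, score by squared error on the holdout.
def mean_model_score(train, holdout):
    mu = sum(train) / len(train)
    return sum((y - mu) ** 2 for y in holdout) / len(holdout)

data = [float(i % 7) for i in range(100)]
print(round(repeated_holdout_score(data, mean_model_score), 3))
```

Any single split's score is noisy; the mean over repeats is the "good number" the talk refers to.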


We're not making one number, we're making many numbers, and we're giving you the mean or whatever we compute from it. It's a good number. That's all I want to say. It's not just one test score somewhere on some validation split. It's a good number; it's what we think is how well your model does so far. If you look here, we're down 30 iterations of this genetic algorithm. If you want to know more about the genetic algorithm, you can mouse over and it will explain everything. There is a variable importance chart here. You can see that the features are pretty widely ranged. Here is this pay times pay: two different pay amounts are multiplied. There's a cluster distance: after we segmented these three payment columns into clusters, you compute a distance to cluster one.
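Two of the engineered feature types mentioned here, a pairwise interaction of two payment amounts and a distance to a cluster centroid over the payment columns, can be sketched as follows. The column names, row, and centroid are hypothetical; this is not the product's actual transformer code.

```python
import math

# Illustrative engineered features (hypothetical column names):
# a "pay times pay" interaction and a distance to a cluster centroid.
def interaction(row, a="PAY_AMT1", b="PAY_AMT2"):
    return row[a] * row[b]

def cluster_distance(row, centroid, cols=("PAY_AMT1", "PAY_AMT2", "PAY_AMT3")):
    # Euclidean distance from this row to the centroid of one cluster.
    return math.sqrt(sum((row[c] - centroid[c]) ** 2 for c in cols))

row = {"PAY_AMT1": 2.0, "PAY_AMT2": 3.0, "PAY_AMT3": 1.0}
centroid = {"PAY_AMT1": 0.0, "PAY_AMT2": 0.0, "PAY_AMT3": 0.0}
print(interaction(row))                 # 6.0
print(cluster_distance(row, centroid))  # sqrt(14), about 3.742
```

Unlike a raw neural-net weight, both features have a readable meaning: a product of two named columns and a distance to a named cluster.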


This could be the rich people or the poor people or whatever you want to call it. You're not supposed to make these judgmental decisions in credit scoring or discriminate against people. This will tell you why: this is not a neural net that says the weight was 7.3. This is telling you really what the feature is. We're reasonably well interpretable in this regard. There's frequency encoding, cluster target encoding. First we cluster the data, then we make groups: we say basically each cluster means these people belong together, and then we compute the mean of some other column in all these clusters. All of that is done with cross-validation, so it's all fair estimates. You don't want your own numbers computed with you in the sample. You want to compute it from other people that are like you. All these things have to be done just right, and that's what the Kagglers are good at. Otherwise they wouldn't win. That's the whole point: all these little things are built in just right. Does anybody else have similar results?


Features In A Finished Experiment


Great. Okay, perfect. When this experiment finishes, you'll actually see that it will also build an ensemble. Now you might wonder, why 5/5/5? What does accuracy five mean? Well, there's a help description right next to those settings at the beginning. You can click it now, but if you click it at the beginning, when you're entering the settings, it takes you directly to the web page that explains what the settings mean. Accuracy three, what does it mean? Seven, what does it mean? Roughly speaking, more means we do a little bit more cross-validation, a little bit more genetic algorithm exploration, a little bit more repeats everywhere to make sure that when we say this is a better model, it actually is a better model. So more conservative, a little bit more time consuming. Even though you turn up the accuracy, not the time, it'll still take longer.


The time setting is not how long it'll take; the time is a relative time. The accuracy makes it take longer. Obviously we know you can't have both high accuracy and fast. Oh, I want everything? No, no, no: high accuracy means it'll take longer. The time is just a relative time, and I suggest always using time equals 10 and then stopping yourself, because you can press the finish button at any time. That's the same as if you had set time equals two or three or something, except that time two means 20 iterations and time three means 30 iterations. If you manually stop at 17, there is no knob for that, and you're not going to get repeatable results if you manually stop. That's why we have these settings. So you can say, oh, I always want to run overnight with time equals three.


That's the amount of time I can tolerate. It's more like your time tolerance given a certain accuracy. Now, the interpretability knob is not yet well exploited; we haven't done too many things. We have monotonicity constraints in GBM, but unfortunately not in the GPU implementation, so it's not really enabled. We have no linear models yet; that would be a very interpretable model, but that's all coming. These will be new recipes that you'll see in the new year, where certain interpretability settings will mean actually different models. Then we can focus more on feature engineering and less on model tuning, for example. Now that the training is done, you'll see a bunch of buttons here. You can interpret this model, which I'll do in a second. You can score on another data set, which means: give me predictions of credit card default probabilities for each individual person in that data set. That will come back as a download. You can transform the data set: you give your data and back comes the munged one, with all the features that we consider important here. You can also get the out-of-fold predictions on the training data set. Now, who knows what that is good for?


Yes, you can do more model training with those out-of-fold predictions. You can build a meta model on top of it; that's called an ensemble, right? A stacking model, for example. This whole black box could be one model, and it makes a prediction on both training and testing; that's a new feature. You do this 10 times in a row, you get 10 columns, you throw another model on top, and it says, oh, these 10 columns, what do I do for each individual? Suddenly that becomes powerful. You do this with different loss functions in a for loop, you have 50 columns, you take a neural net on top of those 50 columns, you win Kaggle, basically. It's not that hard. You can get the test predictions; that's what you want, usually. Then you can also just get the transformed training and test data sets. Basically it's all there, right? You have logs, and you can download the scoring package. The scoring package has both client-server HTTP, JSON-RPC, Thrift, everything in there, but also local node scoring. If you have a Linux system, it will make a Python virtual environment and run all the examples in it, with random data auto-generated with the same schema as this dataset. I can actually show that to you.
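The stacking idea can be sketched as follows: two base models' out-of-fold prediction columns (mocked here) become the features of a tiny linear meta-model, fitted by solving the 2x2 least-squares normal equations by hand. This is an illustration of the technique, not the product's ensembling code.

```python
# Sketch of stacking on out-of-fold (OOF) predictions: a linear
# meta-model learns weights for two base models' OOF columns.
def fit_blend(p1, p2, y):
    # Least squares for y ~ w1*p1 + w2*p2 (2x2 normal equations).
    a = sum(x * x for x in p1)
    b = sum(x1 * x2 for x1, x2 in zip(p1, p2))
    d = sum(x * x for x in p2)
    e = sum(x * t for x, t in zip(p1, y))
    f = sum(x * t for x, t in zip(p2, y))
    det = a * d - b * b
    return ((d * e - b * f) / det, (a * f - b * e) / det)

# Mocked out-of-fold predictions from two base models.
p1 = [0.2, 0.8, 0.4, 0.9]
p2 = [0.3, 0.7, 0.5, 0.8]
y  = [0,   1,   0,   1  ]

w1, w2 = fit_blend(p1, p2, y)
blend = [w1 * a + w2 * b for a, b in zip(p1, p2)]
print(w1, w2)
```

Because each base column is itself inside the meta-model's search space, the blend can never do worse than either base model on this fit. With 10 or 50 such columns and a stronger meta-model, you get the full stacking setup described above.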


The scoring package here, oh, it's 80 megabytes. See, that's all the transformers plus all the models and the ensemble; everything is in there. If that finishes, I'll open it later; it's a bit slow right now. On the bottom right you'll see the summary of the experiment. You'll see it was a data set with 24,000 rows, we predicted a certain column, and this was the recipe. It did one internal holdout and computed the scores on that one internal holdout. It split the data one time and computed the score on that split, but not just once; many times, and gave you a good estimate. It made 1,367 features. Who else has more features than me? Anybody? Who has less? Should be about even. Okay, perfect. We have some kind of distribution of features tested. If you add this all up, we'll have probably 20,000 or a hundred thousand features tested just now. It's pretty cool. Who has a better score on the test set than 0.404? Less than 0.404? Anyone? Okay, that's the power of luck, I guess. Awesome.


You see here the iteration one model on the training holdout was 0.434, and at the end it was 0.430. Not a huge improvement, only four thousandths of improvement in log loss. But it's a small data set and we only ran it briefly. This data set is not very easy to squeeze information out of; only 16,000 rows. Once you see it, you're done. Even the very first model is pretty good already. It's not just a measure of how well it improves, but how good it is in the beginning. That's the real judgment. If you say, oh, my model is way better, then you shouldn't just think that it's better because it was better on your one holdout set or maybe your training metric or whatever. You have to be very careful that the estimate is a good estimate.
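The scores being compared here (0.434 versus 0.430) are log loss. A minimal reference implementation, for readers who want to check numbers like these themselves:

```python
import math

# Binary log loss (cross-entropy): the metric behind the 0.434 -> 0.430 scores.
def log_loss(y_true, p_pred, eps=1e-15):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(round(log_loss([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]), 3))  # 0.198
```

Log loss rewards well-calibrated probabilities, which is why a four-thousandths change is still a meaningful improvement on a metric like this.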


This iteration one estimate that I highlighted there is already a very well estimated number, after we do the very smart feature engineering: target encoding on the categoricals, missing values already handled, all kinds of stuff with the transformation of the response if necessary. It's not like a dummy model; it's not the poor man's first baseline. We don't have a recipe for that. People are asking to see how their own model would've done, or what a random forest would have done. We don't show that right now; we show only good stuff, basically. But even the good stuff can improve if the genetic algorithm yields better feature transformers that are useful.


Why Does It Jump At The End?


A good question came up from the audience: why does it jump at the end? Why does it fall down so much? Well, that's stacking. That's the final model here that took 96 seconds, and that was nine models: eight base models plus the final meta model. The eight base models are two different parameter sets, each with fourfold cross-validation, and then we make eight different predictions on the test set. Then the meta model puts them together. In this case it's actually blending with a linear blender: each of the two components has a weight and then they get put together. It's not half and half; it's whatever weight the linear blender deems right. We're not doing proper stacking right now, but that could be a new recipe, where we do a neural net or extra trees or something at the end, depending on what customers are asking for. I'm sure they're asking; higher accuracy is better, usually. That's basically it, except that we haven't demonstrated the interpretability mode, and we have Patrick Hall here, who authored the whole brains behind it. If you have a microphone, I would really love for both Mark and Patrick to give a quick two-minute summary of their work and their findings. Can we make that microphone live here?


Good question: when you say finish, will it jump to the end and then do the ensembling and the deployment package? Yes. You'll always get the deployable model, and it's pretty accurate. You're just saying, I'm happy with the feature engineering so far. But you never know what's coming next; at any point in time it could jump a lot. That's the feature engineering for you, unless you have some baseline of what's in the data or something. Do we have a mic? Otherwise, I think you guys have to stand here.


Speaker 2:


Yeah, we're going to let them do just a few quick minutes. I know some of you really had a lot of trouble with the Qwiklabs, but you can use that key later on, when you're at home. I know that we have a lot of questions, so I told Patrick and Mark they have to talk very fast for a few minutes so we can get to some of the questions really quickly. Okay.


Patrick Hall:


Hi, so I'm Patrick, and actually Navdeep Gill and Mark Chan, who are around, and I will be doing an hour-long hands-on just on this tomorrow. I think that's the biggest take-home: tomorrow afternoon we'll be focusing just on the interpretability aspect, so if you're interested in that, please check it out. Just very briefly, our goal here is to be able to justify every decision that this black box machine learning model makes, so that we can show which features contributed to each decision that the model made, and to a certain extent how much each feature contributed to that decision. We go so far as to attempt to write them out in plain English.


So specifically for this person, we're saying something like: their most recent payment status, PAY_0, was two months late, and that resulted in their probability of default increasing by about 35%. We're trying to assign local variable importance for every single decision that the model makes, so that from that we can build reason codes about why the black box model is behaving the way it's behaving. With that, I think I'll leave it off; I'll give you guys that teaser for tomorrow. Come see us tomorrow afternoon if you're more interested in the interpretability stuff. I'll reiterate: if Qwiklabs didn't work for you, with the key that you got in the instructions you can download an evaluation copy and try it later. That key should work for you.
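A toy sketch of the reason-code idea, not Driverless AI's actual MLI method: for a simple linear model, each feature's contribution to one prediction is its weight times the feature's deviation from the mean, and the largest contribution becomes the top reason code. The weights, means, and column names below are hypothetical.

```python
# Toy reason codes: rank per-feature contributions to one prediction
# for a linear model (contribution = weight * deviation from the mean).
def reason_codes(weights, means, row):
    contribs = {name: w * (row[name] - means[name]) for name, w in weights.items()}
    return sorted(contribs.items(), key=lambda kv: -abs(kv[1]))

weights = {"PAY_0": 0.35, "LIMIT_BAL": -0.10, "AGE": 0.02}  # hypothetical model
means   = {"PAY_0": 0.0,  "LIMIT_BAL": 1.5,   "AGE": 35.0}
row     = {"PAY_0": 2.0,  "LIMIT_BAL": 1.0,   "AGE": 30.0}  # two months late

for name, c in reason_codes(weights, means, row):
    print(name, round(c, 3))
```

For this row, the late payment status dominates the other contributions, so "PAY_0 was two months late" would be the top reason code, which is the flavor of plain-English explanation described above.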


Arno Candel:


Yes, the best way is to get a key from our website. That way it's a longer-lasting proper key. This one is more like a toy key for this instruction here. Both will work for a few days or so, but the other one is the one that will be with you long term. Get the one from your official request on the website.


Mark Chan:


I'll also be quick. Similarly, tomorrow I'm doing something that really looks at these features, comparing some actual models I built: I took my own effort without Driverless, then ran it with Driverless, and then did both at the same time. Tomorrow around two or two thirty, something like that, I'll be doing something about that; you can check it out. We start digging into some of these descriptions you see here: if you hover over a feature, you see a really nice description. We'll work to improve that, but it already is really nice to me. It may take a little bit of getting used to some of the lingo that's in there, and the documentation will actually help you. The documentation's really good. You can read what target encoding is; there are some powerful things in there that, while you're using Driverless, we may assume you know, so at least we tell you what we're doing.


But the docs will help you if you need to get up to speed with some of that. Tomorrow we'll take a couple of problems and tear those apart, look at the top features that are created, and show that it actually tracks pretty well with some interesting concepts that I found myself. I looked at a problem, shifted it around a little bit, and I was actually kind of impressed with what Driverless was doing with some complex interactions. The other thing is that I've found it to be very stable, and it improves with each release. I ran one a couple of days ago where I had not put in a time series, and it was very overfit; I got an unrealistic loss coming out on the screen. I came back and checked it with 1.09 (not even today's 1.10), and I thought everything was worse, but actually it had done the proper thing. We've encoded some pretty interesting stuff in there. Dimitri spent a lot of time on the stability of these, and I've found that that really works, especially for IID data sets. But now we're adding some time-holdout methods, so that even if you don't know your data set is time-dependent but we detect that it is, we're going to help you build the right model, which is something that I'm personally passionate about. I'll do a little more about that tomorrow. It sounded like we had a lot of questions; I'm not sure of the best format. We have about a minute fifty left.


Components Of H2O 


Arno Candel:


There's a question about the different components. H2O GPU is a core piece of Driverless AI because we are using GPUs. There's XGBoost, there's truncated SVD, there's clustering, logistic regression, linear regression; all these models are implemented on GPUs now, and we are putting those into Driverless so that everything runs a little bit faster. Deep Water is more or less deprecated, as Sri said; we're not really supporting it. If you want to do deep learning, we propose you use Keras or something similar from Python, where you're more expressive. If you want to just start and play with it, you're welcome to use H2O's Deep Water, which does have GPU support, but we are not really using Java to call TensorFlow C++, right? We can just call it from Python if we need to. We will have TensorFlow in the product.


Where Can We See Transformations


Mark Chan:


There's a question about where we can see what the transformations are, and whether they're just linear. No, they're not. Again, in the middle section right below the knobs, where it tells you the percent complete, there's that feature importance. If you hover over each of those feature importances, you'll see that description, and again the docs help. These are not trivial transformations; they're really good. There are some complex interactions going on. So yes, we do tell you that, and like Arno showed (we didn't quite download it), you can actually download a data set that shows you the results of those transformations on both your training set and your testing set, if you've provided one. Those are both important pieces of getting this done, and the system does tell you the output and tries to describe what it did.


Arno Candel:

There's a file called features.txt that you get in the scoring package. It explains all the features: it has the names of the feature columns and then explains them in plain English. There's also a variable importance column, so you can see which columns are important. Some of them are zeros. That's because they're still included even though they're not necessarily picked up by that one final GBM model; maybe they're picked up by the ensemble. We are not dropping all the columns, so don't be scared if there are zeros sometimes. It's also an artifact of the genetic algorithm. You're bringing in a whole individual with a lot of genes in them, like a human, but not all genes are active and necessary. If you cut out a little gene somewhere, you're not going to just die. Basically, the point is that these genomes are mutating and all that; not all the pieces in there are a hundred percent necessary, but the individual that's the best in the end has a lot of good stuff. That's kind of how you should think about it. Now, this is the scoring package, and you can look at it yourself; it's all Python code. It's easy to go through. Basically, download it and play with it. That's the takeaway message, and thanks for your attention.