Get Started with Driverless AI Recipes - Hands-on Training - #H2OWorld 2019 NYC
This video was recorded in NYC on October 22nd, 2019. Slides from the session can be viewed here.
Michelle:
Hello everyone. Okay, that’s on now. We’re going to go ahead and get started soon with the Make Your Own AI Recipes session. Step one is I’m going to have everyone go to Aquarium. If you were in the previous Driverless AI training earlier today you’ve already done this step, but this will be the hardest step today. We’re going to go to Aquarium.h2o.ai and make an account. Once you’ve made an account we’ll go on to the next step, but all we’re going to do right now is go here and make an account.
Arno:
Thanks Michelle. Who here has not yet made this account, just so we know? A handful. Perfect. Just raise your hand if you need help and we can fix any problems. The goal today is to make custom recipes with Driverless AI. This is about ultimate freedom for data scientists, which they seem to enjoy, right? Data scientists want to understand every piece of the puzzle, and every time we do Kaggle we feel like, wow, there’s another secret sauce somewhere. We have a bunch of Kaggle Grandmasters around here, so maybe they can help explain later the custom recipes they have contributed, and you will get an understanding of how you make such a magic potion if it’s not already in the system.
Michelle:
All right. Website there for everyone and then once you log in you should see this page. I’ll give you just a minute or two more and then we’ll do the next step. Don’t click anything yet. We’re trying to see this webpage. Can I get a TA in the front, please?
Arno:
Which lab are we going to click on? I think we’re at that point now.
Michelle:
Once you’ve successfully logged in, what we’re going to be using today is this session: David’s Driverless AI 1.8.0. You really want to click on this session. If you click on any of the other ones it will take about 10 minutes to start up. This one, if you click on it, will start up automatically and you will be happier.
Arno:
Who was in David’s earlier session? A lot of people.
Michelle:
If you were in an earlier session, that’s a really good point. If you were earlier today after you click view details you can go ahead and end your lab and start another one. We have enough and that will give you the full amount of time since we’re past two hours at this point.
Yeah. These instances will be on for two hours and at the end of two hours they’ll turn off. Anything we do goes completely away. We can always start a new session, which is what we’re going to do for those of you who were here earlier today. But we won’t have what you built last time. Tomorrow you can come back and do it again if you want. All right. Has anyone not yet clicked on this lab? Are we all here? Okay, we’ll give you a couple more minutes. I see some hands.
Arno:
One more side note is that these instances are very cheap, right? They’re Amazon throwaway instances that are probably four virtual cores or something, or eight virtual cores. Those are not necessarily the fastest systems to run Driverless experiments, so if you think it’s not that fast, it’s probably because the instance is slow. What you really want is SSD drives and many-core CPUs: multi-core, multi-chip, multi-GPU, scale up basically. A fast single server, that’s usually the best. If you have eight GPUs on it you can run dozens of PyTorch NLP models on steroids. If you don’t have GPUs you can still fit LightGBM, that’s actually no problem, but anything NLP will become really slow unless you use the statistical methods that come out of the box. But any [inaudible] will be slow. Those you have to enable in the product. Once you enable them you really want to make sure you have GPUs, otherwise you’ll wait 100X longer.
Michelle:
All right. Hello to everyone that just walked in and joined us. What we’re going to do firstly before we jump into an overview of recipes is we’re going to go to Aquarium.h2o.ai, make an account, and once you’ve made an account you’re going to click view details on the lab that says David Driverless AI 1.8.0. Looks good. All right.
After you click view details there’s a place where you can start your lab and that’s what we’ll be doing in about 20 minutes after we start with an overview and explaining recipes and all of that. Is there anyone that has not yet made an account or found this lab? Raise your hand now or forever be lost. Okay, cool. We can go ahead and get started then.
What we’re going to talk about today in the make your own AI agenda, we’re going to talk about what custom recipes are, what BYOR is and how it fits into Driverless AI. Then we’ll actually talk about what recipes are. We’ll then together walk through a tutorial of how to use custom recipes. We also have a tutorial on how to write custom recipes which we can talk a little bit about today but probably won’t fit into the session. Then finally we’re going to have some of our experts on the stage to talk about how to write a recipe and give some examples of them.
At this point Arno, do you want to give us our intro to recipes? If not, I can keep going. You just mentioned it.
Arno:
I think you can do it just fine.
Michelle:
Okay.
Arno:
I’ll be here for questions.
Michelle:
Awesome, okay. As you’ve probably heard a lot today, Driverless AI has many key features. This is a platform for automated machine learning that can run in any major cloud or on prem. We do automatic feature engineering. We build visualizations for exploratory data analysis. We have machine learning interpretability so that you’re not building a black box but a model that you can really understand and explain to your business users. The types of use cases we’re doing are supervised machine learning, so things like classification. Will someone churn or not? Or maybe multiclass: which group does this customer fit into? Then we also do regression and forecasting, so predicting amounts or understanding what sales are going to be in every store for every week for the next three months. In addition to these use cases we can handle text data, and we can also take that final model, the MOJO, and deploy it just about anywhere.
What we’re really going to be focusing on today, at this point we’re assuming that you’ve seen Driverless before, and probably used it before, is bring your own recipe. This is the extensibility where you can add your own feature engineering, your own models, or your own KPIs or scorers into Driverless AI to help shape and fit your use cases and data even better.
Driverless AI is an extensible platform that works on many use cases. It was built across industries; it wasn’t built for one specific type of industry or use case, and there are many different types of use cases it helps with. Some of them are here.
What recipes really help with is letting you add your domain expertise, which might often already be written as models or Python scripts, into Driverless. So you can add your custom recipes to be tested alongside what we’re doing as well: our highly tested, robust algorithms that work across industries, along with your specific business knowledge of your domain.
This is the standard workflow for Driverless AI. We start by bringing in a tabular data set. This can come from basically any SQL data warehouse. Maybe your data is in the cloud, and we bring in the data set. We can do exploratory data analysis using AutoViz. This gives us the common statistical visualizations on every column and combination of columns in our data sets so we can understand that our data is as clean as we’re expecting and that there’s nothing suspicious in it.
The meat of Driverless is this automatic model building where we’re building new features on your data with automatic feature engineering. We’re testing many different algorithms and then we’re tuning those algorithms. This is done in a genetic algorithm where over many iterations different models compete against each other and the best ones win.
Finally, once you have your final model we have a couple of things to help you understand that the model is robust. First, we have model documentation, which is governance that allows you to understand what data is in the model, what features were used, what worked, what didn’t work. So you can really understand, on the back end, the technical experiment that went on. We also have machine learning interpretability so you can understand how the model makes decisions, which features are most important, which features are important for each individual person in your data, and so forth. Then all of this goes into an automatic scoring pipeline which can deploy to Java, Python, C++, or R, and this can easily fit into a REST server, a Java API, wherever you’re deploying models today.
What we’re going to be talking about is adding in your own recipes. It sits inside step three, the automatic model building, where out of the box we have built-in feature engineering that does everything from taking the month out of a date, because maybe seasonality is important, to truncated SVD, which is a sort of unsupervised dimensionality reduction that captures how different columns relate together. Now you can add in your own feature transformations. Maybe there’s a common transformation you have to do on every data set you have that relates to your domain. Or maybe there’s a specific algorithm that works for your use case that you want to add in that we don’t have. You can also add in your own scorer: maybe you don’t want to optimize for F1, maybe you know exactly what a true positive is worth, or what a false negative and a false positive cost, so you can optimize for dollars instead of the generic F1.
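As a sketch of that last idea, a dollar-based scorer is just a small Python class. The CustomScorer base class and score() signature below follow the shape of the public recipe templates; the class name, threshold, and dollar amounts are hypothetical:

```python
from h2oaicore.metrics import CustomScorer
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder


class DollarCostScorer(CustomScorer):
    """Hypothetical scorer: optimize net dollars instead of generic F1."""
    _binary = True
    _maximize = True            # more dollars is better
    _display_name = "Dollars"

    def score(self, actual, predicted, sample_weight=None, labels=None, **kwargs):
        # Map string labels (e.g. "no"/"yes") to 0/1, as the repo scorers do.
        y = LabelEncoder().fit(labels).transform(actual)
        preds = (predicted > 0.5).astype(int)   # hypothetical fixed threshold
        tn, fp, fn, tp = confusion_matrix(y, preds).ravel()
        # Hypothetical economics: a caught churner is worth $500,
        # a false alarm costs $100 in retention offers.
        return 500.0 * tp - 100.0 * fp
```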
We’ve been talking about them a lot, but exactly what is a recipe? A recipe is a part of the machine learning pipeline that helps us build models to solve a business problem. We have four main types of recipes in Driverless AI. We have transformations, which are done on the original data to create new columns. Some of this might be to clean up the data: maybe there’s a certain way you want to handle a null, or maybe you need to take a date and change the timestamp somehow. But it’s making new features.
We also have the ability to add in data sets so this can be done in two ways. One, we have a new data recipe where you can use Python code to upload data directly into Driverless AI. We also can use data sets inside of transformations. For example, we have an example open source recipe which can take a ZIP code and look up the population density, school districts, and things related to ZIP codes. So it brings in new data about different ZIP codes in the US.
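As a sketch, a data recipe is just a class with a create_data() method that returns one or more frames. The CustomData base class and signature follow the repo’s data recipe template; the lookup file below is a hypothetical stand-in for the ZIP-code example:

```python
from h2oaicore.data import CustomData
import datatable as dt


class ZipCodeLookupData(CustomData):
    """Hypothetical data recipe: pull a ZIP-code lookup table into DAI."""

    @staticmethod
    def create_data(X: dt.Frame = None):
        # Any Python can run here; each returned frame becomes a new
        # dataset inside Driverless AI.
        return dt.fread("/path/to/zip_code_database.csv")  # hypothetical path
```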
Another type of recipe is the algorithm that we’re actually modeling with. In Driverless we have things like LightGBM or XGBoost, but maybe you have a custom loss function you want to use with those, or maybe you just really, really like scikit-learn’s Extra Trees and you want to use that instead. Now you can add in those models. Then again we have the scorers or KPIs so that you can optimize your problems for exactly what you want to.
In general, we call all of these BYOR: bring your own recipe. You can add your domain expertise into Driverless.
Okay, but why do you care about recipes? What recipes do is they give flexibility and customization to Driverless AI. We’ve built Driverless AI to be very robust across many different use cases but again, your domain expertise is really important in your data. Now you can add that in to recipes and to Driverless and we can test your transformations on your domain knowledge along with our generic algorithms.
We also have, I think at this point, over 115 recipes that we’ve built and open-sourced on our public repo, and we’ll look at that today. Many of these were curated by our Kaggle Grandmasters or Arno, so that you have the best of the best as examples for how to write yours, but also to add into your experiments. If you really want to use, say, ARIMA or Facebook Prophet for your time series experiments, we have open-sourced those and you can add them into Driverless AI. All right.
Then recipes are really easy to use and we’ll be talking about this today. We’ll all build an experiment today where we’ll add in some custom recipes and see how they’re used inside of our experiment. We do treat these as first-class citizens. So recipes you add in will be used just the way that recipes are used out of the box in Driverless, so they’re not treated any differently and they’ll go through the same competition that the recipes that come with Driverless do.
At this point we’re going to actually get started with our tutorial. I’m going to have … I’ll be walking you through the tutorial today but if you ever want to come back and do it again it’s going to be on our public tutorial page and the one we’ll be walking through today is Get Started with Open Source Custom Recipes. If you want to have this tab open to follow along on your own browser today you can. If not, I will just walk you through the experiments. All right.
We’re going to jump back to Aquarium. At this point I’m going to have everyone go to Aquarium and we’re going to click on that David’s Driverless AI, view details, and we will start our lab. I’ll give you just a couple minutes. If you have any problems with this please raise your hand. We have lots of people around the room that can help you get going. Then once you hit start lab there will be a link at the bottom and this is the link that you’re going to click and it will take you to your Driverless AI instance. We’ll give just a couple minutes for everyone to get to this page, which is going to load slow for me because live demo. There we go. This will be the page you see. When we all get here we’ll get started.
Pardon? Yeah. Here’s the link.
Arno:
You can also find it from the H2O AI website under docs and then tutorials.
Speaker 3:
Arno, can you change the color scheme of Driverless AI?
Arno:
The question was, can you change the font or color scheme of Driverless AI? Yes, you can. You can invert it if you want black on white instead of white on black or something, that actually works too, but there is not really a full palette where you can choose arbitrary skins.
Speaker 3:
Font?
Arno:
Not really. So you would have to … you can talk to our designers but, yes. The original design was very futuristic two years ago and now we kind of live with it. I’m sure we can make a custom version of the skin. It’s only a few settings where the colors and the font are set. Okay.
Michelle:
All right. Please raise your hand if you have reached this page. I have. Then opposite, if you have not reached this page. We did it. Good job guys.
To log in, the username and password are going to be training. I will zoom in so that’s actually viewable. I’ll go ahead and sign in. All right.
In this instance we have many preloaded data sets. Question? Can I have a friend in the front? Thank you. Okay. All right.
On this instance we have many preloaded data sets and many pre-run experiments, but we’re going to actually load in a new data set together. I’m going to have everyone click on the add data set button and we’re going to click on file system. All right. There are lots of options when you click file system. There we go. You should see this and then we’re all going to click on data and it’s going to load super slowly because you’re all watching my screen. Okay. I’m going to … there we go. Got it. We’re going to click on data. We’re going to click on splunk and we’re going to choose churn.
Once you’ve selected churn you can go ahead and click import selection and that will load in the data set and it will also do some summary statistics so we can understand what’s in the data set. That’s data, splunk, churn. All right.
The first thing, many of us probably did this earlier in the hands-on training but we’re going to really quickly look at the data set so we know what we’re modeling today because that’s pretty important. You’re going to click on the churn data set and we’re going to click on details. All right. I like to start with the data set rows. It’s kind of readable up here.
What we’re looking at here is each row is a person who is a customer of a Telco. We have information about how often they speak during the day, how often they’re charged in the morning or the evening. If they do any international calling. How often they’ve called customer service because they had some sort of problem. Then what we’re going to be predicting, the column on the very end, is churn. Some of these people have left the company and some have not.
I’m going to click back on data set overview and here we can see every column in our data set, so I’m going to search for my churn column and here I can see that it’s a little bit of an imbalanced data set: 14% of my customers have churned. What we’re going to do today is build some models to see if we can understand which customers are going to churn and which ones won’t, so that in the future, as new customers come in or each month, we can see who’s likely to leave us. All right.
We’re going to start with no custom recipes and just run an experiment. To do that we’re going to click for actions and go to predict. All right. The first thing I’m going to do is click display name. I’m going to name my model. This is going to be our Baseline Model. You can name it whatever you want, but I’m going to call it baseline because we’re not going to add in any custom recipes. We’re going to just let Driverless do its own thing and then we’ll add in some custom recipes and compare.
Remember that Driverless is supervised machine learning so the only thing you have to do is tell it what you want to predict. What we want to predict is churn. Now we have many default settings that have been set for us. Is everyone at this page? Anyone need me to slow down a little bit? All right. I’m going to take that as a safe …
Arno:
You’ll notice we switch to log loss instead of AUC out of the box because we know it’s imbalanced; that’s one of the things we do for you right now, but you can override that with your custom scorer or you can pick a different scorer yourself. It’s just our defaults, and those sometimes change from month to month without us documenting everything. Apologies if it’s not always obvious what happens, but we do our best to make the models better out of the box.
Michelle:
Perfect. Yeah?
Speaker 4:
[inaudible]
Michelle:
Great question. There are two ways that we split the data in experiments. One of them is while we’re building many models, because that’s what an experiment does; we build hundreds of models. That’s automatically going to do cross-validation. We’ll actually see it here in the “What’s happening?” section. There we go. This tells me that in my experiment it’s going to do three-fold cross-validation. But I often want a holdout test data set at the end that wasn’t used at all in the experiment, so I can see that my model is actually robust. That part’s not done automatically, but we can click on data sets, and when we click on a data set there’s an option to split. Here we can do either a stratified split, if it’s binary classification, or a time split if it’s forecasting. Yeah.
Today we’re doing kind of a small data set and we’re focusing on recipes so I’m not going to split it, but if we were doing a real business case we totally would. Yeah. The question was, does Driverless automatically split data?
All right. I’m naming my model Baseline and we’re predicting churn. All right. We’re going to turn some of these settings down a little bit just for runtime, so we can run a couple of experiments today. I’m going to drop my accuracy down to four and my time to one. All right. Over on this side we’re going to see all the things that are going to happen. Everything here is models and feature engineering that comes out of the box. We have GLM, GBM, XGBoost, and then these are feature transformations that come with Driverless. In the second experiment we run, we’ll add in some feature engineering and we’ll see it show up here.
At this point we’ll go ahead and click launch experiment and have everyone start running their baseline model. Driverless gives us some nice notifications at the bottom. We see that we have a little bit of an imbalanced data set. We saw that in our data details page: only 14% of our data churned, which is good for our Telco. But we are given the option to turn on imbalanced models if we want to, so that’s a nice new feature of Driverless. But I’m going to close my notifications and we can see a model. We have our first model built, and in the center here we can see the features. Our top features are original features: how much someone is being charged, which isn’t too shocking. People are leaving the Telco because they don’t like how much money they’re spending. But it’s nice that Driverless is seeing that and validating our super-good Telco business knowledge.
We can also see a few engineered features here. We’re looking at customer service calls, and we’re looking at the interaction between how much people talk on the phone during the day and how they talk in the evening.
We can see several models built. Oops.
Arno:
One quick comment.
Michelle:
Yes.
Arno:
All the numbers we see here, the validation performance, are always on holdout data. It’s never on the data that the model was trained on; it’s always on unseen data. We’ll give you an estimate of how the model will do on new data and we’ll even give you an error bar, so in the bottom left you always have a little error bar. That’s basically the bootstrap variance, or the standard deviation of the bootstrap, meaning we basically do a bunch of repeated scorings on unseen data and we’ll give you the distribution of that. It’s our best effort to give you the estimate of the number. Sometimes you’ll see a confusion matrix with floating point values; that’s also because it’s the best estimate of the number. You don’t need to necessarily use integers just because the data has integers. If that’s our best estimate we’ll give you that. All numbers are always on unseen data.
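The bootstrap idea Arno describes can be illustrated in a few lines of plain Python (this illustrates the concept, not DAI’s internal code):

```python
import numpy as np
from sklearn.metrics import log_loss


def bootstrap_score(y_true, y_prob, n_boot=100, seed=0):
    """Mean and std of a metric over bootstrap resamples of holdout data."""
    rng = np.random.RandomState(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.randint(0, n, n)  # resample holdout rows with replacement
        scores.append(log_loss(y_true[idx], y_prob[idx], labels=[0, 1]))
    # The std is the "error bar" shown in the bottom left of the experiment.
    return np.mean(scores), np.std(scores)
```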
Michelle:
Thank you. Yeah. All right. At this point hopefully everyone is at about 100%. No? We’re not at 100%. Are we running the experiment?
Speaker 5:
71%
Michelle:
It’s trying hard. That’s almost 100. We’ll round up. We’re going to now all jump to the GitHub repo where we’re going to get some of the recipes we’ll be using today. I’m going to open up a new tab and we’re going to go to GitHub and we’re going to search for H2OAI Driverless recipes. That’s kind of hard to see, isn’t it? There you go.
All right, before I click raise your hand if you have clicked here and you know the link and you’re all good to go. Okay, about half. I will wait. All right. Let’s try again. Who has made it to the Driverless AI recipes repo? And who has not? All right. Still waiting. I’ll be patient.
Arno:
[inaudible]
Michelle:
To Driverless? Do you want me to zoom somewhere? What do you need? I went to a new tab and I went to GitHub.com. Yeah, yeah, yeah, yeah. Good point.
Arno:
In case you haven’t seen the other repositories, there’s the H2O repository itself, right, and then there’s also datatable. That’s the Python datatable repository; in Driverless we’re using Python datatable as our workhorse for data handling. It’s a very good tool if you want a pandas replacement for fast grouping, sorting, joining, file reading, and file writing. It can write CSVs faster than other tools can write binary. It’s really good and we use it for Driverless AI. And it’s totally open source, so that’s the H2O AI datatable repository, but in this case we’re looking at the recipes.
Michelle:
What was the question?
Speaker 6:
[inaudible]
Arno:
I was saying datatable is faster than pandas because it’s a recent development in C++, highly optimized, and some of the recipes that we’re showing you will use datatable as the basic tool, basically.
Michelle:
It’s in Python, as well.
Arno:
There is an R version and a Python version, and the main developers of both are working with H2O. Yeah.
Michelle:
Okay, all right. Once we’ve made it to the Driverless AI recipes repo, here on the home page the readme is very thorough. It explains basically everything I’ve told you today: what recipes are, how to use them, why you care. But for the part we like, just go down to the bottom and we can see the sample recipes. Actually, I’m going to have everyone scroll back up. In case you haven’t used GitHub before, one of the benefits of GitHub is you can have different branches. We’re all running Driverless AI 1.8.0 today, but there are many versions of Driverless, so there are many versions of the recipes. What I’m going to have everyone do is click on the branch button inside of Driverless recipes and search for 1.8.0. Whenever we use recipes we want to make sure the recipes we’re using have actually been tested with the version of Driverless we’re running. If you’re on 1.7 you’ll use 1.7 recipes; on 1.8 you’ll use 1.8 recipes. I know that’s hard to see. Sorry. There we go.
Arno:
There’s also a link down in the readme page to get to that specific branch in case you don’t know how to find that. There shouldn’t be too many branches.
Michelle:
Awesome. All right. Once we have clicked on 1.8.0 we’ll see that for 1.8.0 there are 107 recipes, and the latest branch has 115, so we’re always adding new ones. What we’re going to add first today is a feature transformer. We’re going to scroll down to the bottom, to transformers. Out of the box, Driverless has a feature called interactions that will look at mathematical operations between two columns. We actually saw that in our model. There it is: we added two columns together and that was a new feature that was important in our model. But one thing you might have noticed about this data set is we have day, evening, and night.
So there are multiple columns that it might make sense to add together. What we’re going to build together is a transformer which takes three or more columns and adds them. It could also add, subtract, multiply. Well, you’d have to think about how subtract would work but it can add or multiply. But we’re going to add in a sum transformer so we’ll add three or more columns and then we’ll run that on our data set.
We’re going to scroll down to numeric because we are going to be augmenting our number data. This is transformers, numeric, and then we will all choose sum.
A little bit later today we’ll probably walk through this code and talk about sort of what’s happening but at this point all we’re going to do is take this recipe and put it in Driverless. You get to trust me that it takes three or more columns and adds them together. To do that we’re going to click this button that says raw and we’re going to copy the URL at the top. All right. I can come back here for anyone that needs to but the next step, so we’re just getting this URL and this is a way to link to recipes that are, say, in GitHub or on the web somehow.
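For those reading along, the recipe behind that raw link is only a handful of lines. Here is a minimal sketch of a row-wise sum transformer, modeled on the repo’s transformer template (the official SumTransformer may differ in its details):

```python
from h2oaicore.transformer_utils import CustomTransformer
import datatable as dt
import numpy as np


class SumTransformer(CustomTransformer):
    """Sum three or more numeric columns into one new feature."""

    @staticmethod
    def get_default_properties():
        # Ask the genetic algorithm to hand this transformer 3+ numeric columns.
        return dict(col_type="numeric", min_cols=3, max_cols="all",
                    relative_importance=1)

    def fit_transform(self, X: dt.Frame, y: np.array = None):
        # Stateless transformer: fitting is the same as transforming.
        return self.transform(X)

    def transform(self, X: dt.Frame):
        # Row-wise sum across whatever columns DAI handed us.
        return X[:, dt.rowsum(dt.f[:])]
```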
There are two ways to upload recipes. One is you can give a link to wherever it’s stored. The other is, if it’s on your local machine, you can just upload it. If you’re going to start building recipes, one of the things you might want to do is download this GitHub repository and then maybe change some of them or use them straight from your computer. But today we’re going to just upload them from this link. We’ll copy this link and then we’ll go back to our Driverless AI instance. We’re going to click on experiments and we’re going to have our baseline experiment here. I’m going to click the three dots on the side and I’m going to say new model with same parameters.
If you remember, we changed the dials so that we would run really fast. So that we don’t have to remember what those are, we’ll just click new model with same parameters and then we’ll get the same dials. If we had changed any expert settings, we would get the same expert settings, too. If you’re making lots of changes for your experiments this is a nice way to not have to remember what it all was.
Yes?
Speaker 7:
[inaudible]
Michelle:
Yeah, we haven’t done it yet. I promise, we’re getting there. Yeah.
Speaker 7:
[inaudible]
Michelle:
Yeah. To experiments? Yes. I’m going to click on the baseline experiment. This will be named whatever you named yours. I named mine baseline. I’m going to click new model with same parameters.
Arno:
And this will use the feature brain that was saved from last time. This time the model will already be smart and use whatever it learned from the first experiment, because in the expert settings the feature brain level was not set to zero but to two, which is the default. That means if it found good features it will retry those to start with. But obviously if you have new transformers, like the great special transformer Michelle just added, then maybe that guy will win; we’ll see.
Michelle:
Good, all right. Does everyone see this familiar experiment screen? It should be basically identical to what we saw earlier except for the display name will be 1.Baseline. Raise your hand if you’re not here? Okay, awesome. Okay.
Speaker 8:
Here’s the issue with Internet Explorer.
Michelle:
Okay, okay.
Speaker 8:
It takes forever.
Michelle:
Okay. If you want to just set up a new experiment, Joe, it’s 4-1-8 on the settings. Okay, all right.
The first thing I’m going to do, just so I remember what I did in this one, is change the display name from 1.Baseline to Sum Transformer, because that’s what we’re going to add to this one. To do that, to add recipes, that’s our question, we’re going to click expert settings, and we have three options at the top. The option we’re going to use is load custom recipe from URL. We’ll click this and this is where we will paste our URL.
Yes, expert settings, which is right above the scorer. So that’s expert settings.
Speaker 9:
These custom recipes, are they all Python, or can they be something else?
Michelle:
At this time they are Python. The question was, do custom recipes have to be in Python? If you have something in, like, Go or R, as long as it’s wrapped in Python, that’s fine. Arno can answer that better, but basically there’s Python involved.
Arno:
You basically upload plain source code, and if it’s scikit-learn or pandas you don’t need any extra imports or whatever, but if you have something extraordinary, some Facebook Prophet or something, you have to be able to specify those dependencies. We can say Facebook Prophet, these versions, and then it will do an install for you on the platform while it’s running. So you don’t even have to restart or anything. You just upload the plugin. It definitely is Python, but you don’t have to only use Python. You can call anything. You can make a REST call to Google’s AutoML and say, you decide what I should do, right? Roughly speaking. But then the production pipeline will be a little weird because that also will make those REST calls, right? So we have to think about how far you want to go with it. If you just want to play around, you can do whatever. If you want to productionize it then you have to be more conservative.
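Declaring those dependencies in a recipe is a one-line class attribute. The attribute name comes from the public recipes repo; the Prophet version pin below is illustrative:

```python
from h2oaicore.models import CustomTimeSeriesModel


class MyProphetModel(CustomTimeSeriesModel):
    # DAI pip-installs these on the running instance when the recipe is
    # uploaded; no restart needed. The exact version pin is illustrative.
    _modules_needed_by_name = ["fbprophet==0.4.post2"]
```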
Michelle:
Awesome, all right. Let’s go ahead and add in that recipe. Again, that’s going to be expert settings and then the center option, load custom recipe from URL. All right. Then I will paste in my recipe and hit save.
What you’ll see, it went really fast with this recipe, but there was a quick little wheel where acceptance tests were being run. When you write a model, say, you can declare what types of problems it’s for; maybe you write a model that’s only for binary classification. We will do a quick test on small data to see how it goes. So we’re testing your transformers, your scorers, and your models for best practices; for example, you don’t want a transformer that changes the number of rows in your data set or something like that. Those safety checks are built into Driverless and it will give you mostly helpful error messages if something fishy goes on. Good. All right.
We all added in our custom recipe, so we’ll go ahead and click save. Awesome. Then I’m going to zoom in and you’ll see that sum is now listed as one of the features it’s going to try in my experiment. One thing, Arno, I’m going to let you talk about this one: there’s a new option in expert settings for official recipes.
Arno:
That’s just a link to GitHub.
Michelle:
It doesn’t … I thought it loaded all of them.
Arno:
This we could have used to go to that website directly, to the right branch.
Michelle:
Got it.
Arno:
It’s a shortcut.
Michelle:
So you don’t have to know what version you’re in. Perfect. We did it the hard way but there’s an easy way. The fun thing about Driverless is there’s new features all the time. It’s always getting better.
Arno:
One thing to show quickly: under recipes there’s a pull-down where you can choose which transformers to use.
Michelle:
Yes.
Arno:
Just in case you want to make sure that sum is enabled, it will actually show up yellow there as one of the enabled ones. You can disable specific ones, and in the next version you will also be able to specify which column gets which transformers. Sometimes you don’t want age to be binned, let’s say. You want it to be purely numerical, not categorical. Then you can say, do not do any binning on it, for example. You can specify per column what transformers to allow, but not in this version.
Michelle:
Awesome, great. Again, the location for that, so if you go in expert settings to recipes there in transformers, models, and scores you can turn on or off any recipe including custom ones we just uploaded. We just uploaded the sum transformer so it’s available on the system now. Maybe for fun we want to try an experiment where we only use the recipe that we just built to see how it does on our model. We probably wouldn’t use that for sum but maybe for your business domain knowledge you might want to test only your model or only your transformation and so you could do that.
Arno:
There’s the original transformer and then there are the cat original ones, so for categorical and numeric columns you can say just use them as is, don’t change them. But then the model has to be able to handle categoricals natively, right? Which LightGBM can do but [inaudible] cannot, so we will only accept what makes sense. We are smart enough to not let you screw up our models, but if you have your own models then you have a Boolean that says, I accept non-numeric input, true or false. By default it says no, it has to be all numeric. But if you have a factorization machine or an FTRL or something with hashing that takes any input, any strings, and turns it into a vector inside, then you can say yes, I can handle non-numeric data. That’s it.
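In recipe code, that Boolean is just a class attribute on your model (the attribute name follows the public model template; the class here is hypothetical):

```python
from h2oaicore.models import CustomModel


class MyHashingModel(CustomModel):
    # Tell DAI this model can ingest raw strings/categoricals directly,
    # e.g. because it hashes them internally. Default is numeric-only input.
    _can_handle_non_numeric = True
```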
So there are all these configuration options, and in GitHub, as hopefully we’ll see later, there are templates that show the source code of actual custom transformers. It’s all documented; all the fields that you can set in Python are documented. Let us know if it’s not clear or if you have questions, but we did our best to document as much as we can. This is only the very simplest example. And maybe we can look at the source code for that sum transformer too while it’s running.
Michelle:
Absolutely. All right. For time today so we can jump into some of the source code soon, we’re going to go ahead and add in a score and a model now so we can see how that works and then we’ll run one experiment that has three different custom recipes we added in. I’m going to have everyone jump back to their GitHub and you’ll probably see the raw link that you copied. You can just click back and that will take you back here.
This time I’m going to go back to recipes because I don’t want any more transformers. We’re good with our transformers. We will go ahead and choose a scorer. This time, instead of scrolling down the readme, I’m going to jump straight into the type of scorer I want. We’re going to all click scores, and since we’re doing a binary classification problem we don’t need any regression scorers, just classification scorers. We have this all sorted so you can kind of find what you’re looking for. Then again, we want binary classification problems.
We have a couple of custom recipes you can choose from. The one we’ll use together today is false discovery rate. In a way this is basically a type I error rate, the complement of precision: of everything we flag, how often we flagged it wrong. We’ll click on false discovery rate, FDR. The process is to click raw. Raise your hand if you haven’t made it to this place in the repo. Awesome, we’re doing good. We’ll click raw, we’ll copy the link, and we’ll jump back to Driverless. All right.
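The core computation inside that scorer is tiny; the repo recipe wraps something like this in a CustomScorer subclass, the same shape as the dollar-cost sketch earlier:

```python
from sklearn.metrics import confusion_matrix


def false_discovery_rate(y_true, y_pred):
    """FDR = FP / (FP + TP): of everything we flagged, the share flagged wrong."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp / max(fp + tp, 1)  # guard against dividing by zero
```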
In the expert settings same exact step. Load custom recipe from URL. We will paste our recipe. We’ll run some acceptance tests that run really fast and we’ll hit save. Now if we want to add this new score that we just added in we’re going to click the scores button, which is currently log loss, and we’ll see that FDR is there. That’s our false discovery rate that we just added in. Now this model won’t directly compare to the other because we’re switching scoring metrics but we can optimize for basically any business metric or KPI that you need to in your business. All right. Then the last one, was everyone able to add an FDR? Roosevelt’s in? All right, cool.
We’ll go back to the GitHub and click the back link. Go back to recipes, and this time we’re going to choose a model. We’re all going to click on models. For the types of models, we have some time series models, which are things like ARIMA or Facebook’s Prophet. We have some NLP models so that you can do modeling specifically related to text. And then we have custom loss, which is changing the loss function for things like XGBoost. But we’re going to go into algorithms and just add in a completely new algorithm.
There are several we could choose from. Maybe you really like CatBoost, or you’ve seen that KNN works for your use case. But what we’re going to do today is Extra Trees. This comes out of scikit-learn; it’s just using scikit-learn’s Extra Trees inside of Driverless AI. So we’ll click on that one and get our URL link. That’s Extra Trees.
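The heart of that recipe is scikit-learn behind DAI’s model interface. Here is a trimmed sketch following the repo’s model template (the real recipe handles regression, preprocessing, and importances more carefully):

```python
from h2oaicore.models import CustomModel
from sklearn.ensemble import ExtraTreesClassifier


class ExtraTreesModel(CustomModel):
    _binary = True
    _multiclass = True
    _display_name = "ExtraTrees"

    def fit(self, X, y, sample_weight=None, eval_set=None,
            sample_weight_eval_set=None, **kwargs):
        orig_cols = list(X.names)            # X arrives as a datatable Frame
        model = ExtraTreesClassifier(n_estimators=100, n_jobs=-1, random_state=1)
        model.fit(X.to_numpy(), y, sample_weight=sample_weight)
        # Hand the fitted model and per-feature importances back to DAI.
        self.set_model_properties(model=model, features=orig_cols,
                                  importances=model.feature_importances_.tolist(),
                                  iterations=model.n_estimators)

    def predict(self, X, **kwargs):
        model, _, _, _ = self.get_model_properties()
        return model.predict_proba(X.to_numpy())
```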
I’m going to go back to Driverless, into my expert settings, click load custom recipe, and add in my Extra Trees. This one will probably take a little bit longer and we can actually see it testing: it’s seeing that Extra Trees works for regression, it’s making sure it works for binary classification, whether the binary target is a number or a category. We’re just going through and making sure that the recipe is actually going to work and not crash in the middle of an experiment. All right. Once the acceptance tests have passed we’ll click save, and then we can again zoom into the “what do these settings mean” panel, and there’s our good pal, Extra Trees.
Again, like Arno was saying, if we wanted we could then go into expert settings and maybe turn off GLM or Light GBM because we only want to test Extra Trees, or maybe we decided I uploaded this recipe but I changed my mind. I actually don’t want it anymore. So we can go back into the recipes and unselect it. All right.
At this point we have … I’m going to change my display name because it’s not accurate anymore, and I’m just going to say that this is BYOR, because we have a custom scorer, a custom algorithm, and custom feature engineering. Then just to be super clear, we could absolutely add more than one of each of these. If you wanted, you could add everything from the GitHub, but it would probably run a bit of a long time.
Arno:
Yes, you can actually provide the whole GitHub root directory, so just github.com/h2oai/driverlessai-recipes, and paste that there, and it will upload everything, which will give you a lot of choice, but it will also take maybe 20 minutes or so to test everything.
Michelle:
Yeah, all right. Raise your hand if you’ve made it to this page with all your custom recipes? Awesome, looking good. Now we’ll go ahead and click launch experiment.
As this runs we’ll see a couple of different algorithms popping up here, including our Extra Trees. I don’t need the notifications right now. We’ll probably see our sum feature engineering pop up, and then all the scores we see will actually be false discovery rate. So it’s going to be different from our other model, which was log loss, but we’ll be looking at the false discovery rate there. While we give that a few minutes to run … it’s actually starting. There’s our false discovery rate so far, and currently we have an XGBoost GBM, but some of these models will be Extra Trees. All right. Here it goes. We’re tuning. At the top it just said tuning Extra Trees, so it should be there soon.
There’s our Extra Trees model. It’s currently not doing as well as just XGBoost GBM, and that’s sometimes the case. Maybe the recipes we add in aren’t going to do as well as what’s already in Driverless, especially if they’re not domain specific and we’re just trying another algorithm. But this is a way to test it out and see how it works. It looks like we’re almost at our final model and XGBoost is still winning. Yeah.
At this point we’re going to jump back to the GitHub and talk about, a little bit about how do you start writing recipes and we’ll do that a little quick because we’re close on time today. But then we can take some questions, too. All right.
Okay. So in addition to consuming recipes, which is what we all did today, we took some of the 115 that are publicly available and used them, you might want to start writing your own. Take your business knowledge, which maybe is already written in Python, write it so that it can run in Driverless, and then start testing that as well. We wanted to quickly go over the process. When writing recipes, it’s usually good to start with a sample data set in something like PyCharm, or maybe you really like Jupyter notebooks, and build the recipe outside of Driverless. Just test that the algorithm you want works.
In my example, I started with my sum transformer, which takes three or more columns, so I just started with a pandas dataframe which had seven columns and then looked at how I’d add those together. That’s pretty straightforward in pandas, it’s dataframe.sum, but for more advanced recipes you might have to do a little more debugging.
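That prototyping step really is one line of pandas (the column names below are made up to mirror the churn data):

```python
import pandas as pd

# Toy frame standing in for a few of the churn dataset's numeric columns.
df = pd.DataFrame({"day_mins": [180.0, 210.5], "eve_mins": [200.1, 150.0],
                   "night_mins": [244.7, 162.6]})

# The whole sum-transformer idea, prototyped outside Driverless:
df["total_mins"] = df[["day_mins", "eve_mins", "night_mins"]].sum(axis=1)
print(df)
```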
We next recommend downloading the Driverless AI recipes repository and it has all the examples. Then you get all 115 so some of those might be similar to what you’re trying to do and you can start to understand how recipes work and see some example code there.
We also have the recipe templates, which we’ll look at really quickly today. There’s basically a template for models, feature engineering, and scorers. It gives you every possible option and explains them pretty well, with good comments on what they do, why you’d include them, and how to write them. Probably the best path, I would say, is to write your code so it works in plain Python first, and then go to a template and add it into the template.
What does it take to actually write a custom recipe? The only thing you need is a text editor, somewhere to write Python. I would recommend something like PyCharm. PyCharm is really nice because, if you download the whole Driverless recipes repository, you can have all … are we done with pictures? Yeah. Okay, I’ll show you. There we go. You can have all the recipes on the side. This is the full Driverless AI recipes repository, so if I’m writing a model I can open every model and look at how they work. That can be really helpful. Then you can also debug in PyCharm. All right.
That’s really the only thing you need, other than a Driverless AI instance to actually run it on. Bring your own recipe is for 1.7 and later, so recipes won’t work on 1.6 or 1.5. But the 1.7 series and the 1.8 series, which is the latest release, are where you can actually run your recipes. To test your code locally you need Python 3.6, datatable, and the Driverless AI Python client. Then to write a recipe you need to know a little bit of Python, so that’s sort of the skillset required … yes?
Speaker 10:
Will Python 2.7 work, or only 3.6?
Michelle:
Just 3.6 at this point.
Speaker 10:
Thank you.
Michelle:
Yeah, all right. Then one more thing is how to actually test recipes. The easiest way to test whether your recipe is working is to upload it to Driverless AI, where we run the acceptance tests, and it will return an error message saying why your recipe didn’t work. Maybe it returned fewer rows than it started with, which a transformation can’t do. Or there’s some other issue; maybe you’re writing a model which uses PyTorch and it wasn’t able to install it, something like that.
The next way would be to use the Driverless AI Python or R client to automate this process. We have some example scripts on the GitHub of how to do this in Python. Basically it’s a script where you say which data set you want to use and where the transformation lives on your computer, and it will upload the data set, upload the transformer, turn everything off but that, and then return any error messages. That’s a nice way to avoid pointing and clicking and uploading; you can just make changes and quickly iterate.
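In rough pseudocode, such a script has this shape. The method names are recalled from the 1.8-era h2oai_client and should be checked against the example scripts in the recipes repo:

```python
# Rough shape of an automated recipe-test loop; verify method names against
# the example scripts in the driverlessai-recipes repo before relying on them.
from h2oai_client import Client

h2oai = Client(address="http://localhost:12345",
               username="training", password="training")

dataset = h2oai.upload_dataset_sync("churn.csv")
# Uploading the recipe triggers the same acceptance tests as the UI does,
# so any failure comes back as an error message you can iterate on.
recipe = h2oai.upload_custom_recipe_sync("sum_transformer.py")
# ...then launch a small experiment restricted to just that transformer
# and read back errors from the experiment logs.
```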
Then if you want to test locally, recipes extend a base class, which is what we’ll look at, and there’s a dummy version of it so you can actually run recipes on your local machine without Driverless. Those are the three main ways of testing.
Then the last thing before we jump into showing you where the templates are: what happens if you get stuck writing custom recipes? There are a couple of things you can do. First, error messages and stack traces from Driverless AI: if you look in the logs, or if your recipe doesn’t pass an acceptance test, you’ll get the full stack trace there. Sometimes that helps with pinpointing what’s going on. But maybe you get that full log and it doesn’t mean anything to you; what next? You can take the logs. First you can look at the FAQ and the templates to see if there are issues similar to yours; maybe that gives you a direction.
Actually, it’s not coming soon anymore; there is a tutorial now. Today we talked through the tutorial on how to use recipes, but the second tutorial I’ll point you to is how to actually write them, and it walks through writing the three recipes we used today: the sum transformer, FDR, and Extra Trees. You can look there in case there’s a step you’re missing. Then finally, we have a Slack community channel, and our experts are on the channel to answer questions, so you can post logs and we can tell you what’s going on.
Arno:
Just one last comment. You can use datatable if you’re interested in learning something new, or you can use pandas or NumPy. You can do to_pandas or to_numpy and then you’re in old-school country where everything is as expected. But if you want to learn how datatable works and see how fast it is, you can just stay in datatable and learn that new API. One of our Kaggle Grandmasters has ported a lot of code to datatable, so most of our examples will actually be in datatable because that’s the fastest way to handle big data at scale. You don’t want to do to_pandas too often unless you have to, because datatable is multithreaded and pandas is not.
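A quick taste of the datatable API Arno is describing, including the crossings back to pandas and NumPy (the file and column names are made up):

```python
import datatable as dt
from datatable import f, by, mean

DT = dt.fread("churn.csv")                   # multi-threaded file reader
agg = DT[:, mean(f.day_mins), by(f.state)]   # grouped mean, using all cores
pdf = DT.to_pandas()                         # cross back into pandas...
arr = DT.to_numpy()                          # ...or NumPy when you must
DT.to_csv("churn_copy.csv")                  # fast multi-threaded CSV writer
```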
Michelle:
Perfect, great. The last thing I want to tell you about before we jump in: inside of models, transformers, and scores there is a template file. For example, the model template file extends the base class. It has a lot of commented code and it’s everything that could possibly be in a model, every option. There are things like saying whether our model is allowed to be regression, binary classification, or multiclass; you can turn those on and off. It tells you how to actually write the fit function and the predict function. We’re not going to walk through this today for time, but I’d absolutely check out the templates. They’re a really good place to start, and you can just copy the template and then change the things you need to in order to write your recipe.
At this point, we’ve talked about some basic transformers; we added some columns. But in addition, we can write custom recipes for very specific verticalized industries. We’re going to actually talk about some of that for a few minutes. Yes, let me get the slides for you. It’s going to be just a second. Do you want to chat while I’m getting them? Okay. Yes, I got them.
Speaker 11:
I think part of the thing is that folks have been here for a couple of hours. You’ve learned about the basics of Driverless and then gone into what recipes are and how to use them, and so what we want to show next is the art of the possible: more advanced recipes and what can be done. So, to complete the picture: you’ve learned a little bit about Driverless, learned about recipes, and now, how far can you go with something like this.
Speaker 12:
Hey. I’m guessing you guys can hear me. She just worked it. I’ll take you through three different use cases that I worked on where we’ve used custom recipes. One, of course, is anti-money laundering, another is malicious domains, and the other is something called DDoS detection, distributed denial of service attacks, in the field of cyber security. The first one is in the area of money laundering and fraud, and that too comes up in the field of cyber security. What Michelle showcased is one specific transformer that you guys used to transform the data in a certain way so that it fits the model. What we’ve done is taken use cases where we’ve built extensive sets of features, hundreds of features, specifically designed for specific use cases, and the first one I’m talking about is the anti-money laundering one.
What we very simply tried to do is reduce the false positives that exist in the system. Essentially we have a bunch of alerts which say they are money laundering. Not all of them actually are, and this is a global problem: about 75% to 99% of these alerts are false positives. It’s not a big deal to have these false positive alerts, but it does cost you; the manpower required to work through them is quite extensive, which is why reducing them actually helps a lot.
For that we actually built an entire module, a package, which purely solves anti-money laundering, and this package has different custom transformers that are very specific to this behavior. Using that, we’ve actually reduced the false positive rate by over 94%, and this is how you can think about using custom recipes as well. You can have your data scientists, and most of you here are data scientists yourselves, start building features as part of Driverless AI, which stack into a larger, comprehensive set of features or transformers that eventually become an entire package for a particular solution, if you want to look at it that way.
The second example that I want … by the way, please let me know if you guys have any questions.
The second example I wanted to talk to you about is malicious domains. This is a very big problem that exists globally in the field of cyber security. What happens is there’s a really large number of dynamic domains that come up that attackers use as a means of trying to phish you, trying to launch an attack on you, your organization, any of these things. These are usually vectors for DDoS as well. One of the things the models need to do is detect that this domain that has come to you, through your email or on a website or any of these things, is actually malicious. Essentially, there’s literally probably 300 milliseconds or so that the system has, from the time you click the link, to verify whether it’s malicious or not and then come back to you saying, okay, this link is good or this link is bad. Essentially, that’s the whole process.
The model that we built looks at different aspects of the domain: the health of the domain, the longevity. When was it registered? How was it registered? What’s the formation of the domain? What kind of location is it at? It looks at many different features in this case, and then it’s able to classify whether this domain is good or bad. That model essentially gets fed when you and your organization go ahead and click a simple link. The simple process of clicking a link on a webpage, you know, early morning you’re having your coffee and trying to see what’s in the news, that goes through some of these models. And if you look, we really do get pretty good performance in most of these custom models, and most of these use Driverless AI to build the model and deploy it.
Finally, the third one I wanted to talk to you about, as part of the whole custom solution development, was something called DDoS detection. DDoS, for people who don’t necessarily know, is distributed denial of service attacks, where a large number of resources around the world, usually botnets, are marshaled to attack an organizational resource. It could be pure revenge, it could be for monetary purposes, it could be to side-channel another attack. A lot of organizations face DDoS attacks when they’re being attacked through another channel. This is a very common mechanism and vector, which is why the model we’ve built tries to identify early warnings for any kind of DDoS that is happening around the world. This model, too, does a phenomenal job, but the thing is, all three of these use cases, or these three solutions I was talking about, have a specific set of features that we engineer that form a package for each solution.
These packages contain very specific transformers, many more than the one that was showcased to you. They also contain custom scorers, and sometimes custom models as well, all for specific purposes. You can look at these three as applied examples of how you could use recipes to eventually build a model and then apply it to problems in whatever space you work in.
Having said that, if you guys have any questions please let me know. If not I’ll … yes, please.
Speaker 13:
[inaudible] so what amount of retraining did you have to do to get to a 90% accuracy?
Speaker 12:
Could you let me know which one you’re specifically asking about, or is it just any one of them?
Speaker 13:
All three of them, to be honest. All of them are in the 90s, the accuracy of the models. Generally, you know, most data scientists settle for somewhere between 75 and 80%, so what the incremental effort is, is where I’m going with the whole thing.
Speaker 12:
That’s actually a fantastic question, and I’ll tease something else out of it. One thing is that I specialize in the field of anomalous behavior and malicious behavior. When I go through the process these things kind of come easy to me, because that’s what I do; it wasn’t that much retraining, I went through probably a couple of models and so on. What I wanted to draw out of that is, if you have subject matter experts in your organization, those subject matter experts could work along with the data scientists, or your data scientists themselves could be subject matter experts, and they could build a model with much less effort. In addition to that you have Driverless AI itself, which gives you the best possible model at that end of the pipeline. That kind of helps the process. I’m guessing that answers your question. Yeah.
Speaker 14:
For DDoS, the false positive approach you had in fraud detection or money laundering, can you apply it interchangeably? Because there are false positives in DDoS also; there are favorable events happening which look like DDoS but are not.
Speaker 12:
Yeah.
Speaker 14:
So you can apply the same false positive algorithm-
Speaker 12:
You’re saying the detection of false positives in DDoS?
Speaker 14:
Yeah.
Speaker 12:
You can. You can actually do that, and that’s one of the fundamental things we stick to: trying to identify good behavior versus bad behavior. I mean, if you look at the very fundamental essence of how DDoS happens, the timing window is very narrow when there’s an attack happening, and when there is no attack there’s a huge window where nothing is happening. We use a lot of features specifically as a function of time. I wouldn’t say time series itself, but features as a function of time, which tend to give us more value: what is the actual behavior and what is not.
Speaker 14:
A false positive in DDoS, and I’m pretty deep in DDoS, that’s why I kind of know this, isn’t really about DDoS; it isn’t always an attack, it’s an event.
Speaker 12:
True, yeah.
Speaker 14:
Some news occurs and then your website gets hit.
Speaker 12:
Yeah.
Speaker 14:
That’s what I meant.
Speaker 12:
Okay. You’re talking about the Slashdot effect?
Speaker 14:
Yeah.
Speaker 12:
You can use the same model for the Slashdot effect as well. It just depends on how you define the problem. You can use the same model, is what I’m saying. That’s about it. Great. Thank you guys. Thanks.
Michelle:
All right. Thanks everyone for coming to the session. If you have any questions you can always come and ask us and I think there’s another session in this room.
Speaker 12:
Yes.
Michelle:
Okay. So we’ll let that happen now. Thanks.