Get Started with Driverless AI Recipes - Hands-on Training - #H2OWorld 2019 NYC
This video was recorded in NYC on October 22nd, 2019. Slides from the session can be viewed here.
Michelle:
Hello everyone. Okay, that's on now. We're going to go ahead and get started soon with the Make Your Own AI Recipes session. Step one is I'm going to have everyone go to Aquarium, so if you were in the previous Driverless AI training earlier today you've already done this step, but this will be the hardest step today. We're going to go to Aquarium.h2o.ai and make an account. Once you've made an account we'll go on to the next step, but all we're going to do right now is go here and make an account.
Ā
Arno:
Thanks Michelle. Who here has not yet made this account, so that we know? A handful. Perfect. Just raise your hand if you need help and we can fix any problems. The goal is to make custom recipes today with Driverless AI. To let you know, this is the ultimate freedom for data scientists, which they seem to enjoy, right? Data scientists want to understand every piece of the puzzle, and every time we do Kaggle we feel like, wow, there's another secret sauce somewhere. We have a bunch of Kaggle Grandmasters around here, so maybe later they can help explain the custom recipes they have contributed, and you will get an understanding of how you make such a magic potion, if it's not already in the system.
Ā
Michelle:
All right. Website there for everyone, and then once you log in you should see this page. I'll give you just a minute or two more and then we'll do the next step. Don't click anything yet. We're trying to see this webpage. Can I get a TA in the front, please?
Ā
Arno:
Which lab are we going to click on? I think we're at that point now.
Ā
Michelle:
Once you've successfully logged in, what we're going to be doing today is this session: David's Driverless AI 1.8.0. You really want to click on this session. If you click on any of the other ones it will take about 10 minutes to start up. This one will start up automatically, and you will be happier.
Ā
Arno:
Who was in David's earlier session? A lot of people.
Ā
Michelle:
If you were in an earlier session, that's a really good point. If you were here earlier today, after you click view details you can go ahead and end your lab and start another one. We have enough, and that will give you the full amount of time, since we're past two hours at this point.
Yeah. These instances will be on for two hours, and at the end of two hours they're going to turn off. Anything we do goes completely away. We can always start a new session, which is what we're going to do for those of you who were here earlier today. But we won't have what you built last time. Tomorrow you can come back and do it again if you want. All right. Has anyone not yet clicked on this lab? Are we all here? Okay, we'll give you a couple more minutes. I see some hands.
Ā
Arno:
One more side note is that these instances are very cheap, right? They're Amazon throwaway instances with probably four virtual cores or something, or eight virtual cores. Those are not necessarily the fastest systems to run Driverless experiments, so if you think that it's not that fast, it's probably because the system is very slow. What you really want is SSD hard drives and many-core CPUs: multi-core, multi-chip, multi-GPU, scale up basically. A fast, single server. That's usually the best. If you have eight GPUs on it you can do tens of PyTorch NLP models on steroids. If you don't have GPUs you can still fit LightGBM, that's no problem, but anything NLP will become really slow unless you use the statistical methods that come out of the box. But any [inaudible] will be slow. Those you have to enable in the product. Once you enable them you really want to make sure you have GPUs, otherwise you'll wait 100X longer.
Ā
Michelle:
All right. Hello to everyone that just walked in and joined us. What we're going to do first, before we jump into an overview of recipes, is go to Aquarium.h2o.ai and make an account, and once you've made an account you're going to click view details on the lab that says David's Driverless AI 1.8.0. Looks good. All right.
After you click view details there's a place where you can start your lab, and that's what we'll be doing in about 20 minutes, after we start with an overview and explaining recipes and all of that. Is there anyone that has not yet made an account or found this lab? Raise your hand now or forever be lost. Okay, cool. We can go ahead and get started then.
What we're going to talk about today in the Make Your Own AI agenda: we're going to talk about what custom recipes are, what BYOR is, and how it fits into Driverless AI. Then we'll actually talk about what recipes are. We'll then together walk through a tutorial of how to use custom recipes. We also have a tutorial on how to write custom recipes, which we can talk a little bit about today but probably won't fit into the session. Then finally we're going to have some of our experts on the stage to talk about how to write a recipe and give some examples of them.
At this point, Arno, do you want to give us our intro to recipes? If not, I can keep going. You just mentioned it.
Ā
Arno:
I think you can do it just fine.
Ā
Michelle:
Okay.
Ā
Arno:
I'll be here for questions.
Ā
Michelle:
Awesome, okay. As you've probably heard a lot today, Driverless AI has many key features. This is a platform for automated machine learning that can run in any major cloud or on-prem. We do automatic feature engineering. We build visualizations for exploratory data analysis. We have machine learning interpretability, so that you're not building a black box but a model that you can really understand and explain to your business users. The types of use cases we're doing are supervised machine learning, so this is things like classification: will someone churn or not? Or maybe multiclass: which group does this customer fit into? Then we also do regression and forecasting, so predicting amounts, or understanding what sales are going to be in every store for every week for the next three months. In addition to these use cases we can handle text data, and we can also take that final model, the MOJO, and deploy it just about anywhere.
What we're really going to be focusing on today, since at this point we're assuming that you've seen and probably used Driverless before, is bring your own recipes. This is the extensibility where you can add in your own feature engineering, your own models, or your own KPIs or scorers into Driverless AI, to help it shape and fit your use cases and data even better.
Driverless AI is an extensible platform that works on many use cases. It was built across industries, not for one specific type of industry or use case, and there are many different types of use cases that we're helpful with. Some of them are here.
What recipes really help with is letting you add your domain expertise, which might already be written as models or Python scripts, into Driverless. So you can add in your custom recipes to be tested alongside what we're doing as well. So we have our highly tested, robust algorithms that work across industries, along with your specific business knowledge of your domain.
This is the standard workflow for Driverless AI. We start by bringing in a tabular data set. This can come from basically any SQL data warehouse, or maybe your data is in the cloud, and we bring in the data set. We can do exploratory data analysis using AutoViz. This gives us the common statistical visualizations on every column and combination of columns in our data sets, so we can understand that our data is as clean as we're expecting and that there's nothing suspicious in it.
The meat of Driverless is this automatic model building, where we're building new features on your data with automatic feature engineering, testing many different algorithms, and then tuning those algorithms. This is done in a genetic algorithm where, over many iterations, different models compete against each other and the best ones win.
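As a toy illustration of that selection loop, here the "models" are just numbers standing in for scored pipelines; this is the general idea only, not the actual Driverless AI implementation:

```python
import random

def evolve(score, population, generations=10, seed=1):
    """Toy genetic loop: keep the best half, mutate survivors to refill, repeat."""
    rng = random.Random(seed)
    for _ in range(generations):
        population.sort(key=score, reverse=True)        # best models first
        survivors = population[: len(population) // 2]  # winners of this iteration
        children = [p + rng.gauss(0, 0.1) for p in survivors]  # mutated offspring
        population = survivors + children
    return max(population, key=score)

# "Models" are just numbers; the best model is the one closest to 3.0.
best = evolve(lambda x: -abs(x - 3.0), [0.0, 1.0, 2.0, 4.0, 5.0, 6.0])
print(best)
```

Because the best individual always survives each round, the winner can only improve over the iterations.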
Finally, once you have your final model, we have a couple things to help you understand that the model is robust. First, we have model documentation, which is governance that allows you to understand what data is in the model, what features were used, what worked, what didn't work, so you can really understand on the back end the technical experiment that went on. We also have machine learning interpretability, so you can understand how the model makes decisions, which features are most important, which features are important for each individual person in your data, and so forth. Then all of this goes into an automatic scoring pipeline which can deploy to Java, Python, C++, R, and this can easily fit into a REST server, a Java API, wherever you're deploying models today.
What we're going to be talking about is adding in the bring your own recipes. It sits inside step three, where out of the box we have dozens of feature engineering transformers that do everything from extracting the month from a date, because maybe seasonality is important, to truncated SVD, which is sort of an unsupervised clustering to understand how different columns relate together. Now you can add in your own feature transformations. Maybe there's a common transformation you have to do on every data set you have that relates to your domain. Or maybe there's a specific algorithm that works for your use case that you want to add in that we don't have. You can also add in your own scorer: maybe you don't want to optimize for F1, maybe you know exactly how much a true positive is worth, and how much a false negative and a false positive cost, so you can optimize for dollars instead of the generic F1.
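As a sketch of that last idea, a dollar-based scorer just prices the confusion counts instead of computing F1; the dollar amounts below are invented for illustration, not anything from Driverless AI:

```python
def dollar_score(y_true, y_pred, tp_value=50.0, fp_cost=5.0, fn_cost=100.0):
    """Score predictions in dollars instead of a generic metric like F1.

    Assumed (illustrative) economics: a caught churner is worth $50 of
    retained revenue, a false alarm wastes $5 of outreach, and a missed
    churner loses $100. Higher is better.
    """
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            total += tp_value   # true positive: churner we caught
        elif truth == 0 and pred == 1:
            total -= fp_cost    # false positive: needless retention offer
        elif truth == 1 and pred == 0:
            total -= fn_cost    # false negative: churner we missed
    return total

# Example: 3 churners, we catch 2, miss 1, and raise 1 false alarm
print(dollar_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))  # 2*50 - 100 - 5 = -5.0
```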
We've been talking about them a lot, but exactly what is a recipe? A recipe is a part of the machine learning pipeline that helps us build models to solve a business problem. We have four main types of recipes we're using in Driverless AI. We have transformations, which are done on the original data to create new columns. Some of this might be to clean up the data: maybe there's a certain way you want to handle a null, or maybe you need to take a date and change the timestamp somehow. But it's making new features.
We also have the ability to add in data sets, and this can be done in two ways. One, we have a new data recipe where you can use Python code to upload data directly into Driverless AI. We can also use data sets inside of transformations. For example, we have an example open source recipe which can take a ZIP code and look up the population density, school districts, and things related to ZIP codes. So it brings in new data about different ZIP codes in the US.
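The core of that ZIP-code idea is a lookup join against a reference table carried by the transformer. A toy sketch, where the ZIP codes and density values are invented for illustration:

```python
# Toy lookup table standing in for a real ZIP-code reference data set;
# the values here are invented for illustration.
ZIP_DENSITY = {"10001": 45000, "94105": 12000, "59715": 30}

def enrich_with_density(zips, default=-1):
    """Map each ZIP code to a population-density feature.

    Unknown ZIPs get a sentinel value so the model can still use the row.
    """
    return [ZIP_DENSITY.get(z, default) for z in zips]

print(enrich_with_density(["10001", "00000"]))  # [45000, -1]
```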
Another type of recipe is the algorithm that we're actually modeling with. In Driverless we have things like LightGBM or XGBoost, but maybe you have a custom loss function you want to use with those, or maybe you just really, really like scikit-learn's Extra Trees and you want to use that instead. Now you can add in those models. Then again we have the scorers or KPIs, so that you can optimize your problems for exactly what you want.
In general, we call all of these BYOR: bring your own recipes. You can add your domain expertise into Driverless.
Okay, but why do you care about recipes? What recipes do is give flexibility and customization to Driverless AI. We've built Driverless AI to be very robust across many different use cases, but again, your domain expertise is really important in your data. Now you can add that into Driverless as recipes, and we can test your transformations built on your domain knowledge along with our generic algorithms.
We also have, I think at this point, over 115 recipes that we've built and open-sourced on our public repo, and we'll look at that today. Many of these were curated by our Kaggle Grandmasters or Arno, so that you have the best of the best as examples for how to write yours, but also to add into your experiments. If you really want to use, say, ARIMA or Facebook Prophet for your time series experiments, we have open-sourced those and you can add them into Driverless AI. All right.
Then recipes are really easy to use, and we'll be talking about this today. We'll all build an experiment today where we'll add in some custom recipes and see how they're used inside of our experiment. We do treat these as first-class citizens. So recipes you add in will be used just the way the out-of-the-box recipes are used in Driverless; they're not treated any differently, and they'll go through the same competition that the recipes that come with Driverless do.
At this point we're going to actually get started with our tutorial. I'll be walking you through the tutorial today, but if you ever want to come back and do it again, it's going to be on our public tutorial page, and the one we'll be walking through today is Get Started with Open Source Custom Recipes. If you want to have this tab open to follow along on your own browser today, you can. If not, I will just walk you through the experiments. All right.
We're going to jump back to Aquarium. At this point I'm going to have everyone go to Aquarium, and we're going to click on that David's Driverless AI, view details, and we will start our lab. I'll give you just a couple minutes. If you have any problems with this, please raise your hand. We have lots of people around the room that can help you get going. Then once you hit start lab there will be a link at the bottom, and this is the link that you're going to click; it will take you to your Driverless AI instance. We'll give just a couple minutes for everyone to get to this page, which is going to load slow for me because live demo. There we go. This will be the page you see. When we all get here we'll get started.
Pardon? Yeah. Here's the link.
Ā
Arno:
You can also find it from the H2O AI website under docs and then tutorials.
Ā
Speaker 3:
Arno, can you change the color scheme of Driverless AI?
Ā
Arno:
The question was, can you change the font or color scheme of Driverless AI? Yes, you can. You can invert it if you want black on white instead of white on black or something, that actually works too, but there is not really a full palette where you can choose arbitrary skins.
Ā
Speaker 3:
Font?
Ā
Arno:
Not really. So you would have to… you can talk to our designers, but yes. The original design was very futuristic two years ago and now we kind of live with it. I'm sure we can make a custom version of the skin. It's only a few settings: the colors and the font. Okay.
Ā
Michelle:
All right. Please raise your hand if you have reached this page. I have. Then opposite, if you have not reached this page. We did it. Good job, guys.
To log in, the username and the password are both going to be training. I will zoom in so that's actually viewable. I'll go ahead and sign in. All right.
In this instance we have many preloaded data sets. Question? Can I have a friend in the front? Thank you. Okay. All right.
On this instance we have many preloaded data sets and many pre-run experiments, but we're going to actually load in a new data set together. I'm going to have everyone click on the add data set button, and we're going to click on file system. All right. There are lots of options when you click file system. There we go. You should see this, and then we're all going to click on data, and it's going to load super slowly because you're all watching my screen. Okay. I'm going to… there we go. Got it. We're going to click on data. We're going to click on splunk, and we're going to choose churn.
Once you've selected churn you can go ahead and click import selection, and that will load in the data set and also compute some summary statistics so we can understand what's in the data set. That's data, splunk, churn. All right.
The first thing, many of us probably did this earlier in the hands-on training, but we're going to really quickly look at the data set so we know what we're modeling today, because that's pretty important. You're going to click on the churn data set and we're going to click on details. All right. I like to start with the data set rows. It's kind of readable up here.
What we're looking at here is each row is a person who is a customer of a telco. We have information about how often they speak during the day, how much they're charged in the morning or the evening, if they do any international calling, and how often they've called customer service because they had some sort of problem. Then what we're going to be predicting, the column on the very end, is churn. Some of these people have left the company and some have not.
I'm going to click back on data set overview, and here we can see every column in our data set. I'm going to search for my churn column, and here I can see that it's a little bit of an imbalanced data set: 14% of my customers have churned. What we're going to do today is build some models to see if we can understand which customers are going to churn and which ones won't, so that in the future, as new customers come in each month, we can see who's likely to leave us. All right.
We're going to start with no custom recipes and just run an experiment. To do that we're going to click on the data set's actions and go to predict. All right. The first thing I'm going to do is click display name. I'm going to name my model. This is going to be our Baseline Model. You can name it whatever you want, but I'm going to call it baseline because we're not going to add in any custom recipes. We're going to just let Driverless do its own thing, and then we'll add in some custom recipes and compare.
Remember that Driverless is supervised machine learning, so the only thing you have to do is tell it what you want to predict. What we want to predict is churn. Now we have many default settings that have been set for us. Is everyone at this page? Anyone need me to slow down a little bit? All right. I'm going to take that as a safe…
Ā
Arno:
You'll notice we switch to log loss instead of AUC out of the box, because we know it's imbalanced; that's one of the freedoms we take right now, but you can override that with your custom scorer, or you can pick a different scorer yourself. It's just our defaults, and those sometimes change from month to month without us documenting everything. Apologies if it's not always obvious what happens, but we do our best to make the models better out of the box.
Ā
Michelle:
Perfect. Yeah?
Ā
Speaker 4:
[inaudible]
Ā
Michelle:
Great question. There are two ways that we split the data in experiments. One of them is while we're building many models, because that's what an experiment does; we build hundreds of models. That's automatically going to do cross-validation. We'll actually see here in the "What's happening?" panel. There we go. This is going to tell me that my experiment is going to do three-fold cross-validation. But I often want a holdout test data set at the end that wasn't used at all in the experiment, so I can see that my model is actually robust. That part's not done automatically, but we can click on data sets, and when we click on a data set there's an option to split. Here we can do either a stratified split, if it's something like binary classification, or a time split if it's forecasting. Yeah.
Today we're doing kind of a small data set and we're focusing on recipes, so I'm not going to split it, but if we were doing a real business case we totally would. Yeah. The question was, does Driverless automatically split data?
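For intuition, a stratified split can be sketched in a few lines: sample within each class separately so the holdout set keeps roughly the same churn rate as the training data. This is just the general idea, not the Driverless AI implementation:

```python
import random

def stratified_split(labels, test_frac=0.2, seed=42):
    """Split row indices into train/test sets while preserving class balance."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = int(len(idx) * test_frac)  # take test_frac of EACH class
        test.extend(idx[:cut])
        train.extend(idx[cut:])
    return sorted(train), sorted(test)

labels = [1] * 14 + [0] * 86  # ~14% churn, like the demo data set
train, test = stratified_split(labels)
print(len(test), sum(labels[i] for i in test))  # 19 2
```

Every churner-to-non-churner ratio in the holdout stays close to the original 14%, so the test metrics are not skewed by an unlucky random split.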
All right. I'm naming my model Baseline and we're predicting churn. All right. We're going to turn some of these settings down a little bit just for runtime, so we can run a couple experiments today. I'm going to drop my accuracy down to four and my time to one. All right. Over on this side we're going to see all the things that are going to happen. Everything here is models and feature engineering that comes out of the box. We have GLM, GBM, XGBoost, and then these are feature transformations that come with Driverless. In the second experiment we run, we'll add in some feature engineering and we'll see it show up here.
At this point we'll go ahead and click launch experiment and have everyone start running their baseline model. Driverless gives us some nice notifications at the bottom. We see that we have a little bit of an imbalanced data set. We saw that in our data details page: only 14% of our data churned, which is good for our telco. But we are given the option to turn on imbalanced models if we want to, so that's a nice new feature of Driverless. But I'm going to close my notifications, and we can see a model. We have our first model built, and in the center here we can see the features. Our top features are original features: how much someone is being charged, which isn't too shocking. People are leaving the telco because they don't like how much money they're spending. But it's nice that Driverless is seeing that and validating our super-good telco business knowledge.
We can also see a few engineered features here. We're looking at customer service calls, and we're looking at the interaction between how much people talk on the phone during the day and how they talk in the evening.
We can see several models built. Oops.
Ā
Arno:
One quick comment.
Ā
Michelle:
Yes.
Ā
Arno:
All the numbers we see here, the validation performance, are always on holdout data. It's never on the data that it was trained on. It's always on unseen data. We'll give you an estimate of how the model will do on new data, and we'll even give you an error bar; in the bottom left you always have a little error bar. That's basically the bootstrap variance, or the standard deviation of the bootstrap, meaning we do a bunch of repeated scorings on unseen data and give you the distribution of that. It's our best effort to give you the estimate of the number. Sometimes you'll see a confusion matrix with floating point values; that's also because it's the best estimate of the number. You don't need to necessarily use integers just because the data has integers. If that's our best estimate, we'll give you that. All numbers are always on unseen data.
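The bootstrap error bar Arno describes can be reproduced in miniature: rescore repeatedly on resamples of the holdout set and report the spread. A sketch, under the assumption that the score is plain accuracy:

```python
import random
import statistics

def bootstrap_score(y_true, y_pred, n_boot=200, seed=0):
    """Estimate a score and its uncertainty by resampling the holdout set.

    Returns (mean accuracy, standard deviation) across bootstrap resamples,
    which is the kind of error bar shown next to each model's score.
    """
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        correct = sum(y_true[i] == y_pred[i] for i in idx)
        scores.append(correct / n)
    return statistics.mean(scores), statistics.stdev(scores)

mean, sd = bootstrap_score([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
print(round(mean, 2), round(sd, 2))
```

The mean lands near the plain holdout accuracy (4/6 here), and the standard deviation is the error bar.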
Ā
Michelle:
Thank you. Yeah. All right. At this point hopefully everyone is at about 100%. No? We're not at 100%. Are we running the experiment?
Ā
Speaker 5:
71%
Ā
Michelle:
It's trying hard. That's almost 100. We'll round up. We're going to now all jump to the GitHub repo where we're going to get some of the recipes we'll be using today. I'm going to open up a new tab, and we're going to go to GitHub and search for h2oai driverlessai-recipes. That's kind of hard to see, isn't it? There you go.
All right, before I click, raise your hand if you have clicked here and you know the link and you're all good to go. Okay, about half. I will wait. All right. Let's try again. Who has made it to the Driverless AI recipes repo? And who has not? All right. Still waiting. I'll be patient.
Ā
Arno:
[inaudible]
Ā
Michelle:
To Driverless? Do you want me to zoom somewhere? What do you need? I went to a new tab and I went to GitHub.com. Yeah, yeah, yeah, yeah. Good point.
Ā
Arno:
In case you haven't seen the other repositories, there's the H2O repository itself, right, and then there's also datatable. That's the Python datatable repository; in Driverless we're using Python datatable as our workhorse for our data munging. That's a very good tool if you want a Pandas replacement for fast grouping, sorting, joining, file reading, file writing. It can write CSVs faster than other tools can write binary. It's really good and we use it for Driverless AI. And it's totally open source, so that's the h2oai datatable repository, but in this case we're looking at the recipes.
Ā
Michelle:
What was the question?
Ā
Speaker 6:
[inaudible]
Ā
Arno:
I was saying datatable is faster than Pandas because it's a recent development in C++, highly optimized, and some of these recipes that we're showing you will use datatable as the basic tool, basically.
Ā
Michelle:
It's in Python, as well.
Ā
Arno:
There is an R and a Python version and both of the main developers for either are working with H2O. Yeah.
Ā
Michelle:
Okay, all right. Once we've made it to the Driverless AI recipes, here on the home page the readme is very thorough. It explains basically everything I've told you today: what recipes are, how to use them, why you care. But for the part we like, just go down to the bottom and we can see the sample recipes. Actually, I'm going to have everyone scroll back up. In case you haven't used GitHub before, one of the benefits of GitHub is that a repository can have different branches. We're all running Driverless AI 1.8.0 today, but there are many versions of Driverless, so there are many versions of the recipes. What I'm going to have everyone do is click on the branch button inside of Driverless recipes and search for 1.8.0. Whenever we use recipes we want to make sure the recipes we're using have actually been tested with the version of Driverless we're using. If you're running 1.7 you'll use 1.7 recipes; if 1.8, you'll use 1.8 recipes. I know that's hard to see. Sorry. There we go.
Ā
Arno:
There's also a link down in the readme page to get to that specific branch, in case you don't know how to find it. There shouldn't be too many branches.
Ā
Michelle:
Awesome. All right. Once we have clicked on 1.8.0, we'll see that for 1.8.0 there are 107 recipes, and the latest branch has 115, so we're always adding new ones. What we're going to add first today is a feature transformer, so we're going to scroll down to the bottom, to transformers. Out of the box, Driverless has a feature called interactions that will look at mathematical operations between two columns. We actually saw that in our model. There it is. We added two columns together and that was a new feature that was important in our model. But one thing you might have noticed about this data set is we have day, evening, and night.
So there are multiple columns that it might make sense to add together. What we're going to build together is a transformer which takes three or more columns and adds them. It could also add, subtract, or multiply. Well, you'd have to think about how subtract would work, but it can add or multiply. But we're going to add in a sum transformer, so we'll add three or more columns and then we'll run that on our data set.
We're going to scroll down to numeric, because we are going to be augmenting our numeric data. This is transformers, numeric, and then we will all choose sum.
A little bit later today we'll probably walk through this code and talk about what's happening, but at this point all we're going to do is take this recipe and put it in Driverless. You get to trust me that it takes three or more columns and adds them together. To do that, we're going to click this button that says raw, and we're going to copy the URL at the top. All right. I can come back here for anyone that needs it, but the next step: we're just getting this URL, and this is a way to link to recipes that are, say, in GitHub or on the web somehow.
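For the curious, stripped of the Driverless AI plumbing (the real recipe subclasses a CustomTransformer base class), the heart of the sum recipe is just a row-wise sum over three or more numeric columns. A dependency-free sketch:

```python
def sum_transform(rows, col_indices):
    """Row-wise sum of three or more numeric columns -> one new feature.

    `rows` is a list of records; `col_indices` picks the columns to add,
    e.g. day, evening, and night minutes from the churn data set.
    """
    if len(col_indices) < 3:
        raise ValueError("this transformer expects three or more columns")
    return [sum(row[i] for i in col_indices) for row in rows]

# day, evening, night minutes for two customers
data = [(180.0, 200.0, 250.0), (120.0, 90.0, 60.0)]
print(sum_transform(data, [0, 1, 2]))  # [630.0, 270.0]
```

The resulting column (total minutes across all times of day) is then handed to the genetic algorithm to compete against the built-in transformers.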
There are two ways to upload recipes. One is you can give a link to wherever it's stored. The other is, if it's on your local machine, you can just upload it. If you're going to start building recipes, one of the things you might want to do is download this GitHub repository and then maybe change some of them or use them straight from your computer. But today we're going to just upload them from this link. We'll copy this link and then we'll go back to our Driverless AI instance. We're going to click on experiments, and we're going to have our baseline experiment here. I'm going to click the three dots on the side, and I'm going to say new model with same parameters.
If you remember, we changed the dials so that we would run really fast. So that we don't have to remember what those are, we'll just click new model with same parameters, and then we're going to get the same dials. If we had changed any expert settings, we'd get the same expert settings, too. If you're making lots of changes for your experiments, this is a nice way to not have to remember what it all is.
Yes?
Ā
Speaker 7:
[inaudible]
Ā
Michelle:
Yeah, we haven't done it yet. I promise, we're getting there. Yeah.
Ā
Speaker 7:
[inaudible]
Ā
Michelle:
Yeah. To experiments? Yes. I'm going to click on the baseline experiment. This will be named whatever you named yours; I named mine baseline. I'm going to click new model with same parameters.
Ā
Arno:
And this will use the feature brain that was saved from last time. This time the model will already be smart and use whatever it learned from the first experiment, because in the expert settings the feature brain level was not set to zero but two, which is the default. That means that if it found good features, it will retry those to start with. But obviously, if you have new transformers, like the great sum transformer Michelle just added, then maybe that guy will win; we'll see.
Ā
Michelle:
Good, all right. Does everyone see this familiar experiment screen? It should be basically identical to what we saw earlier, except the display name will be 1.Baseline. Raise your hand if you're not here. Okay, awesome. Okay.
Ā
Speaker 8:
Here's the issue with Internet Explorer.
Ā
Michelle:
Okay, okay.
Ā
Speaker 8:
It takes forever.
Ā
Michelle:
Okay. If you want to just set up a new experiment, Joe, it's 4/1/8 on the settings. Okay, all right.
The first thing I'm going to do, just so I remember what I did in this one, is change the display name from 1.Baseline to Sum Transformer, because that's what we're going to add to this one. To add recipes, which was our question, we're going to click expert settings, and we have three options at the top. The option we're going to use is load custom recipe from URL. We'll click this, and this is where we will paste our URL.
Yes. Expert settings, which is right above the score. Expert settings, so that's expert settings.
Ā
Speaker 9:
These custom recipes, are they all Python, or can they be something else?
Ā
Michelle:
At this time they are Python. The question was, do custom recipes have to be in Python? If you have something in, like, Go or R, as long as it's wrapped in Python, that's fine. Arno can answer that better. But basically there's Python involved.
Ā
Arno:
You basically upload plain source code, and if it's scikit-learn or Pandas you don't need any imports or whatever, but if you have something extraordinary, some Facebook Prophet or something, you have to be able to specify those dependencies. We can say Facebook Prophet, these versions, and then it will do an install for you on the platform while it's running. So you don't even have to restart or anything. You just upload the plugin. It definitely is Python, but you don't have to only use Python. You can call anything. You can make a REST call to Google's AutoML and say, you decide what I should do, right, roughly speaking. But then the production pipeline will be a little weird, because that also will make those REST calls, right? So we have to think about how far you want to go with it. If you just want to play around, you can do whatever. If you want to productionize it, then you have to be more conservative.
Ā
Michelle:
Awesome, all right. Let's go ahead and add in that recipe. Again, that's going to be expert settings and then the center option, load custom recipe from URL. All right. Then I will paste in my recipe and hit save.
What you'll see, it went really fast on this recipe, but there was a quick little wheel where acceptance tests were being run. When you write a model, you can say what types of problems it's for. So maybe you write a model that's only for binary classification. We will do a quick test on small data to see how it goes. We're testing your transformers, your scorers, and your models for best practices: you don't want a transformer that changes the number of rows in your data set, for example. Those safety checks are built into Driverless, and it will give you mostly helpful error messages if something fishy goes on. Good. All right.
We all added in our custom recipe, so we'll go ahead and click save. Awesome. Then I'm going to zoom in, and you'll see now that sum is in the list of features that it's going to try in my experiment. One thing, Arno, I'm going to let you talk about this one. There's a new option in expert settings for official recipes.
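As a sketch of what such a recipe looks like: the sum transformer's core logic is just a row-wise sum. The class below is illustrative, not the exact code from the repository; in a real recipe the base class comes from Driverless AI itself, so a stand-in stub is defined here purely to make the sketch self-contained.

```python
import pandas as pd

class CustomTransformer:
    # Stand-in for the real Driverless AI base class, which the recipe
    # would import from the platform; defined here only so this runs alone.
    pass

class SumTransformer(CustomTransformer):
    """Adds a feature that is the row-wise sum of the input columns."""

    def fit_transform(self, X, y=None):
        # Transformers must not change the number of rows; that is one of
        # the acceptance tests Driverless AI runs on upload.
        return self.transform(X)

    def transform(self, X):
        return X.sum(axis=1)

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
out = SumTransformer().fit_transform(df)
print(out.tolist())  # [9, 12]
```

The key invariant the acceptance tests enforce is visible here: the output has exactly one value per input row.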
Ā
Arno:
Thatās just a link to GitHub.
Ā
Michelle:
It doesn't? I thought it loaded all of them.
Ā
Arno:
We could have used this to go to that website directly, to the right branch.
Ā
Michelle:
Got it.
Ā
Arno:
It's a shortcut.
Ā
Michelle:
So you don't have to know what version you're in. Perfect. We did it the hard way, but there's an easy way. The fun thing about Driverless is there are new features all the time. It's always getting better.
Ā
Arno:
One thing quickly to show, maybe: under recipes there's a pull-down where you can choose which transformers to use.
Ā
Michelle:
Yes.
Ā
Arno:
Just in case you want to make sure that that sum is enabled, it will actually show up yellow there as one of the enabled ones. You can disable specific ones, and in the next version you will also be able to specify which column gets which transformers. Sometimes you don't want age to be transformed, let's say. You want it to be purely numerical, not categorical. Then you can say, do not do any binning on it, for example. You can specify per column what transformers to allow, but not in this version.
Ā
Michelle:
Awesome, great. Again, the location for that: if you go in expert settings to recipes, there in transformers, models, and scorers you can turn on or off any recipe, including custom ones we just uploaded. We just uploaded the sum transformer, so it's available on the system now. Maybe for fun we want to try an experiment where we only use the recipe that we just built, to see how it does. We probably wouldn't do that for sum, but with your business domain knowledge you might want to test only your model or only your transformation, and you could do that.
Ā
Arno:
There's the Original transformer and then there are some CatOriginal ones, so for categorical and numeric columns you can say, just use them as is, don't change them. But then the model has to be able to handle categoricals natively, right? Which LightGBM can do but [inaudible] cannot, so we will only accept what makes sense. We are smart enough to not let you screw up our models, but if you have your own models then you have a Boolean that says, I accept non-numeric input, true or false. By default it says no, it has to be all numeric. But if you have a factorization machine or an FTRL, or something with hashing that takes any input, any strings, and turns it into a vector inside, then you can say yes, I can do non-numeric data. That's it.
So there are all these configuration options, and in GitHub, as hopefully we'll see later, there are some templates that show the source code of actual custom transformers. It's all documented; all the fields that you can set in Python are documented. Let us know if it's not clear or if you have questions, but we did our best to explain as much as we can. This is only the very simplest example. And maybe we can look at the source code for that sum transformer while it's running.
Ā
Michelle:
Absolutely. All right. For time today, so we can jump into some of the source code soon, we're going to go ahead and add in a scorer and a model now so we can see how that works, and then we'll run one experiment that has three different custom recipes added in. I'm going to have everyone jump back to their GitHub tab, and you'll probably see the raw link that you copied. You can just click back and that will take you back here.
This time I'm going to go back to recipes because I don't want any more transformers. We're good with our transformers. We will go ahead and choose a scorer. This time, instead of scrolling down the readme, I'm going to jump straight to the type of scorer I want. We're all going to click scores, and since we're doing a binary classification problem we don't need any regression scorers, we just need classification scorers. We have this all sorted so you can find what you're looking for. Then again, we want binary classification problems.
We have a couple of custom recipes you can choose from. The one that we'll use together today is false discovery rate. This is basically type one error, or the opposite of precision: of the things we flag, how often are we flagging them wrong. We'll click on false discovery rate, FDR. The process is the same: click raw. Raise your hand if you haven't made it to this place in the repo. Awesome, we're doing good. We'll click raw, we'll copy the link, and we'll jump back to Driverless. All right.
In the expert settings, same exact step: Load Custom Recipe From URL. We will paste our recipe, it will run some acceptance tests that go really fast, and we'll hit save. Now if we want to use this new scorer, we're going to click the scorer button, which currently says log loss, and we'll see that FDR is there. That's our false discovery rate that we just added in. This model won't directly compare to the others, because we're switching scoring metrics, but we can optimize for basically any business metric or KPI that you need in your business. All right. Then, was everyone able to add FDR? Roosevelt's in? All right, cool.
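The metric itself is easy to sketch. A real scorer recipe wraps this in a class from the recipes repository; the function name and the 0.5 threshold below are illustrative assumptions, showing only the arithmetic: FDR = FP / (FP + TP), i.e. 1 minus precision.

```python
import numpy as np

def false_discovery_rate(actual, predicted, threshold=0.5):
    """Of everything we flagged positive, what fraction was actually negative?"""
    actual = np.asarray(actual).astype(int)
    flagged = (np.asarray(predicted) >= threshold).astype(int)
    fp = int(np.sum((flagged == 1) & (actual == 0)))  # flagged but negative
    tp = int(np.sum((flagged == 1) & (actual == 1)))  # flagged and positive
    return fp / (fp + tp) if (fp + tp) > 0 else 0.0

# Two rows flagged: one true positive, one false positive -> FDR = 0.5
score = false_discovery_rate([0, 1, 1, 0], [0.9, 0.8, 0.2, 0.1])
print(score)  # 0.5
```

Because Driverless AI minimizes or maximizes whatever scorer you give it, optimizing FDR directly targets the "how often do we flag things wrong" business question mentioned above.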
We'll go back to the GitHub and click the back link. Go back to recipes, and this time we're going to choose a model. We're all going to click on models. For the types of models: we have some time series models, which are going to be things like ARIMA or Facebook's Prophet. We have some NLP models so that you can do modeling specifically related to text. And then we have custom loss, which is changing the loss function for things like XGBoost. But we're going to go into algorithms and just add in a completely new algorithm.
There are several that we could choose from. Maybe you really like CatBoost, or you've seen that KNN works for your use case. But what we're going to do today is Extra Trees. This comes out of scikit-learn; it's just using scikit-learn's Extra Trees inside of Driverless AI. So we'll click on that one and get our URL link. That's Extra Trees.
I'm going to go back to Driverless, into my expert settings, click Load Custom Recipe, and add in my Extra Trees. This one will probably take a little bit longer, and we can actually see it testing: it's checking that Extra Trees works for regression, it's making sure it works for binary classification, whether the binary target is a number or a category. We're just going through and making sure that the recipe is actually going to work and not crash in the middle of an experiment. All right. Once the acceptance tests have passed we'll click save, and then we can again zoom into "what do these settings mean," and there's our good pal, Extra Trees.
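Under the hood, the recipe leans on scikit-learn. A rough sketch of the kind of thing it wraps, with the data set and parameters made up for illustration, not taken from the actual recipe:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# A synthetic binary classification problem standing in for the session's data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The recipe's job is essentially to expose fit/predict of this estimator
# so Driverless AI can call it like any built-in model.
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # class-1 probabilities for scoring
acc = model.score(X_te, y_te)
print(round(acc, 2))
```

The acceptance tests described above are doing roughly this on small data for each supported problem type before the recipe is accepted.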
Again, like Arno was saying, if we wanted we could then go into expert settings and maybe turn off GLM or LightGBM because we only want to test Extra Trees. Or maybe we decide, I uploaded this recipe but I changed my mind, I actually don't want it anymore. Then we can go back into the recipes and unselect it. All right.
At this point, I'm going to change my display name because it's not accurate anymore, and I'm just going to say that this is BYOR, because we have a custom scorer, a custom algorithm, and custom feature engineering. Just to be super clear, we could absolutely add more than one of each of these. If you wanted, you could add everything from the GitHub, but it would probably take quite a long time to run.
Ā
Arno:
Yes, you can actually provide the whole GitHub root directory, so just github.com/h2oai/driverlessai-recipes, and post that there, and it will upload everything, which will give you a lot of choice, but it will also take maybe 20 minutes or so to test everything.
Ā
Michelle:
Yeah, all right. Raise your hand if you've made it to this page with all your custom recipes. Awesome, looking good. Now we'll go ahead and click launch experiment.
As this runs we'll see a couple of different algorithms popping up here, including our Extra Trees. I don't need the notes right now. We'll probably see our sum feature engineering pop up, and then all the scores we see will actually be false discovery rate. So it's going to be different from our other models, which used log loss; we'll be looking at the false discovery rate there. While we give that a few minutes to run … it's actually starting. There's our false discovery rate so far, and currently we have an XGBoost GBM, but some of these models will be Extra Trees. All right. Here it goes. We're tuning. At the top it just said tuning Extra Trees, so it should be there soon.
There's our Extra Trees model. It's currently not doing as well as plain XGBoost GBM, and that's sometimes the case. Maybe the recipes we add in aren't going to do as well as what's already in Driverless, especially if they're not domain specific and we're just trying another algorithm. But this would be a way to test it out and see how it works. It looks like we're almost at our final model and XGBoost is still winning. Yeah.
At this point we're going to jump back to the GitHub and talk a little bit about how you start writing recipes, and we'll do that a little quick because we're close on time today. But then we can take some questions, too. All right.
Okay. In addition to consuming recipes, which is what we all did today, taking some of the 115 that are publicly available and using them, you might want to start writing your own. Take your business knowledge, which maybe is already written in Python, write it so that it can run in Driverless, and then start testing that as well. We wanted to go over the process quickly. When writing recipes, it's usually good to start with a sample data set in something like PyCharm, or maybe you really like Jupyter Notebooks, and build the recipe outside of Driverless. Just test that the algorithm you want works.
In my example, I started with my sum transformer, which takes three or more columns, so I just started with a Pandas dataframe that had seven columns and then looked at, how do I add those together? That's pretty straightforward in Pandas. It's dataframe.sum, but for more advanced recipes you might have to do a little bit more debugging.
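That prototyping step, outside Driverless AI, might look like the snippet below; the seven-column frame is invented for illustration:

```python
import pandas as pd

# Prototype the transformer logic in plain Pandas first: a seven-column
# frame and a row-wise sum, before porting it into a recipe class.
df = pd.DataFrame({f"col{i}": range(i, i + 3) for i in range(7)})
row_sums = df.sum(axis=1)  # dataframe.sum across the columns
print(row_sums.tolist())  # [21, 28, 35]
```

Once this behaves as expected on the sample frame, the same two lines of logic move into the template's transform method.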
Next, we recommend downloading the Driverless AI recipes repository, which has all the examples. Then you get all 115, so some of those might be similar to what you're trying to do, and you can start to understand how recipes work and see some example code there.
We also have the recipe templates, which we'll look at really quickly today. There's basically a template each for models, feature engineering, and scorers. It gives you every possible option and explains them pretty well, with good comments on what they do, why you'd include them, and how to write them. Probably the best approach, I would say: write your code so that it works in plain Python first, and then go to a template and add it in.
What does it take to actually write a custom recipe? The only thing you need is a text editor, somewhere to write Python. I would recommend something like PyCharm. PyCharm is really nice because, if you download the whole Driverless AI recipes repository, you can have all the recipes on the side. This is the full repository, so if I am writing a model I can open every model and go look at how they work. That can be really helpful. You can also debug in PyCharm. All right.
That's really the only thing that you need, other than a Driverless AI instance to actually run it on. Bring Your Own Recipe is for 1.7 and later, so recipes won't work on 1.6 or 1.5. The 1.7 series and the 1.8 series, which is the latest release, are where you can actually run your recipes. To test your code locally it's Python 3.6, plus datatable and the Driverless AI Python client. Then to write a recipe you need to know a little bit of Python, so that's sort of the skillset required … yes?
Ā
Speaker 10:
Will Python 2.7 work, or only 3.6?
Ā
Michelle:
Just 3.6 at this point.
Ā
Speaker 10:
Thank you.
Ā
Michelle:
Yeah, all right. Then one more thing: how to actually test recipes. The easiest way to test whether your recipe is working is to upload it to Driverless AI, where we run the acceptance tests, and it will return an error message saying why your recipe didn't work. Maybe it returned fewer rows than it started with, which a transformation can't do. Or there's some other issue; maybe you're writing a model that uses PyTorch and it wasn't able to install it, something like that.
The next way would be to use the Driverless AI Python and R clients to automate this process. We have some example scripts on the GitHub of how to do this in Python. Basically, it's a script where you say which data set you want to use and where the transformation lives on your computer, and then it will upload the data set, upload the transformer, turn everything off but that, and return any error messages. That's a nice way to iterate quickly: you don't have to point and click and upload, you can just make changes and rerun.
Then, if you want to test locally, the recipes extend a base class, which is what we'll look at, and there's a dummy version of it, so you can actually run your recipe on your local machine without Driverless. Those are the three main ways of testing.
The last thing before we jump into showing you where the templates are: what happens if you get stuck writing custom recipes? There are a couple of things you can do. First, error messages and stack traces from Driverless AI. If you look in the logs, or if your recipe doesn't pass an acceptance test, it will give you the full stack trace. Sometimes that helps with pinpointing what's going on. But maybe you get that full log and it doesn't mean anything to you; what next? You can take the logs. First, look at the FAQ in the templates to see if there are issues similar to yours; maybe that gives you a direction.
Actually, it's not coming soon anymore; there is a tutorial now. Today we talked through the tutorial on how to use recipes, but the second tutorial I'll point you to is how to actually write them, and it writes the three recipes we used today: the sum transformer, FDR, and Extra Trees. So you can check whether there's a step you're missing. Then finally, we have a Slack community channel, so you can always join, and our experts are on the channel to answer questions. You can post logs and we can tell you what's going on there.
Ā
Arno:
Just one last comment. You can use datatable if you're interested in learning something new, or you can use Pandas or NumPy. It can do to_pandas or to_numpy, and then you are in old-school country where everything is as expected. But if you want to learn how datatable works and see how fast it is, you can just stay in datatable and learn that new API. One of our Kaggle Grandmasters has ported a lot of code to datatable, so most of our examples will actually be in datatable, because that's the fastest way to handle scalable big data. You don't want to call to_pandas too often unless you have to, because datatable is multithreaded and Pandas is not.
Ā
Michelle:
Perfect, great. One more thing before we jump on: inside of models, transformers, and scorers there is a template file. For example, the model template file is the base class. It has a lot of commented code, and it's everything that could possibly be in a model, every option. There are things like saying whether our model is allowed to do regression, binary classification, or multiclass; you can turn those on and off. It tells you how to write the fit function and the predict function. We're not going to walk through this today for time, but I'd absolutely check out the templates. They're a really good place to start: you can just copy a template and then change only the things you need to write your recipe.
At this point, we've talked about some basic transformers; we added some columns. But in addition, we can write custom recipes for very specific verticalized industries. We're going to talk about some of that for a few minutes. Yes, let me get the slides for you. It's going to be just a second. Do you want to chat while I'm getting them? Okay. Yes, I got them.
Speaker 11:
I think part of the thing is that folks have been here for a couple of hours. You've learned about the basics of Driverless, then what recipes are and how to use them, and so what we want to show next is the art of the possible: more advanced recipes and what can be done. So, having learned a little bit about Driverless and about recipes, how far can you go with something like this?
Speaker 12:
Hey. I'm guessing you guys can hear me. I'll take you through three different use cases that I worked on where we've used custom recipes. One, of course, is anti-money laundering, another is malicious domains, and the other is something called DDoS detection, distributed denial of service attacks, in the field of cyber security. The first one is in the area of money laundering and fraud, and the others come up in the field of cyber security. What Michelle showcased is one specific transformer that you used to transform the data in a certain way so that it fits the model. What we've done is taken use cases where we've built extensive sets of features, hundreds of features, designed for specific use cases, and the first one I'm talking about is the anti-money laundering one.
What we very simply tried to do is reduce the false positives that exist in the system. Essentially we have a bunch of alerts which say that they are money laundering. Not all of them actually are, and this is a global problem: about 75% to 99% of these alerts are false positives. It's not a big deal to have these false positive alerts, but it does cost you. The manpower required to work through them is quite extensive, which is why reducing them helps a lot.
For that we built an entire module, a package, which purely solves anti-money laundering, and this package has different custom transformers that are very specific to this behavior. Using that, we've actually reduced the false positive rate by over 94%, and this is essentially how you can think about using custom recipes as well. Your data scientists, and most of you here are data scientists yourselves, can start building features as part of Driverless AI that stack into a larger, comprehensive set of features or transformers, which eventually becomes an entire package for a particular solution, if you're looking at it that way.
Before the second example … by the way, please let me know if you guys have any questions.
The second example I wanted to talk to you about is malicious domains. This is a very big problem that exists globally in the field of cyber security. There are these really large numbers of dynamic domains that come up, which attackers use as a means of phishing you, of launching an attack on you or your organization, any of these things. These are usually vectors for DDoS as well. One of the things the models need to do is detect that a domain that has come to you, through your email, on a website, anywhere, is actually malicious. There is literally probably 300 milliseconds or so that the system has, from the time you click the link, to verify whether it's malicious or not and come back to you saying, okay, this link is good or this link is bad. Essentially, that's the whole process.
The model that we built looks at different aspects of the domain: the health of the domain, the longevity. When was it registered? How was it registered? What's the formation of the domain? What kind of location is it at? It looks at many different features and then it's able to classify whether the domain is good or bad. That helps feed the model when you and your organization go ahead and click a simple link. The simple process of clicking a link on a webpage, early in the morning while you're having your coffee and trying to see what's in the news, goes through some of these models. And as you can see, we really do get pretty good performance in most of these custom models, and most of them use Driverless AI to build the model and deploy it.
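As a hedged illustration of the "formation of the domain" idea: a few simple lexical features one might compute. The function name and feature choices are assumptions for illustration, not the actual solution's code; signals like registration age and hosting location would need external data sources on top of this.

```python
import math
from collections import Counter

def domain_features(domain: str) -> dict:
    """Lexical features of a domain name that a classifier might consume."""
    name = domain.split(".")[0]  # drop the TLD for the lexical features
    counts = Counter(name)
    # Shannon entropy of the character distribution: high for
    # random-looking, algorithmically generated names, lower for words.
    entropy = -sum((c / len(name)) * math.log2(c / len(name))
                   for c in counts.values())
    return {
        "length": len(name),
        "digit_ratio": sum(ch.isdigit() for ch in name) / len(name),
        "entropy": entropy,
    }

print(domain_features("x9k2q7zt81.example"))
print(domain_features("google.com"))
```

Features like these would sit in a custom transformer recipe, with the good/bad label coming from threat-intelligence feeds.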
Finally, the third one I wanted to talk about as part of custom solution development is DDoS detection. DDoS, for people who don't know, is distributed denial of service: a large number of resources around the world, usually botnets, are marshaled to attack an organizational resource. It could be pure revenge, it could be for monetary purposes, it could be to side-channel another attack; a lot of organizations face DDoS attacks while they're being attacked through another channel. This is a very common mechanism and vector, which is why the model we've built tries to identify early warnings for any kind of DDoS happening around the world. This model, too, does a phenomenal job, but the thing about all three of these use cases, or solutions, is that each has a specific set of features that we engineer, which forms a package for that solution.
These packages contain very specific transformers, many more than the one that was showcased to you. They also contain custom scorers, and sometimes custom models as well, all for specific purposes. You can look at these three as applied examples of how you could use recipes to eventually build a model and then apply it to problems in whatever space you work in.
Having said that, if you guys have any questions, please let me know. If not I'll … yes, please.
Ā
Speaker 13:
[inaudible] so what amount of retraining did you have to do to get to 90% accuracy?
Ā
Speaker 12:
Could you let me know which one you're specifically asking about, or just any one of them?
Ā
Speaker 13:
All three of them, to be honest. All of them are in the 90s in terms of model accuracy. Generally, most data scientists settle for between 75% and 80%, so what the incremental effort was is where I'm going with the whole thing.
Ā
Speaker 12:
That's actually a fantastic question, and I'll tease something else out of it. One thing is that I specialize in the field of anomalous and malicious behavior. When I go through the process these things come easy to me, and it wasn't that much retraining; I went through probably a couple of models and so on. What I wanted to draw out of that is: if you have subject matter experts in your organization, they could work along with the data scientists, or your data scientists themselves could be subject matter experts, and build a model with much less effort. In addition to that, you have Driverless AI itself, which gives you the best possible model at that end of the pipeline. That kind of helps the process. I'm guessing that answers your question. Yeah.
Ā
Speaker 14:
In DDoS, the false positives you had in the fraud detection or the money laundering, can you apply them interchangeably? Because there are false positives in DDoS also; there are favorable events happening which look like DDoS but aren't.
Ā
Speaker 12:
Yeah.
Ā
Speaker 14:
So you can apply the same false positive algorithm-
Ā
Speaker 12:
You're saying the detection of false positives in DDoS?
Ā
Speaker 14:
Yeah.
Ā
Speaker 12:
You can. You can actually do that, and that's one of the fundamental things that we try to stick to, trying to identify good behavior versus bad behavior. If you look at the fundamental essence of how DDoS happens, the timing window is very narrow when there's an attack happening, and when there is no attack there's a huge window where nothing is happening. We use a lot of features specifically as a function of time. I wouldn't want to say time series itself, but a function of time, which tends to give us more value: what is the actual behavior, and what is not?
Ā
Speaker 14:
A false positive in DDoS, and I'm pretty deep in DDoS, that's why I kind of know this, isn't really about DDoS. It isn't always an attack, it's an event.
Ā
Speaker 12:
True, yeah.
Ā
Speaker 14:
Some news occurs and then your website gets hit.
Ā
Speaker 12:
Yeah.
Ā
Speaker 14:
Thatās what I meant.
Ā
Speaker 12:
Okay. You're talking about the Slashdot effect?
Ā
Speaker 14:
Yeah.
Ā
Speaker 12:
You can use the same model for the Slashdot effect as well. It just depends on how you define the problem. You can use the same model, is what I'm saying. That's about it. Great. Thank you, guys. Thanks.
Ā
Michelle:
All right. Thanks everyone for coming to the session. If you have any questions you can always come and ask us, and I think there's another session in this room.
Ā
Speaker 12:
Yes.
Ā
Michelle:
Okay. So we'll let that happen now. Thanks.