H2O.ai Prague Meetup #5: Scalable Automatic Machine Learning in H2O
This video was recorded in Prague on November 27, 2019.
Talk 1: Scalable Automatic Machine Learning in H2O by Erin LeDell
The focus of this presentation is scalable and automatic machine learning using the H2O machine learning platform. H2O is an open-source, distributed machine learning platform designed for big data. The core machine learning algorithms of H2O are implemented in high-performance Java; however, fully-featured APIs are available in R, Python, Scala, REST/JSON, and also through a web interface. Because H2O’s algorithm implementations are distributed, the software can scale to very large datasets that may not fit into RAM on a single machine.
We will provide an overview of the methodology behind H2O’s AutoML algorithm. H2O AutoML provides an easy-to-use interface that automates data pre-processing, training and tuning a large selection of candidate models (including multiple stacked ensemble models for superior model performance), and due to the distributed nature of the H2O platform, H2O AutoML can scale to very large datasets. The result of the AutoML run is a “leaderboard” of H2O models which can be easily exported for use in production.
R and Python code examples for H2O are available on GitHub for participants to follow along on their laptops.
Erin LeDell is the Chief Machine Learning Scientist at H2O.ai, the company that produces the open-source, distributed machine learning platform, H2O. At H2O.ai, she leads the H2O AutoML project and her current research focus is automated machine learning. Before joining H2O.ai, she was the Principal Data Scientist at Wise.io (acquired by GE) and Marvin Mobile Security (acquired by Veracode), the founder of DataScientific, Inc. and a software engineer. She is also the founder of the Women in Machine Learning and Data Science (WiMLDS) organization (wimlds.org) and co-founder of R-Ladies Global (rladies.org). Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from the University of California, Berkeley and has a B.S. and M.A. in Mathematics.
Talk 2: Off-Policy Partial Feedback System Reward Estimation in Seznam.cz Web Search Engine by Pavel Prochazka
The talk will be about the off-policy evaluation of the contextual multi-armed bandit applied to the vertical search blending problem in Seznam.cz web search engine. We highlight the advantages of the counterfactual off-policy evaluation approach over conventional online A/B testing and introduce basic counterfactual methods such as inverse propensity score (IPS) reward estimator. The counterfactual approach requires properly evaluated propensities for valid off-policy evaluation. The IPS estimate quality (its variance) depends on particular propensity values that are directly related to logging policy exploration.
Pavel Prochazka is a research engineer at Seznam.cz. He focuses on information retrieval related research applications into Seznam.cz web search engine. In particular, his interests include mainly learning to rank, counterfactual analysis and query understanding. Pavel received his Ph.D. in wireless communications from the Czech Technical University in Prague, where the main research focus was on iterative detection and Bayesian inference in wireless communication systems.
Read the Full Transcript
Erin LeDell: Okay. Thanks for coming. This is the Prague AI meet up, and deep learning meet up hosted by H2O.ai, which is a company with an office in Prague. We’re also in California, which is where I’m coming from which is why you don’t hear a Czech accent in my English.
But our second biggest office is in Prague. So we have quite a big team and you probably see some people with the T-shirts on around here. Those people probably work at H2O.
So before we get started, you were promised a drawing for a prize. So there’s the Prague ML 2020 conference coming up. I’m not sure exactly what month it is, but it’s coming up in the future and everybody who registered for this meetup sometime before today, I think, I don’t remember what time the cutoff was, maybe around noon today, we collected all the names and then we’re just going to draw a winner for a free ticket to this [conference]. So first before we get started, I’m just going to do that real quick.
So we have copy pasted your names into this random name picker, and hopefully this works. We’ll try it out. So let’s get started. The winner is Meehal Stewart.
Meehal Stewart: Yup.
Erin LeDell: That’s you. Okay. Congratulations. We don’t have anything to give you right now. I’m going to take a screenshot of your name and then the organizers for the conference will send you an email for your ticket. So congratulations. Let me just record your name so I don’t forget.
All right. So, okay. The next thing I need to do, I just need to start recording, this thing here. Just make sure I’m doing this. Okay. So we should be recording this. So hopefully that will work.
So my name is Erin LeDell. I’m the Chief Machine Learning Scientist at H2O.ai, which I mentioned briefly already as a company. We have an office in Prague and also in Mountain View, California, and a few other places like New York, and we have a few smaller offices in Canada and India, and a few other countries. And what I’m going to talk about today is scalable, automatic machine learning in H2O. So all of those things we’ll touch on, the scalability, the AutoML components, and then H2O.
And so I’ll just say a little bit more about the company in case you’re not familiar. I think a lot of you, is this your first time at our meetup, if it is, could you raise your hand? Okay, cool. So probably a lot of you are new to H2O the software and also the company so I’ll just say a few things about us. So that’s the name of the company, it’s also the name of a software tool that we produce at H2O. So the company has been around since 2012 and it was started back then, if you can remember, if any of you were working in the machine learning or data science field back then, that was when everybody started talking about Big Data and Hadoop and things like that. So at that time there was no machine learning platform that would really work for big data or would be fast and scale to large data sets. So that was the goal of the company was to create a machine learning library for that.
And the software is open source. So everything that I’m talking about tonight, you can just download, however you like. You can use it through R or Python, those are the two main ways that people use the software, but you can also use it from Scala. The algorithms themselves are written in Java and then we also have a web interface, so if you don’t want to write any code, you can also just use the web interface. So yeah. So H2O.ai is the name of the company. That’s also our website, H2O.ai, and H2O is the platform.
We also have a number of other tools that we produce at H2O. Some of them are open source, some of them are not. So if you see some of the signs around here, there’s one of our other big tools, which is also an AutoML tool and is called Driverless AI. I’m not going to be talking about that tonight, but that’s just something to be aware of, we have other tools that kind of specialize in different things, and you can go to our website and learn more about the different things there. So, like I said, we’re in California.
Okay. So this is what I’m going to talk about tonight. So just first, I’m going to introduce what H2O is, what the platform is. Then I’m going to talk a little bit about what AutoML is or Automatic Machine Learning, and then I’ll show you how to use the AutoML functionality inside of H2O. And then I’ll give you some other resources that are related to H2O.
And then this URL here, if you, I don’t know if you can see it in the back, I think we have some screens in the back, so you might be able to see it, but if you click on this link, you’ll be able to find a place on the internet where we have all of our presentations from meetups. So this will bring you to a GitHub repository and if you scroll down there’ll be a folder that will say the date of today. And then it will say Prague AI meet up. And you’ll look inside there and there’ll be a PDF with the slides. So if you want to review them later, that’s where you can find them. So if you want to take a picture of that link or anything, feel free to do that. You can also ask me at the end of the talk.
Okay. So first we’ll talk about the H2O platform. So I mentioned before this was created in the time of Hadoop, when that was becoming quite popular. So what H2O is, is it’s a platform. So I would distinguish between a machine learning library and a platform: a platform would be something that also includes, sort of, everything that it needs to function. So we actually have our own data frame, a distributed data frame structure. So that would maybe be different than like a scikit-learn, which is more of a library which uses external functionality to function.
So we actually implement everything that we need to do the machine learning, including the data structures, from scratch. And all of the algorithms are implemented by us from scratch in Java. And they’ll work in multi-core on your single laptop. So if you wanted to have… So a normal laptop might have eight cores on it, it would be like eight times as fast if you run on a laptop, and then if you have data that’s bigger than can fit into RAM on a single laptop or a single computer, you can start up a cluster of multiple nodes and run it across that.
So I mentioned already that the algorithms are written in Java, but most people who use this library are data scientists and also engineers. But we found that data scientists generally like R and Python, those are the two most popular languages. So we make sure that anyone can use the library. And that’s also useful if you’re working in a team where some people like Python, some people like R, you don’t have to fight. Everybody can be using whatever interface they would like and producing the same thing at the end.
And then one of the other, or a couple of the other things that are unique about H2O is that because we are in Java, we can easily deploy these models into production. Something that is common when you’re using an R, Python library that’s maybe not designed for production uses, you have a team of data scientists who might prototype something in R, Python, and then you have another team of engineers who then reimplement what they did in Java or C++. And so you kind of have this duplication of efforts. And what we’re trying to do is make that more seamless so you’re not duplicating anything. It’s just one single system.
And then I mentioned a couple of times already that this can work on Hadoop. It can work on Spark, but if you don’t use those technologies, that doesn’t matter, you don’t have to use Hadoop or Spark. And in fact, if you don’t need to use those, it’s better not to. So you could just run it on your laptop. If you have pretty small data, or if you have very big data, you might have a bigger machine or a server or a cluster of machines.
Okay. So just a few terms that I’ll define. So this is maybe the most technical slide that I have in terms of talking about distributed computing. So we have two things. One is something that we call the H2O Cluster and people get a little bit confused about this. Sometimes we have a lot of people that think that this is like a cloud based service. So when they start up the H2O Cluster, that’s actually just running locally on your laptop or on your computer, or maybe you’re running it on Amazon EC2, or some other cloud provider, but you have the control over that. So there’s no such thing as like an H2O cloud where we’re running the tools and then giving you the results back.
So just to be clear, the H2O Cluster is a local thing that just runs on your machine. And really what it is, it’s just a Java process where everything happens. So that’s where your data is, that’s where your models are, that’s where you’re training the models. It’s just a block of memory and that’s what we call the H2O Cluster. And so the first thing that you do when you’re writing some code using H2O is you would start up the H2O Cluster and then you train some models, load data, etc.
And then the other thing that we have is something that we call the H2O Frame. And I mentioned before, when sort of defining H2O as a platform, we actually have our own distributed computing environment. So including distributed data frames. So if you’re using R or Python, you don’t have to think about, I mean, if you’re using any of the APIs, you don’t really have to think about this, it’s just, you can pretend it’s an R data frame or a Pandas data frame and it works just the same. It’s just underneath, if you’re using a cluster, some of the rows will be on one node, some of the rows will be on another node, and that’s how we distribute the data. And then all of the nodes will talk to each other when necessary and communicate as needed.
And the reason that we do that is because you might have a training set, that’s, let’s say 100 gigabytes of data. And if you want to train that on, sort of, a normal size machine, let’s say you have a computer with 64 gigs of RAM, or even like 20, 24 gigs of RAM or something like that, that data doesn’t fit into RAM. So that’s when we have to start having a cluster of computers and then some of the rows of that data will go on one machine, some will go on another, and then we revisit this whole idea of an H2O frame. So that’s why we were able to do that. And that’s why H2O is a scalable library. You just basically keep adding nodes to your H2O Cluster to make it big enough to fit the data.
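The row-wise distribution described above can be illustrated with a small sketch. This is not H2O code: the function names are hypothetical, the "nodes" are just Python lists, and the node count is a lower bound that ignores the working memory a real cluster also needs.

```python
from math import ceil


def partition_rows(rows, n_nodes):
    """Split rows into contiguous chunks, one chunk per simulated node."""
    chunk = max(1, ceil(len(rows) / n_nodes))
    return [rows[i:i + chunk] for i in range(0, len(rows), chunk)]


def distributed_mean(partitions):
    """Each node reduces its own rows to (sum, count); only these small
    partial results are combined, never the raw rows."""
    partials = [(sum(p), len(p)) for p in partitions]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


def nodes_needed(data_gb, ram_per_node_gb):
    """Lower bound on node count so the raw data fits in cluster RAM."""
    return ceil(data_gb / ram_per_node_gb)


values = list(range(1, 101))
partitions = partition_rows(values, n_nodes=4)
print([len(p) for p in partitions])   # rows held by each node
print(distributed_mean(partitions))   # same answer as a single-machine mean
print(nodes_needed(100, 64))          # the 100 GB on 64 GB machines example
```

The point of the sketch is that a column statistic only requires each node to send back a couple of numbers, which is why adding nodes lets the data grow.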
And one of the reasons that we did that is because back around 2012, 2013, Amazon EC2 became like a very cheap way to do computing. And it’s still a very cheap way, at least for CPUs, to do computing. So we’re trying to take advantage of like, what’s the cheapest way that we can do this. It’s very expensive to either buy or rent a machine that has one terabyte of RAM, but it’s very cheap to rent out 10 instances that are 64 gigs of RAM each, something like that. So the goal was to make something that could be run cheaply.
So inside of H2O, it looks like any kind of normal machine learning library, like scikit-learn. If you’re Python people in this room, you’re probably familiar with scikit-learn. So we’ll have similar functionality. So a bunch of different algorithms. So we have supervised algorithms and we have unsupervised algorithms. Some examples are gradient boosting machine, random forest, we have deep neural networks, GLM. We also have XGBoost, which is actually a third-party library that we bundle into H2O because it’s very good and we wanted it inside of H2O so we put it in there, and that’s the nice thing about open source. And a whole bunch of other algorithms. There’s a whole list of them in the user guide, which I’ll point you to later. Yeah, so a bunch of different machine learning algorithms, all of those are distributed, scalable, our own implementations with the exception of XGBoost.
And then we also like to do things, this bullet point here differs a little bit from scikit-learn. So if you’re used to that library, you have to do a lot of things manually to process the data at the beginning. So if you have categorical data, you have to convert that into numeric data somehow before you can do machine learning. We like to do all of that stuff automatically so that you don’t have to repeat these steps every time you’re writing code. So we’ll handle some basic preprocessing of the data, so if you have missing data, we’ll fill that in for you. If you need to normalize the columns, we’ll do that for you. If you need to do something with the categorical data, we’ll do that for you like one-hot-encoding or some other type of categorical coding. So we’ll do all of that automatically.
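As a rough illustration of the kind of preprocessing described above, here is a minimal sketch of mean imputation and column standardization in plain Python. The helper names are hypothetical and the rules are deliberately simple; the strategy H2O actually applies is not spelled out in the talk and may differ by algorithm.

```python
from statistics import mean, pstdev


def impute_mean(column):
    """Fill missing values (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def standardize(column):
    """Rescale a numeric column to zero mean and unit standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]   # constant column: nothing to scale
    return [(v - mu) / sigma for v in column]


ages = [20.0, None, 40.0]
filled = impute_mean(ages)
scaled = standardize(filled)
print(filled)
print(scaled)
```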
Another nice feature of H2O is we have something that’s called automatic early stopping. So what that means is that with a lot of algorithms, for example the GBM or gradient boosting machine, it’s easy to overfit those algorithms. So you train, and these are all sort of iterative algorithms, so they learn a little bit at a time, and then at a certain point you need to stop it from learning, because then it kind of starts to memorize the data and not be able to predict well on future data. So you can sort of manually do that if you start looking at the training error versus the test error, and that’s sort of just a thing that data scientists do to tune their models. We have something that can sort of automatically do that for you. So I find that to be very useful.
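The idea of stopping once the validation error stops improving can be sketched as a simple "patience" rule. This is an illustrative simplification with hypothetical names, not H2O's exact rule; in H2O the behavior is, to my knowledge, controlled by parameters such as stopping_rounds, stopping_metric and stopping_tolerance.

```python
def early_stopping_round(validation_errors, patience=3, tolerance=0.0):
    """Return (best_round, stopped_at): the iteration with the lowest
    validation error, and the iteration at which training would halt
    after `patience` consecutive rounds without improvement."""
    best_error = float("inf")
    best_round = -1
    stopped_at = -1
    rounds_without_improvement = 0
    for i, err in enumerate(validation_errors):
        stopped_at = i
        if err < best_error - tolerance:
            best_error, best_round = err, i
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                break
    return best_round, stopped_at


# Made-up validation errors: improving at first, then overfitting.
errors = [0.50, 0.40, 0.35, 0.33, 0.34, 0.36, 0.39, 0.45]
result = early_stopping_round(errors, patience=3)
print(result)
```

With these made-up numbers the best model is the one from round 3, and training halts three rounds later instead of running to the end.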
And then things related to machine learning, but aren’t necessarily algorithms, we have the ability to do cross-validation, grid search, random search. So these are all just functionality that you need to tune and to evaluate your models. And then once you’ve trained the models, we have variable importance for everything. So you can understand which of your features are most predictive and important. We have all different evaluation metrics, so you can compare models against each other and some plots and some other things like that. So generally we like to just put everything that you can think of that you need to do for machine learning inside the single library so that you’re not having to use a whole bunch of other software at the same time.
Okay. So any questions about H2O so far. We’ll also have time for questions at the end. We’ll do like a Q&A. So, okay, you can save your questions for the end too.
Okay. So now we’re just going to talk about AutoML or automatic machine learning. So it’s not a very well defined term. So when somebody says AutoML, they might be talking about a lot of different things. So what I think is important to do is just talk about what are some of the goals of AutoML and some of the features because I don’t know that we can all agree upon exactly what AutoML is as a community, so I think let’s just talk about what are some of the goals with automatic machine learning.
So I think probably the most important goal, at least in my opinion, would be training the best model with the least amount of time or effort. And so that could mean the least amount of time in terms of training, like compute time, but also the amount of time that it takes for the user or the data scientist to write all the code and to make that happen. So there’s kind of two amounts of time that we’re trying to minimize. And both are important because if your models only take an hour to train, but it takes you five or six hours to write a bunch of code to do that, then it doesn’t matter if it’s one hour or one hour and 10 minutes; the six hours of coding is actually the more annoying thing to reduce. So I think the goal is just getting the best model with the least amount of effort.
And if we make it easy, so another aspect of AutoML tools is they try to simplify the interface so that it’s very easy to use. So all you really should be having to do when you’re using an AutoML tool is point it at your data and say, “This is my training set. This is the thing that I want to predict.” And maybe you can also specify what is it that you’re trying to measure? Are you trying to maximize AUC? Are you trying to minimize mean squared error? So maybe you can specify what it is that your goal is in terms of a metric and then maybe how long do you want it to work for? So I think that pretty much covers what an interface for an AutoML tool should look like.
You could have other things that would allow you to override things or to have more advanced settings. But I think you shouldn’t have to do more than those three things. And so that also means that you don’t have all these different hyper-parameters and tuning things that you have to mess around with. And that makes it easier for people that haven’t been a data scientist for as long.
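The three inputs described above (training data, target column, and optionally a metric and a time budget) map fairly directly onto the H2O AutoML Python API. The following is a minimal sketch rather than the example shown in the talk: the wrapper function, file name and column name are hypothetical, it assumes a classification target, and it needs the h2o package and Java installed to actually run.

```python
def run_automl(csv_path, target, sort_metric="AUC", max_runtime_secs=60):
    """Train H2O AutoML on a CSV file and return the resulting leaderboard."""
    # Imported inside the function so the sketch can be loaded
    # even where the h2o package is not installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()                                 # start or connect to a local H2O cluster
    train = h2o.import_file(csv_path)          # load the data into an H2OFrame
    train[target] = train[target].asfactor()   # assumption: target is categorical

    aml = H2OAutoML(max_runtime_secs=max_runtime_secs,
                    sort_metric=sort_metric,
                    seed=1)
    aml.train(y=target, training_frame=train)  # predictors default to all other columns
    return aml.leaderboard


# Hypothetical usage (file and column names are placeholders):
# print(run_automl("train.csv", target="response", max_runtime_secs=300))
```

Everything else, such as which algorithms to try and how to tune them, is left to the tool unless you choose to override it.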
So let’s say you’ve been doing data science for like six months or a year, and maybe you know about a random forest, you know about a GLM. Maybe you’ve heard of GBMs, but you haven’t tried them out yet. It takes a long time to learn how to tune all those different algorithms, and even just to know what all the algorithms are, and it could take years to really become an expert at that. But by the time you’ve been a data scientist for a couple of months, you kind of have the general idea about what you’re doing. You’re giving it a training set and you’re trying to get a good model and you know the thing you’re trying to predict. So if that’s all you need to know, then somebody who’s very new to data science could use an AutoML tool and get really good models fairly quickly.
So, that’s one aspect, is that it opens the door to a lot more people that have less experience or expertise, but also even if you’re an expert in machine learning, it’s just going to save you a lot of time because you’re not writing so much code all the time. So one of the reasons that we decided to make this tool is because, I mean, for me personally, I was writing the same code over and over again, and like copy pasting like a bunch of things that I always do every time I start a new machine learning problem. And it just didn’t make any sense to do that. So I had kind of a technique in my head of all the things I wanted to try. And then basically we just put a wrapper function around that.
And so this is just to make it easier to, even if you know how to do all that stuff and you know how to tune all the algorithms, this should make your process faster. So if you’re a data scientist at a company or somewhere, instead of working on a single problem for two weeks, maybe you could work on three or four or five problems in that same two weeks because you’re not spending so much time writing code and evaluating models.
And the last point is just, it’s kind of a nice thing. Let’s say you’re somebody in some scientific discipline, and a lot of research right now in science is taking old problems and applying machine learning to them. And then you instantly have progress. So, let’s say you’re doing some sort of medical diagnostic thing and everybody’s been using some formula for 50 years and all of a sudden you have some data, you apply machine learning to it and you get a better result. So that’s a very easy way, if you’re in academia, to write papers: you just find a bunch of things that haven’t been touched with machine learning before and you apply machine learning and then you have instant progress.
So what would be nice to see in scientific disciplines is people using AutoML tools, because usually, if you’re some other type of scientist, not a data scientist, you don’t know exactly what to do. You know the basics of machine learning, but you might not be able to get the best results that you could get if you’re just doing that on your own. So it might be nice to have a lot of people trying AutoML tools, and then in your paper you can write, “Here’s all the algorithms that the tool tried and here’s what we found was the best one.” And it makes your job easier and it makes the paper or the research better, because it’s not just that you trained one random forest and reported the result; you did a whole exhaustive search. So I think that that can help improve science as well.
So here’s just a few aspects of machine learning in general. Or I should say data science maybe. But so in the process of having a data set and trying to train a model, there’s a couple different pieces. So at the beginning you are maybe more focused on preparing the data, maybe doing some feature engineering, some feature reduction, who knows, all sorts of stuff related to the data. And then once you have your data in a good place, then you can try to train a whole bunch of models and then, because you can’t really predict in advance what model is going to work well or what algorithm is going to work well on your dataset. So generally the process of data science or machine learning is training a whole bunch of models and then finding which one is the best.
I think that in the future of AutoML, that’s actually going to be what we do. We actually can just predict which algorithms are good and then we skip the whole spending a lot of money on the cloud, and then we can be done with our job easier. That’s not really where AutoML is yet, but that’s kind of where it’s headed. So this, but right now it’s still maybe too hard to predict which algorithms are going to be good so we just ended up having to try everything or try a whole bunch of things.
And then at the end, a technique that’s used often if you’re trying to get the best performance possible is to combine several models together. How many of you have heard of kaggle.com before? Okay. So maybe like half of the people. For those of you who don’t know what Kaggle is, excuse me, it’s a website where you compete against other data scientists to win prizes, and basically you’ll have a competition and the competition centers around a particular data set and they say, “Here’s a data set, this is the thing that we’re trying to predict and we’re going to measure everybody based on some metric, like AUC or mean squared error.” And then everybody just tries to train a bunch of models and everybody fights against each other. And then they submit their predictions on a test set where only Kaggle has the answers, and then Kaggle ranks everybody on what they call a leaderboard. And so you see a big, basically, table of teams and then their scores, and then after a certain amount of time, let’s say a month or two months, the competition is over and then you have winners. And then, yeah. So then it’s just this whole community of people that love doing this.
And anyway, the point of me mentioning this is that part of winning in Kaggle is using ensembles. So it’s pretty hard to win Kaggle with just a single model. If you combine many models together, you get a more powerful model and that’s called ensembling. So if our goal is to get the best model, generally at the end, you want to ensemble them together.
Sometimes that’s not your goal. So sometimes maybe your goal is to have the best model with the fastest prediction speed, or maybe the most interpretable model or some other way of measuring what a good model is for you. But if all you care about is the performance, how well that thing is predicting what it’s trying to predict, then we just want to ensemble things together and combine the power of many models.
So here’s those same three categories, but I’m just kind of outlining a few more things in detail. So in the data preprocessing category, I’ve already mentioned the first bullet, this is all stuff that we do already in H2O, sort of for free. You don’t have to write any code to do it, then this could also include feature selection and feature extraction.
And then the third bullet is actually very important. And you’ll also see people doing this a lot on Kaggle. It’s just, if you have categorical data in your training set, that can sometimes present a lot of different issues for the machine learning algorithm. So in particular, if you have a column that has a lot of different categories. So one of the examples where we see this often is, let’s say you have some addresses or location data in your training set. And one of the columns is postal code. Well, a postal code, I don’t, does anybody know how many postal codes there are in the Czech Republic? Anybody? It’s probably a lot, but in the US it’s like 40,000 or something like that. So, maybe several thousand or 10,000 or more postal codes. So with regular.
Erin LeDell: Codes. So, most machine learning algorithms will require you to turn that data into some numeric representation. So the simplest thing to do is to take each of those 40,000 categories and make a new column, one for each level, and turn that into binary indicator columns. So you say this zip code, yes or no, zero or one. This zip code, yes or no, zero or one. And it just balloons the data out into this much wider, more difficult representation of the data. And then there are even algorithms that don’t require that. So some tree-based algorithms like random forest and GBM, for example the H2O implementation, don’t require you to do that encoding of the categories, but it will add a lot of time to the training time if you have 40,000 different levels. So that’s what I call problematic data.
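To make the "ballooning" concrete, here is a small sketch of one-hot (binary indicator) encoding in plain Python, plus the arithmetic for a 40,000-level column. The postal codes and the row count of one million are made-up values for illustration.

```python
def one_hot(column):
    """Turn one categorical column into one 0/1 indicator column per level."""
    levels = sorted(set(column))
    index = {level: j for j, level in enumerate(levels)}
    rows = []
    for value in column:
        row = [0] * len(levels)
        row[index[value]] = 1
        rows.append(row)
    return levels, rows


def one_hot_cells(n_rows, n_levels):
    """Number of cells in the encoded matrix if it is stored densely."""
    return n_rows * n_levels


levels, encoded = one_hot(["11000", "60200", "11000", "70030"])
print(levels)    # one new column per distinct postal code
print(encoded)
print(one_hot_cells(1_000_000, 40_000))  # a single column becomes 40,000 columns
```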
So one thing is you could just drop that column and forget about it, but it’s probably containing some useful information. So there’s different types of things that we can do to recode that data into something more amenable to machine learning. And one of the techniques is called target encoding. And I’m not going to explain what it is, but you essentially turn each category into a number, just like a single number, so that you can just keep one column and still get some value out of it. So these are all techniques that, as you go on in your data science career, you’ll start to learn; more and more tricks like this. So yeah, I would include that in the data preprocessing category.
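Since target encoding is only named here, a minimal sketch may help: each category is replaced by the mean of the target over the rows in that category, optionally shrunk toward the global mean. The function name and the smoothing parameter are illustrative assumptions, not H2O's implementation. In practice the encoding is usually computed on held-out folds so that a row's own target does not leak into its feature.

```python
from collections import defaultdict


def target_encode(categories, targets, smoothing=0.0):
    """Map each category to its (optionally smoothed) mean target value.

    With smoothing > 0, rare categories are pulled toward the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    encoding = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return [encoding[c] for c in categories], encoding


codes = ["A", "A", "B", "B", "B", "C"]
y = [1, 0, 1, 1, 1, 0]
encoded_codes, mapping = target_encode(codes, y)
print(mapping)         # one number per category
print(encoded_codes)   # still a single column
```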
Then in terms of model generation, there’s a lot of ways that you can just create models. So you could just manually create five or 10 models, your favorite ones, with the favorite hyper-parameters, or you could do something a bit more systematic, like create a grid search. So you’re sort of identifying all the hyper-parameters that you want to tune. So for example, in a random forest, we might have like the maximum tree depth, or the sample rate, or the minimum number of observations per node. So think about all of these as like little knobs that you have to tune. And there’s some combination of these knobs that gives you the best model, and your job as a data scientist is just to keep tuning these knobs until it spits out something good. And that’s essentially what we do. So it’s not very sophisticated, but this is basically what data science is. But there’s ways to sort of make an instruction set for how to do this in a more organized way.
So that would be grid search. And then you can also do kind of a random search where you just try a bunch of different values and then eventually you’ll find something good. Then there’s another technique that’s called Bayesian hyperparameter optimization. That’s a bit smarter of a technique, but it’s something that can’t be done in parallel. So you have to do one step at a time. So it takes longer to do that. So sometimes it might make more sense to just do a big random search in parallel. You’ll get to a better model quicker than with something more sophisticated. So it just depends on a lot of different things, which techniques you want to use to generate models.
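The difference between a full grid search and a random search over the same "knobs" can be sketched as follows. The scoring function is a made-up stand-in for "train a model and measure validation error", and the hyperparameter names and values are illustrative only.

```python
import itertools
import random

# Illustrative search space for a tree-based model.
search_space = {
    "max_depth": [3, 5, 7, 9],
    "sample_rate": [0.6, 0.8, 1.0],
    "min_rows": [1, 5, 10],
}


def toy_validation_error(params):
    """Hypothetical stand-in for training a model and scoring it."""
    return (abs(params["max_depth"] - 7) * 0.02
            + abs(params["sample_rate"] - 0.8) * 0.1
            + abs(params["min_rows"] - 5) * 0.005)


def grid_search(space):
    """Enumerate every combination of hyperparameter values."""
    names = list(space)
    return [dict(zip(names, combo))
            for combo in itertools.product(*space.values())]


def random_search(space, n_models, seed=0):
    """Evaluate only a random subset of the grid."""
    rng = random.Random(seed)
    return rng.sample(grid_search(space), n_models)


def best(candidates):
    return min(candidates, key=toy_validation_error)


grid = grid_search(search_space)
sampled = random_search(search_space, n_models=10)
print(len(grid), len(sampled))
print(best(grid))
print(best(sampled))
```

Every candidate in either list can be evaluated independently of the others, which is why these searches parallelize so easily, whereas a sequential method must wait for each result before choosing the next point.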
Okay, and then for ensembling there’s of course also a couple of different ways to do this. So the way that I’m going to talk about today is called stacking or stacked ensembles. How many people have heard of stacked ensembles before? Okay, cool. So that’s just a type of technique or an algorithm that basically uses machine learning to learn what the best combination of the models is. So it’s like two levels of machine learning happening. And the reason it works well is that you don’t put any of your human ideas into what it should be. You just let the algorithm learn from the data, and then it comes up with some way to combine those models that’s supposed to be best.
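The "two levels of machine learning" can be sketched with a tiny second-level model: given held-out predictions from two base models, a least-squares metalearner learns how much weight to give each. This is a simplified illustration with made-up numbers and hypothetical names, not H2O's Stacked Ensemble implementation; in real stacking the base predictions come from cross-validation so the metalearner never sees predictions made on training rows.

```python
def fit_metalearner(pred_a, pred_b, y):
    """Least-squares weights (w_a, w_b) minimizing
    sum((w_a * a + w_b * b - y) ** 2), via the 2x2 normal equations."""
    saa = sum(a * a for a in pred_a)
    sbb = sum(b * b for b in pred_b)
    sab = sum(a * b for a, b in zip(pred_a, pred_b))
    say = sum(a * t for a, t in zip(pred_a, y))
    sby = sum(b * t for b, t in zip(pred_b, y))
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("base model predictions are collinear")
    w_a = (say * sbb - sby * sab) / det
    w_b = (sby * saa - say * sab) / det
    return w_a, w_b


def mse(pred, y):
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)


# Made-up held-out predictions: the two models err in opposite directions.
y_true = [1.0, 2.0, 3.0, 4.0]
model_a = [1.2, 1.8, 3.2, 3.8]
model_b = [0.8, 2.2, 2.8, 4.2]

w_a, w_b = fit_metalearner(model_a, model_b, y_true)
stacked = [w_a * a + w_b * b for a, b in zip(model_a, model_b)]
print(w_a, w_b)
print(mse(model_a, y_true), mse(model_b, y_true), mse(stacked, y_true))
```

The weights are learned from the data rather than chosen by hand, and the combined prediction here has lower error than either base model alone.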
And then there’s another type of ensemble that’s called ensemble selection, but I’m not going to go into detail about that. But just be aware, there’s all different ways of doing things. And your goal as a data scientist is kind of forging a path through all these options as quick as you can to get to the best model. So it’s a lot of decisions that have to be made and a lot of considerations that you have to make.
Okay. Excuse me. So the next thing I’m just going to show you here is just a link. You probably can’t read that, but I’ll read it to you. It’s tinyurl.com/flavors-of-automl. So this is just a blog post that you can read if you want to learn more about all the different types of AutoML. And by types, I mean the different techniques that are used to achieve the goal of finding the best model. Probably the biggest distinction in terms of AutoML techniques is stuff for deep learning and then stuff for not deep learning. For deep learning, there’s something called neural architecture search, and the goal is really to find the optimal architecture of your network. That’s essentially what AutoML is in the deep learning world.
And that’s a very different approach. And so if you want to… If you’re finding yourself using deep learning, so let’s say you have image data… Basically if you have image data, and sometimes if you have text data, or more complex signals like audio or video, you’re probably going to need to use deep learning because that’s what that’s good at. But if you have tabular data, like it looks like a matrix or an Excel spreadsheet, then you can use deep learning, but you’re probably going to not be getting as good of results as with all the other techniques like gradient boosting machines, random forests. That’s not always the case, but that’s in general, how I think you can divide up machine learning into these two categories and it’s based on what your data looks like.
But a lot of times people these days are just doing deep learning for everything, because everybody’s talking about deep learning all the time. If you find yourself doing that and you have tabular data, you should maybe also consider some of these other algorithms that have been around for longer. Well, deep learning has been around for just as long as everything else, but essentially, that would be my recommendation. So this just goes into some detail about all the different types of AutoML and some of the different software packages and open source tools as well.
Okay. So now I’m going to talk about how we do AutoML inside H2O. Here’s just that same slide from before. And what I’m doing here is highlighting all the stuff that we currently are doing with our AutoML. Before, I was just speaking generally: here’s all this stuff that we could try to automate in all these different ways. But this is what we’ve come up with for us. For data pre-processing, we have just this stuff here right now. We’re working on automatic target encoding, which is going to go into H2O AutoML very soon. You can still do manual target encoding and then apply AutoML right now, but we’re trying to automate that process. And then after that, we’re going to address the feature selection category, for when you have very wide data. Right now, you probably want to do something with that before using H2O AutoML.
But in the next versions, we’ll try to do all of this automatically. In terms of model generation, what we do is we use random grid search. And the reason that we do that is, one, it’s very easy to parallelize, it’s fast, and it’s what we already have in H2O. So it just makes sense to use what we have. We’re probably going to add something like this, maybe not exactly Bayesian hyperparameter optimization. It might be something slightly different, but for now what we do is just a random search across all of our algorithms. And then we do automatic early stopping to make sure we’re not overfitting. And then we train a few different stacked ensembles at the end. And so why do we do this? One of the reasons that we chose this approach is that these two things work really well together.
They kind of work hand in hand. Stacking works really well if you have a diverse set of models that go into the ensemble. One pretty easy way to get diversity is just to randomly generate models, because then they’re pretty much different: there was nothing making them similar. As an alternative, we could do something like Bayesian hyperparameter optimization, but in that process all the models tend to be more similar to each other, because each time it’s learning how to tweak itself a little bit in some direction to get a better result.
So you get this group of models that are all very correlated with each other. And if you’re trying to do stacked ensembles with that group of models, it actually doesn’t work as well. So even though random search is, you could say, less sophisticated in terms of a search, the result is better because we have this strong, diverse set of models. And then the stacked ensemble will use machine learning to figure out how to combine them together. And if there are models that are bad, it will also learn just to ignore their input. So that’s the technique that we take, and it seems to work well.
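The idea of a second-level learner that combines base models, and learns to ignore a bad one, can be illustrated with a very small sketch. It assumes just two base models and a plain least-squares metalearner without an intercept; the function name and the numbers are hypothetical, and this is not how H2O’s stacked ensemble is implemented internally.

```python
def fit_metalearner(base_preds, y):
    """Least-squares weights (w1, w2) for combining two base models.

    base_preds: list of (p1, p2) predictions made on rows the base models
                did not train on (for example, cross-validated predictions).
    y:          the true targets for the same rows.
    """
    # Normal equations for the model  y ~ w1 * p1 + w2 * p2.
    s11 = sum(p1 * p1 for p1, _ in base_preds)
    s12 = sum(p1 * p2 for p1, p2 in base_preds)
    s22 = sum(p2 * p2 for _, p2 in base_preds)
    b1 = sum(p1 * t for (p1, _), t in zip(base_preds, y))
    b2 = sum(p2 * t for (_, p2), t in zip(base_preds, y))
    det = s11 * s22 - s12 * s12
    w1 = (b1 * s22 - b2 * s12) / det
    w2 = (b2 * s11 - b1 * s12) / det
    return w1, w2

y = [1, 2, 3, 4]
good_model = [1, 2, 3, 4]    # predicts the target perfectly
junk_model = [2, -1, 0, 1]   # carries no useful signal
w_good, w_junk = fit_metalearner(list(zip(good_model, junk_model)), y)
```

In this toy case the metalearner puts all of its weight on the good model and none on the junk one, which is the "learn to ignore their input" behavior described above.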
So this is just an overview, summarizing it. That’s also our sticker, which is in that little cup over there. If you want an AutoML sticker, make sure to find one. So: basic data preprocessing, and we’re adding some more sophisticated stuff with categorical encodings right now. Then we’re training the random grids across all these different algorithms. And this part, even though I’m just saying, quote, "train a random grid," takes a lot of time to figure out what it looks like. For each algorithm we need to decide which parameters we are going to tune, what ranges we are going to look at, and how much time we allocate to this versus that.
So this is kind of not as simple as it sounds when you just say, train a bunch of models, whatever. We have to put a lot of effort into thinking about how to best make use of our time. So if the user says, “You only have one hour,” We have to figure out what’s the best way to get a good model in one hour. And then we dynamically come up with some way to allocate the time across those different tasks. And then we train two stacked ensembles. One of them is just an ensemble with all the models. So that’s usually the best model at the end.
However, if you’re running AutoML for a long time, let’s say you run it for like 300 models or 1,000 models or something like that, then you have an ensemble that has a thousand models in it. And while it might be the best performing model, it’s probably not the fastest model to generate predictions. And deploying a 1,000 model ensemble into production just makes me a little nervous to do that because you have all these models, a lot more things can go wrong. It’s just a bit more complicated. So as an alternative, we train another stacked ensemble, which is smaller. And we call that the best of family ensemble, where we just take the best GBM, the best deep neural network, the best GLM. So the best from each algorithm class. And then we have kind of this lightweight ensemble, which is probably, if you’re comparing all the models together, probably in second place after the all models, but it’s still doing a good job and it’s easy to put that into production.
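The "best of family" selection can be sketched as picking the top model from each algorithm class. The leaderboard rows below are invented for illustration (model names, families, and scores are not real AutoML output), and the helper name is hypothetical.

```python
# Hypothetical leaderboard rows: (model_id, algorithm family, cross-validated AUC).
leaderboard = [
    ("XGBoost_1", "XGBoost", 0.784),
    ("GBM_1", "GBM", 0.783),
    ("XGBoost_2", "XGBoost", 0.781),
    ("GBM_2", "GBM", 0.779),
    ("DRF_1", "DRF", 0.770),
    ("DeepLearning_1", "DeepLearning", 0.750),
    ("GLM_1", "GLM", 0.682),
]

def best_of_family(rows):
    """Return the ids of the best model per algorithm family, best first."""
    best = {}
    for model_id, family, auc in rows:
        if family not in best or auc > best[family][1]:
            best[family] = (model_id, auc)
    ranked = sorted(best.values(), key=lambda pair: pair[1], reverse=True)
    return [model_id for model_id, _ in ranked]

all_models_base = [model_id for model_id, _, _ in leaderboard]  # "all models" ensemble
best_of_family_base = best_of_family(leaderboard)               # lightweight ensemble
```

The second list stays at one model per algorithm class no matter how long AutoML runs, which is what makes that ensemble lighter to deploy.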
So it just depends on what your use case is, which one you want to use. And then at the end of the AutoML process, what we give you is what we call a leaderboard. That’s just a ranking of all the models in order, and whichever one is on top is the winner. And of course these are all just H2O models. So you could choose the winner. You could choose any model from the list that you trained. They’re all sitting in memory, waiting for you to use them. And they’re easy to deploy because these are just H2O models. So now what does this look like? How many Python people do we have in the room?
Okay, mostly Python people. So this is what it would look like for you. We import h2o. Let’s import the AutoML functionality here. And then, like I said at the very beginning, the first thing that we do is we start up the H2O cluster. That’s what this does, h2o.init. There’s a bunch of different arguments that can be passed here, but if you don’t specify anything, it will just start locally on your laptop. And this is basically our equivalent of pandas read_csv. It’s our import_file function. You can read CSV files. You can read all sorts of different formats. CSV is obviously the simplest thing, so that’s what I use in the example. It could be a local file. It could be on the internet somewhere. It could be an HDFS location in Hadoop. It could be all sorts of things. So anyway, we’re just loading some table of data, and that’s what we call train.
And then we’re going to do two steps here. One, we’re just going to create an object. Let’s call it aml, an instance of the H2OAutoML class. There are several other arguments that we can specify here, but the very minimum that you need to do is tell it how long you want it to run for. So you could say max_runtime_secs equals 600, and it will run for 10 minutes. Alternatively, if you want to specify the number of models, you can do that instead. So you could say, run this for 20 models and then stop. So you have some options about how you instruct the tool to do its work. And then once it’s set up, you call the train method and tell it which data you have.
If you’re used to Scikit-learn, the interface here is slightly different. In Scikit-learn they have an X and a Y. The X is the data, the predictor columns, and Y is the response column, or sort of a one-column data frame. In H2O we have a slightly different interface. We have something called the training frame. That’s just your whole dataset. It includes all the predictors and the response. And then you just say which column is the thing you’re trying to predict. So y equals, and then this could be whatever the response column name is. In general, your data might just have one column that’s the response column, the outcome column, and then everything else is a predictor. But if you have an ID column, or a date, or some other columns that you don’t really need, you can ignore those.
If you pass in X equals whatever, then you can just specify which subset of those columns do you actually want to use as predictors. That way you don’t have to create multiple copies of the data where one is like, you try some set of features, one you try another set. It just makes it a little bit easier. You can just always have one data frame and then try different things by just specifying the X argument there. So then when you hit train, then it will go off and train for a while. And then when it’s done, we have this object here. And then one of the things in the object is the leaderboard. And I’ll show you the leaderboard in a minute. How many R people in the room? Yay, I like R.
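The Python workflow described above can be written out roughly as follows. The H2O calls (`h2o.init`, `h2o.import_file`, `H2OAutoML`, `train`, `leaderboard`) are the ones named in the talk; the two wrapper functions, their names, and the example column names are assumptions made for illustration, and actually running `run_automl` requires the `h2o` package and a Java runtime.

```python
def predictor_columns(all_columns, response, ignored=()):
    """Everything except the response and any columns we want to ignore."""
    skip = {response, *ignored}
    return [c for c in all_columns if c not in skip]

def run_automl(path, response, ignored=(), max_runtime_secs=600):
    """Train H2O AutoML on one file and return the leaderboard (hypothetical wrapper)."""
    # Imported lazily so the helper above can be used without H2O installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()                     # starts a local cluster if nothing else is specified
    train = h2o.import_file(path)  # local file, URL, or HDFS location

    # One frame holds predictors and response; x just selects the predictor subset.
    x = predictor_columns(train.columns, response, ignored)

    aml = H2OAutoML(max_runtime_secs=max_runtime_secs)  # or, e.g., max_models=20
    aml.train(x=x, y=response, training_frame=train)
    return aml.leaderboard

# Example column names are made up.
example_x = predictor_columns(["id", "age", "income", "churn"], "churn", ignored=["id"])
```

Because the predictors are chosen by name, trying a different feature subset only means passing a different `x`, not making another copy of the data.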
So here’s what it looks like in R. Same thing basically: load the library, do the init, load some data. And then in this case we just have one line of code instead of two, the h2o.automl function, same thing. Between our R and Python APIs, sometimes there are differences in how we do things. In the Python one, we first created this object and then we invoked a train method. That’s just a more Pythonic, or kind of Scikit-learn, way of doing things. In R, we keep all the arguments named the same, so it’s easy to know what’s what, but we have a little bit different way that we set it up here, and then we have this leaderboard.
All right. And for those of you in the room who don’t want to write any code, you can use our GUI. Basically, when you start the H2O cluster on any machine, let’s say you start it on your laptop, in the background it’s actually starting up a web server as well. So at any time when you’re running the H2O cluster, you can type in its IP address and port. The default is localhost, port 54321. If you hit that in your browser, then this thing will pop up and you can do everything by clicking around. So you can load data: click here, load some data. And then once you have some data in there, you can click here, run AutoML, and then this is the AutoML interface. Here we’re actually seeing more of the arguments exposed than I showed in the code.
You probably can’t read this from back there, but we have the training frame. We have a way where you can turn on and off different algorithms. This is actually an old screenshot because it’s missing XGBoost. So if you have some more advanced knowledge and you already know that, let’s say, you don’t want to do deep learning or GLMs, you only want tree algorithms, you can just turn those off. We’ll do them all by default because we like to be more exhaustive, but if you have some advanced knowledge or a preference for certain algorithms, you can turn certain things on and off. And there’s a bunch of other things that you can play around with, but I won’t go into detail.
Okay, so this is what the leaderboard looks like. It’s essentially just a table. Inside the table, we have the model ID, so each row represents a model that was trained, and then we have different metrics for measuring the performance. This is for a binary classification dataset, and there are also some metrics missing that we’ve added since I made the screenshot. You can change which metric you care about. By default, if you don’t say anything, we’ll sort by AUC and rank the models that way. If you want to sort by log loss, then you can put that in the argument when you train, and then it will sort by log loss, or mean per class error, RMSE, or MSE. And the one that’s missing is the area under the precision-recall curve.
Yeah, so what we see here is actually five-fold cross-validated AUC. When you run an AutoML process, by default, unless you change it, which you can, it will do five-fold cross-validation of all the algorithms. You might think that might be wasting your time. Maybe you only want to do two-fold, or maybe you don’t want to do cross-validation at all. However, we do that by default because to train a stacked ensemble, you actually need to do cross-validation. I shouldn’t say you need to, but that’s the best way to do it, and then you’ll get the best results. There’s another way: you can generate predictions on a holdout set and train the stacked ensemble that way. But in general, cross-validation is the best way of doing it. You’re going to get the best results, but you might want to play around with it.
So if you have really big data and you don’t want to imagine doing five-fold cross-validation on every model, you could turn it off and see if you can get better results by being five times faster, training five times the number of models in the same amount of time. You might get better results that way. These are all things that we’re also playing around with behind the scenes, to see if we can automate some of these decisions for you. So what we see in this leaderboard is that this model here is called stacked ensemble, all models. That’s the winner on this dataset. The next one is the best of family stacked ensemble. And then we have an XGBoost model, and a few more XGBoosts. There are like five of them there. Then we have four GBMs, another XGBoost, some more GBMs, a couple of random forests, deep learning, XGBoost, deep learning, deep learning, and the very last model of them all is the GLM.
GLMs don’t do that well, but they’re very interpretable, so they’re highly used. You can’t read these numbers, but I’ll tell you that for the bottommost model, the AUC is 0.682. And then as we get up here, it’s more like 0.78: 0.783, 0.783, 0.784. That’s the highest that we get from a single model, 0.784, and then it jumps up to 0.788 and 0.789 for the stacked ensembles.
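Why stacking wants cross-validation can be made concrete: the metalearner should be trained on predictions that each base model made for rows it did not see. A minimal sketch, assuming round-robin fold assignment and a toy "model" that just predicts the mean of its training folds (both are simplifications chosen for illustration, not H2O’s procedure):

```python
def kfold_indices(n_rows, n_folds=5):
    """Return a list of (train_idx, holdout_idx) pairs, one per fold."""
    folds = [[i for i in range(n_rows) if i % n_folds == k] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, folds[k]))
    return splits

def out_of_fold_predictions(y, n_folds=5):
    """Each row gets a prediction from a model that never saw that row."""
    preds = [None] * len(y)
    for train, holdout in kfold_indices(len(y), n_folds):
        fold_mean = sum(y[i] for i in train) / len(train)  # toy base learner
        for i in holdout:
            preds[i] = fold_mean
    return preds

y = [float(v) for v in range(10)]
oof = out_of_fold_predictions(y, n_folds=5)
```

The resulting column of out-of-fold predictions, one per base model, is what a metalearner would be trained on. The cost is that every base model is fit once per fold, which is the five-times slowdown mentioned above.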
So, okay. All right, I don’t want to spend too much more time. I’ve been talking for a while. So I’m just going to skip over this pro tips section. You can review this on the slides if you download them. One thing that I will say though: when you run the h2o.init command, something that you’re probably not used to thinking about if you do R or Python is how much memory you want to allocate for yourself. If you’re using Scikit-learn, it just uses whatever memory is available on the machine, and that’s that. Same with R. For H2O, we are more precise about how we allocate memory. So one of the arguments that you can set for h2o.init is how much memory you want to give this H2O cluster. One of the things that you might want to think about is that if you’re going to train a hundred models, they all have to fit in that memory. So you might want to give it more memory than the default when you start up.
Okay, I just want to go pretty fast through the rest of these things. So this is some of the stuff that we’ll have in the future. And you might wonder, if we’re creating these tools that do automatic machine learning, couldn’t we just go on Kaggle and win all the money, right? I would say that that is a good idea. However, it doesn’t quite work that way yet. Unfortunately for AutoML tool designers, humans are very good at being creative about feature engineering. So if you give people two months to compete in a Kaggle competition, they’re going to get very creative and do all sorts of data-related transformations that are probably going to help them win over a purely algorithm-based tool. And that’s actually one of the things that we do in Driverless AI, the other tool that we have at H2O: automatic feature engineering.
And that’s where we’re trying to automate the hard part. That’s actually harder than automating the modeling, which is fairly straightforward once you have the data ready to go. So just for an example, take very short time periods. Kaggle has this conference they call Kaggle Days. It’s a one-day thing where they do an eight-hour competition, and everybody competes. It’s the same as a normal Kaggle competition, just over a much shorter time period. In that case, AutoML tools are in fact doing very well. So there was this Kaggle Days San Francisco, and it was basically just AutoML tools competing against each other. And actually a human did end up winning. They got the number one score. So I can’t say that we’ve solved machine learning in general, but we’re getting there. This tool came in number eight. I just ran it for 100 minutes and was done. And then there’s the score.
So if you give people long enough, they’re going to be more creative than an AutoML tool, but that might change over time. And if you want to know more about how H2O AutoML compares to, there’s a handful of other open source AutoML tools. There’s not very many, but there’s a handful. You can visit this benchmark that I, along with a bunch of other AutoML toolmakers created together, because the thing about benchmarks is that usually the person who makes the benchmark is winning the benchmark. So that’s why most benchmarks are not good. So we, as an AutoML community came together and decided upon what is a fair benchmark? And then we all competed together. I think that’s the only way to really do proper benchmarking. And I’ll just skip through the results because I don’t want to spend any more time.
And if you don’t want to read that paper and you want this nice lady to read it to you, there’s something called the Kaggle Reading Group, where once a week, this woman Rachel who works at Kaggle reads papers. And so she read our paper in one hour. So if you’re feeling tired and want someone to read you a bedtime story, there you go.
Okay, so here’s some links and these are the best places to go learn. So this is the documentation, the user guide for AutoML, and then we have little tutorials that will teach you how to do it in R and Python. You pretty much already know everything just by seeing that one line of code, but this will help you. If you can copy paste it and then be done.
Okay, so this is just more links to all the things, and this is actually the full link for where the slides for this talk are. So I’ll stop there and I’ll just ask if anybody has any questions. Yeah?
Speaker 1: What about the choice of Java?
Erin LeDell: Yeah.
Pavel Procházka: Why the choice of Java?
Erin LeDell: Why Java? Oh, that’s a good question. People do ask that often. Well, we wanted something fast, so that rules out quite a bit. So… A fast compiled language, so maybe C++ or Java [crosstalk]
Speaker 2: Rust.
Erin LeDell: Rust? Yeah, we could have done Rust. Yeah. I mean, so the reason that we chose Java is, one, it’s still widely used in enterprise systems and it’s popular. But the real reason is that the co-founder of H2O is this famous Java guy who invented the Java HotSpot compiler, and so he worked at Sun Microsystems in the ’90s and was a super Java guy. So, that’s why.
Pavel Procházka: But anyway, don’t you run most of the models on GPU, not in Java?
Erin LeDell: Not for H2O. If you remember back in 2012 when we started this project, GPUs were still not quite popular for doing machine learning. Now they are. So this whole framework is built for CPUs. The only algorithm inside of H2O that can be run on GPUs is XGBoost, and that’s a third-party library. There’s more of a focus on GPUs in Driverless AI. This one has been around for a while, and so we can’t very easily change it now. So, CPU. But good question. Anyone else? Yeah, in the back there.
Speaker 3: Can AutoML automatically recognize the target variable, or how?
Erin LeDell: No. There’s nothing really that we could, just looking at a stack of columns, figure out which is the thing you’re trying to predict. So that is one of the only things that, as a user, you have to tell it what you want. Yeah?
Speaker 4: Okay, the question is regarding because the level I can customize the system up to.
Erin LeDell: Yeah.
Speaker 4: Because let’s say I have a recommendation task and I have like old firm’s website, and they have to recommend make a backup for me. So I have some place to write my physical code, say how to transform the one to another.
Erin LeDell: Yeah.
Speaker 4: Which measure from directly program metrics in some language according to… I mean, metric which is used In recommendations probably not out of the box. So I would like to go deep inside, and is there any ways to do such things?
Erin LeDell: So, there are a few options. One is, if you have a metric that we don’t have that you would like to see in there, we can add it if it makes sense. Another option is, if we don’t have the resources to do it, it’s open source, so you can add it yourself. The third option would be just to use H2O as it is, and then take the predictions and use something else, like a Python library, that can compute your metric from the predictions and the labels. But then we couldn’t optimize for that. We could compute it after the fact, but in order to actually optimize for anything, that would have to go inside of H2O as a part of it. So it might be tricky.
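The third option, computing a metric after the fact from predictions and labels, might look like the sketch below. Precision at k is used here only as an example of a ranking-style metric; the talk does not name a specific one, and the scores and labels are made up.

```python
def precision_at_k(scores, labels, k):
    """Fraction of relevant items (label 1) among the k highest-scored items."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k

# Hypothetical model scores and true relevance labels for five items.
scores = [0.9, 0.8, 0.3, 0.7, 0.1]
labels = [1, 0, 1, 1, 0]
p_at_3 = precision_at_k(scores, labels, k=3)
```

This only evaluates models that were already trained against a built-in metric; as noted above, it does not make the training itself optimize for the custom metric.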
Speaker 4: Okay.
Erin LeDell: It’s doable, but it depends maybe [crosstalk]
Speaker 4: So it means I would faster make it into Python [inaudible] digital, I would say.
Erin LeDell: Maybe, it depends. Yeah. More questions? Yeah.
Speaker 5: Yeah, I would like to ask which precautions one has to take to avoid overfitting, to avoid biasing the model, because this heavy boosting or stacking suggests to me a tendency for the model to be, say, crippled by too much bias immediately.
Erin LeDell: Mm-hmm (affirmative).
Speaker 5: So which measures do you take to reduce the dimensionality, to reduce the size of the problem, to bring in a certain robustness?
Erin LeDell: So we just use pretty standard methods of validation. We would just use cross-validation to measure… In the early stopping parameters, there are three things that you can specify. You can say which metric you are looking at, so that’s AUC or something. Then we have something called stopping tolerance, and then stopping rounds. Those two things control how sensitive we are to… If you can imagine the curve, if you’re trying to not overfit, you have the training error and you have the validation error, and you want to make sure that you stop when the validation error starts to go up.
So there’s this flat area in between where you want to stop. We have a way of specifying how flat, and for how long, you let that be before you cut it off. That’s something that we have some defaults for, but you can also play around with it. So if you find that H2O is maybe stopping too early or stopping too late, you can tweak that number a little bit so that it’s more or less sensitive. Yeah, but just standard validation metrics.
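The "how flat and for how long" idea can be sketched as a small stopping rule. This is a simplified illustration, not H2O’s exact rule: it stops when the best score in the last `stopping_rounds` scoring events fails to beat the best earlier score by a relative `stopping_tolerance`. The parameter names mirror the ones mentioned in the answer; the function name and default values are assumptions.

```python
def should_stop(history, stopping_rounds=3, stopping_tolerance=0.001,
                higher_is_better=True):
    """Decide whether training has plateaued, given validation scores so far."""
    if len(history) <= stopping_rounds:
        return False  # not enough scoring events to compare yet
    earlier = history[:-stopping_rounds]
    recent = history[-stopping_rounds:]
    if higher_is_better:          # e.g. AUC
        reference = max(earlier)
        improvement = max(recent) - reference
    else:                         # e.g. log loss
        reference = min(earlier)
        improvement = reference - min(recent)
    # Flat "enough" means the relative improvement is below the tolerance.
    return improvement < stopping_tolerance * abs(reference)

still_improving = should_stop([0.70, 0.75, 0.78, 0.80])
plateaued = should_stop([0.70, 0.75, 0.78, 0.7801, 0.7802, 0.7800])
```

A larger `stopping_rounds` lets the curve stay flat for longer before cutting off, and a smaller `stopping_tolerance` demands less improvement to keep going; both make the rule less eager to stop.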
Speaker 5: Thank you.
Erin LeDell: Mm-hmm (affirmative). Maybe one more question? Yeah.
Pavel Procházka: Do you support more loss functions? If I have some custom loss function, can I also apply it, or…?
Erin LeDell: So yes and no. It’s kind of complicated because everything’s in Java, and it would be nice to not have to write Java code to do that. For some algorithms you can have a custom loss function, like the GBM, for example. The algorithm itself supports that. So what we did was, there’s a way to write code in Python, just define your loss function, and then you essentially point to that. Actually, is it Veronica that has a good blog post about that? Somebody, or was it Honzai? Yeah, Veronica in the back, she has a whole blog post about how to do that.
So essentially she’s the one that you want to talk to, or you can read her blog post. But you just write some Python code and define your loss function, and then you can sort of point to it. It lives in a file, basically, like a little Python file and with the function definition and then we’ll optimize that instead of the things that are internal to H2O. But it’s only supported in Python, so yeah.
Speaker 6: And GBM.
Erin LeDell: And for GBMs, yeah. I think… I can’t remember. It’s definitely on our roadmap to add it to H2O, or sorry, to AutoML. I can’t remember if we’ve actually added it yet or not. I don’t think we have. But it’s very easy just to add that one argument, and use the code that they wrote to do that and then… It would just apply to the GBMs, and then you would have different defaults for the other ones.
Okay, so I’ll just leave it there so we have enough time for our next speaker. Thank you very much, and if you have more questions at the end, we can chat later.
Pavel Procházka: [inaudible] I would like to thank for invitation and nice to see so many people interested in this topic or at least in national [inaudible 00:08:12]. What this topic is about, there are some question. What does it mean? If you do not understand the title, don’t panic. There is time to understand, I will go through the slides and I believe that you will not only understand the title, but you will also find it useful in many or some of your application.
So what it will be about, it will be about vertical search blending in our search engine. Especially, we will be interested how to estimate performance on some model. We show that this specific program or this program leads to partial feedback system, this is some kind of obstruction. Then there will be some time to go through this obstruction and show some methods evaluating the performance. Finally, we applied this framework to our program.
So, let’s start. First of all, I would like to introduce, Seznam. I’m not sure if everyone here knows it. We are a Czech web portal. We have several services, and the most important for us be the web search engine. I will show our home page. Maybe… I’m sorry, there is some problem with resolution too, you see only part of that. But you can see that there are some news and many more. But what I said, the most important part for our talk, it’s also part of your work, is search engine.
So I suppose that everyone knows what search engine means. So if I ask a query, then I get a result… It’s nice. This results contains traditional, we say organic results. You know from Google, it’s the same basically. There are 10 organic results, and then there are also some, we call it verticals. This vertical is for instance, image. These images are blended between the organic results. The problem we are facing is how to blend these verticals into the organic results.
Well, so it’s a problem what we are solving. Let’s see [inaudible 01:11:35]. So if I’m a user, I’m asking a query Mars, and so such a result of… Search engine result page, this is a shortcut. And me as a user, I saw the result, and browse, scroll for instance, take some time on that. Then I make some click, if the result finds my expectation. Says now we like to have satisfied users. The satisfaction is [inaudible] only through this feedback.
Now, so it was from the user point of view, and now we are not only user, we are… We would like to know how the system works. So if I asked the query Mars, then they are asked several services, some of them respond. So we receive available verticals. Then there are 10 organic results, then there are some other types.
This is basically input to our model. It’s called logging policy. You can learn it for instance. Features like created itself, hardware and many more. The role of the logging policy is to compose the search. But it is quite difficult to make it advance. So we made some simplification that the logging policy basically predict probability that one particular vertical will be chosen. And then there is some random try out with this sample from this solution [inaudible] name action.
Well, once we make the first position, we remove [inaudible] focus on [inaudible] it’s based on the first place. And then the organic result is removed long ways and it goes again to the logging policy. It’s some propensities and so on. So this is the variable composing the search. And now the particular problem is when we have one model and we would like to improve that, how to get to know that you would meet the update would be better. So we are interested in the measurement… Performance measurement of the new model. Okay. So that’s the way, partly it’s explains the title.
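The position-by-position composition described above, where the policy gives probabilities, one item is sampled, its probability is recorded as the propensity, and the item is removed before the next draw, can be sketched roughly as follows. The recording is partly inaudible here, so this is a reading of the procedure under stated assumptions, not Seznam’s implementation; the uniform policy and the item names are placeholders.

```python
import random

def sample_action(probabilities, rng):
    """Draw one action from {action: probability}; return (action, propensity)."""
    actions = list(probabilities)
    weights = [probabilities[a] for a in actions]
    action = rng.choices(actions, weights=weights, k=1)[0]
    return action, probabilities[action]

def compose_page(policy, candidates, n_positions, rng):
    """Fill positions top-down; a chosen item is removed before the next draw."""
    remaining = list(candidates)
    page = []
    for _ in range(min(n_positions, len(remaining))):
        action, propensity = sample_action(policy(remaining), rng)
        page.append((action, propensity))  # log the propensity with the action
        remaining.remove(action)
    return page

def uniform_policy(remaining):
    """Placeholder logging policy: every remaining item is equally likely."""
    return {item: 1.0 / len(remaining) for item in remaining}

rng = random.Random(0)
page = compose_page(uniform_policy, ["organic", "images", "news"], 3, rng)
```

Logging the propensity next to each chosen action is what later makes it possible to reason about how a different policy would have behaved on the same traffic.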
So I summarize the talk or what I would like to achieve in this talk. First I introduced the program and I would like to show that it’s an instance of some kind of abstraction called partial feedback system. Then maybe you spend some time to check the possibilities of performance evaluation in partial feedback system. Finally, we apply this framework to our problem.
Well, the holy grail of this talk would be if you can imagine some [inaudible] application and it is also partial feedback system and advise the framework to… Well, some hints or tips of system that it’s also to the partial feedback system is for instance, spell checker in search engine because if user ask query, we can make a correction or we can put it… Search it as it is. And then we get different results. We do not know what it is correct to make correction or to do original query. So only thing we have is a feedback from the user. And this is command part for all the systems, for instance, recomender and so on
Okay. So let’s start with the partial feedback system obstruction. There are some basic definitions. Do not… Don’t be afraid, I try to make some analogy also with supervised learning. So I believe that everyone will be able to understand it well. So we have that context it’s for instance, query or device of the user. And it is drawn according to some distribution, which, then we have an action. The action is it could be, for instance, the chosen vertical which means you can choose a image or we can choose some picture and so on. And this is drawn according to policy or given model. And we have reward function. This is quite… This is basically what we would like to maximize it’s… It could be correct. And it is something that indicates that if the user was satisfied or not, and then we have the goal… Our goal is to maximize the expected reward. The expected receivable reward. We use some are, but don’t be scared, the most sophisticated variation would be expectations.
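Written out, the definitions above amount to the following objective. The symbols are a choice made here for illustration: $x$ for the context, $a$ for the action, $p(x)$ for the context distribution, $\pi(a \mid x)$ for the policy, $r(x, a)$ for the reward, and $V(\pi)$ for the expected reward; the slides may use different notation.

```latex
x \sim p(x), \qquad a \sim \pi(a \mid x), \qquad r = r(x, a)

V(\pi) = \mathbb{E}_{x \sim p(x)} \, \mathbb{E}_{a \sim \pi(a \mid x)} \big[ r(x, a) \big],
\qquad
\pi^{\star} = \arg\max_{\pi} V(\pi)
```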
So this is the formal definition of what we would like to optimize. Now, what is crucial is that we have partial feedback only. It means that we receive feedback only for the one action that was selected. We have no idea what would have happened if we had selected a different one. To make the analogy with classical machine learning, supervised learning, you can imagine, for instance, the MNIST task. So X would be an image, and Y would be a label. π is the propensity in this counterfactual setting, and it corresponds to a [inaudible] predictive distribution of some discriminative model that is trained in a full feedback system.
Then we have the reward function, and its counterpart would be a loss function, and the expected loss. There we would like to minimize the loss, and now we would like to maximize the reward. And the most important part: in the case of [inaudible], for one image, [inaudible] the response for all possible answers. So we know that one image would be either one, two, three, four and so on. And the label says that it is, for instance, four, so we know for all of the other labels that they are not true. But this is not the case in the partial feedback system. In a partial feedback system we just [inaudible] one action. If I select four, the feedback will be that it’s correct; if it is something else, then the answer would be no, it’s not correct, and so on.
Okay. So that’s the partial feedback system, and there is one more assumption, about stationarity. It means, first, that the distribution should not change with time, and also that the policy should not affect the input distribution P(X). This is not strictly true, and we have to pay some attention to it. Okay, so this is the partial feedback system, a large part of [inaudible] such estimate. Well, we have the true reward, and there is an expectation, which needs an analytical form of the [inaudible] function, which is of course not available. So we have only some samples, and we approximate this true expected reward only from the samples. So this estimate is random, and we will be interested in two quantities. The first one is bias and the second one is variance.
The unbiased [inaudible] means that, this is the orange one, it is centered at zero. But it has much larger variance than the blue one. Okay. So I said that we are interested in offline evaluation, [inaudible] work offline, but it is also possible to do it online. It means that I have a new model and I deploy it in production. We collect some data, and it can be shown that it directly approximates the desired [inaudible]. So this method is very straightforward, there is no need for assumptions, and so we could say that the problem is solved, but there are also some problems with this method.
First of all, the method must be deployed in production. So you have to implement it in production quality, and then it takes some time to collect sufficient data for the policy evaluation. Another drawback is that if you test a bad policy, you are actually showing bad results to users, which is also not very good. And the last point is that we can test only one method at a time. So if you would like to test many methods, it could be problematic. Okay, so it would be nice to have some alternative, but before that, just a simple example of how to evaluate the reward. It’s nothing surprising: the expectation is just approximated by an arithmetic average. It means that if I have such data, I need only the rewards, and I have a direct estimate of the reward. Okay, so let’s look for an alternative to the online evaluation. The first one is… Sorry.
No. So first I formalize the problem. We have some data that were recorded according to a logging policy, some model. We won’t go into any details on that. And then we have the new policy, and we ask how to use the data produced by the logging policy for evaluation of the new one. [inaudible] Well, if we can do it, then we have solved the problems of the A/B test, because it’s cheap. We can test as many policies as we want. I can do it even on my [inaudible], and it can then also be used for training.
There are some challenges with that. One would say that we can just take the average as before, but the problem is that the data were not collected according to the alternative policy. So the direct calculation of the mean as before leads to an estimate for the logging policy, which is not what we need. And another challenge is that once we make some estimate, the problem is how to check its validity, because it’s only an offline estimate. So we introduce some sanity checks. A sanity check does not actually say that the estimate is correct, but it tells us that if something does not hold, the estimate is not reliable. So we can at least detect when the estimate is not valid.
But the final validity should be confirmed with an A/B test, as I said. Okay, so I will spend some time on an introduction of several methods for solving this missing feedback. The first one is the direct reward estimator. It means that we train a model [inaudible] which actually predicts the reward. From the supervised learning viewpoint, it seems quite funny, because the loss function is almost always available for every data point, and in fact, here we are trying to predict the loss function in this analogy. But this reward is actually quite complex, and the prediction often does not work well. Well, if you would like to [inaudible] that, it’s quite straightforward. We just use the model instead of the true reward, and then according to the Monte Carlo approximation, we arrive at [inaudible]. So again, if we look at this concrete example, we can see that the reward is estimated only based on the model prediction. From the data set, we need only the distribution, only the distribution of the data itself.
We don’t utilize the propensity, the logging policy, or the reward that was collected. Okay, so we get some estimate which seems to be quite reasonable, but for this data… well, it’s just a demonstration, we cannot expect that from four or six data points we are able to estimate the model properly. A completely different approach is used in the inverse propensity score (IPS) estimator. In the direct method we tried to estimate the reward, I mean the reward function. But now, instead of estimating the reward function, we try to somehow correct the distribution of the data that were produced by the logging policy, and use it for the alternative policy.
It can be shown, it is just this equality, that by multiplying by this term we move from the logging policy to… sorry, from the alternative policy to the logging policy. And now we have achieved what we actually need, because the general problem was that the data were not collected according to the distribution we needed, but by making this smart step, we can now directly use a Monte Carlo approximation to estimate the reward.
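The equality referred to here is presumably the standard importance-weighting identity. A sketch of it follows, using assumed notation: π is the alternative policy, π0 the logging policy, δ the reward function, and (x_i, a_i, δ_i) the n logged samples.

```latex
V(\pi)
  = \mathbb{E}_{x} \sum_{a} \pi(a \mid x) \, \delta(x, a)
  = \mathbb{E}_{x} \sum_{a} \pi_0(a \mid x) \, \frac{\pi(a \mid x)}{\pi_0(a \mid x)} \, \delta(x, a)
  = \mathbb{E}_{x} \, \mathbb{E}_{a \sim \pi_0(\cdot \mid x)} \left[ \frac{\pi(a \mid x)}{\pi_0(a \mid x)} \, \delta(x, a) \right]

\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)} \, \delta_i
```

The middle step only multiplies and divides by π0(a | x), so it requires π0(a | x) > 0 for every action that π can select.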
Well, as I said, we do not need a model of the true reward, so this estimate does not suffer from that problem. And it can be shown that this estimate is unbiased, but it suffers from higher variance; I will show later why. It can also be seen here that the logging policy must be random, it shouldn’t be deterministic, because there is a division by π0, and a deterministic policy means that it is then zero, so if we divide by that, it makes no sense.
So once again, see the example. Here we actually use only three data points, the top ones, because as you can see, there the alternative policy selected the same action as the logging policy. And this is actually the reason why the IPS estimator has such high variance, because it effectively uses far fewer samples in the Monte Carlo estimation. So that’s the intuition for why it has much higher variance. And there is another point to consider. We are actually estimating a quantity in the range from zero to one, and this estimate gives 17, which is completely out of range. The problem is that the propensity here is very low, so the reliability of this estimate depends on the propensities that we have, because here a quite improbable event occurred: there is only a one percent chance that it would happen, and we have only six data points. So we feel that there is quite an imbalance in this. And this is actually some kind of over-fitting, because the estimate reflects this singularity, this quite improbable event, more than the true reward.
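The effect described above can be reproduced with a minimal Python sketch of the IPS estimator. The six data points are hypothetical and were chosen only to show the same qualitative behavior (a deterministic alternative policy that agrees with the logging policy on three points, one of them with a logged propensity of 1%); they are not the data from the slide, and the function name is an assumption.

```python
from typing import Sequence


def ips_estimate(
    rewards: Sequence[float],
    logging_propensities: Sequence[float],
    target_probs: Sequence[float],
) -> float:
    """Inverse propensity score estimate of the expected reward.

    target_probs[i] is the probability that the alternative policy would
    choose the logged action i; for a deterministic policy it is 1.0 when
    the policy agrees with the logged action and 0.0 otherwise.
    """
    n = len(rewards)
    total = 0.0
    for r, p0, p in zip(rewards, logging_propensities, target_probs):
        if p0 <= 0.0:
            # A deterministic logging policy would give zero propensities here.
            raise ValueError("logging propensity must be positive")
        total += (p / p0) * r
    return total / n


# Hypothetical toy log (NOT the slide data).
rewards = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
logging_propensities = [0.01, 0.5, 0.5, 0.4, 0.3, 0.6]
# The alternative policy agrees with the logged action only on the first three points.
target_probs = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]

estimate = ips_estimate(rewards, logging_propensities, target_probs)
print(estimate)  # far outside [0, 1] because of the single 1% propensity
```

Only three of the six points contribute, and the single point with propensity 0.01 carries a weight of 100, which is why the estimate leaves the valid range.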
So, if I summarize these two approaches, the direct method is quite straightforward. It tries to model the reward function. It suffers from some bias, because modeling the reward is often difficult, whereas the IPS is unbiased, but suffers from high variance. So there are advanced methods that handle the disadvantages of the IPS and the direct method. The first one is the [inaudible] estimator and the second one is the self-normalized one. I will just quickly go through them. The self-normalized estimator tries to reduce the variance by a normalization, or if you want, regularization constant. The idea is that this constant should tend to one as the number of samples tends to infinity. So we add some bias, but this bias should be small, because it tends to zero, so asymptotically there will actually be no bias, and we reduce the variance.
Well, there is one more important point. Since we do not need a model of the reward, we can actually use this kind of estimation for any metric we record, for instance some time-based metric, or you can imagine any feedback metric. Any such metric can be estimated by using this estimator; there is no need to model it. It’s a huge advantage, which actually motivated us to use this estimator in our problem, as I will show later.
I will just show the example again. We can see that using this normalization we get something between zero and one, which is quite reasonable, but still this very rare, improbable event is dominating. So it partly fixes the problem of IPS, but it does not solve it completely. Well, one more thing that I forgot to mention is that the normalization constant should be one, and this was also proposed by [inaudible] and the second guy as a sanity check for the estimate. It means that if the constant is far from one, we say that the estimate is not valid, because the assumption simply doesn’t hold. So it’s the first sanity check we will use later. And the last estimator is Doubly Robust. It’s maybe [inaudible], but it’s quite easy, because it basically combines the direct and IPS methods, and it combines them in such a way that there is the direct method, this is the first part, plus an IPS estimate of the difference. The idea is that if either the direct method or the IPS method is correct, then Doubly Robust gives the correct reward.
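The self-normalized estimator and the normalization-constant sanity check can be sketched in Python as follows. The toy log is hypothetical (it is not the slide data), and the function name and the tolerance of 0.1 are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def snips_estimate(
    rewards: Sequence[float],
    logging_propensities: Sequence[float],
    target_probs: Sequence[float],
) -> Tuple[float, float]:
    """Self-normalized IPS estimate and its normalization constant.

    The constant is the average importance weight; its expectation is one,
    so a value far from one signals an unreliable estimate.
    """
    weights = [p / p0 for p0, p in zip(logging_propensities, target_probs)]
    n = len(weights)
    norm_const = sum(weights) / n
    if norm_const == 0.0:
        raise ValueError("the alternative policy never matches the logged actions")
    weighted_reward = sum(w * r for w, r in zip(weights, rewards)) / n
    return weighted_reward / norm_const, norm_const


# Hypothetical toy log (NOT the slide data): one matched sample has propensity 1%.
rewards = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
logging_propensities = [0.01, 0.5, 0.5, 0.4, 0.3, 0.6]
target_probs = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]

estimate, norm_const = snips_estimate(rewards, logging_propensities, target_probs)

TOLERANCE = 0.1  # assumed threshold for "far from one"
passes_sanity_check = abs(norm_const - 1.0) <= TOLERANCE
print(estimate, norm_const, passes_sanity_check)
```

On this toy log the estimate stays inside [0, 1], but the normalization constant is far from one, so the sanity check flags the estimate as unreliable even though its value looks plausible.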
To get some intuition for why this holds: if you have the IPS estimate, which is unbiased but has high variance, and the direct method, which has low variance but is biased, then if you just sum them you get the blue line, which is biased and also has a lot of variance. But if you make the IPS estimate of the correction, then it is actually this one. It is also biased, and it has larger variance, so you can somehow combine the good properties of both estimators, so that you get an unbiased estimator with low variance. So that is the idea behind Doubly Robust. Another feature is that this estimator uses all the data we have available: it uses the complete data set and also the model’s predictions. So intuitively this should be the best estimator, because it uses all the data we have available, and that’s also the motivation for using it.
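A minimal Python sketch of the Doubly Robust combination, written as the direct-method term plus an IPS estimate of the difference between the observed and the predicted reward. The record layout, the function names, the toy log and the two reward models are all hypothetical and serve only to illustrate the "either part correct" property.

```python
from typing import Callable, Dict, Hashable, List, Tuple

# (context, logged action, observed reward, logging propensity)
LoggedRecord = Tuple[Hashable, Hashable, float, float]


def doubly_robust_estimate(
    log: List[LoggedRecord],
    target_policy: Callable[[Hashable], Dict[Hashable, float]],
    reward_model: Callable[[Hashable, Hashable], float],
) -> float:
    """Direct-method term plus an IPS-weighted correction of the model error."""
    total = 0.0
    for context, action, reward, propensity in log:
        probs = target_policy(context)
        # Direct method: expected predicted reward under the alternative policy.
        direct = sum(p * reward_model(context, a) for a, p in probs.items())
        # IPS estimate of the difference between observed and predicted reward.
        weight = probs.get(action, 0.0) / propensity
        correction = weight * (reward - reward_model(context, action))
        total += direct + correction
    return total / len(log)


# Hypothetical toy log with two queries and two verticals.
log = [
    ("q1", "img", 1.0, 0.5),
    ("q1", "vid", 0.0, 0.5),
    ("q2", "img", 0.0, 0.25),
    ("q2", "vid", 1.0, 0.75),
]


def always_img(context):
    # Alternative policy: always show the image vertical.
    return {"img": 1.0}


def zero_model(context, action):
    # A useless reward model: Doubly Robust then reduces to plain IPS.
    return 0.0


TRUE_REWARD = {("q1", "img"): 1.0, ("q1", "vid"): 0.0,
               ("q2", "img"): 0.0, ("q2", "vid"): 1.0}


def perfect_model(context, action):
    # A perfect reward model: the correction term vanishes.
    return TRUE_REWARD[(context, action)]


dr_with_zero_model = doubly_robust_estimate(log, always_img, zero_model)
dr_with_perfect_model = doubly_robust_estimate(log, always_img, perfect_model)
print(dr_with_zero_model, dr_with_perfect_model)
```

With the useless model the estimate falls back to IPS, and with the perfect model it falls back to the direct method; on this toy log the two values coincide.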
So right now we have several methods for how to do off-policy evaluation in a partial feedback system. But we didn’t say anything about training. Well, once we have defined the optimization objective, which I have already shown you in several ways, we can also find some way to optimize this objective. The first approach is the direct method. It means that we train a model that predicts the reward for all possible actions, basically the reward model as I said before, and then we apply it to choose the best action according to the predicted reward. The second approach is the counterfactual approach. It looks at this in a slightly different way, because in the direct method you basically need to build a regression model for all possible actions, which is quite a complex task, and in the second step you need to select the best reward. If you think about it, if you have a few discrete actions, the solution is a classification task. So this is the counterfactual approach. I will not talk about the details. If someone is interested, I recommend the video from [inaudible] from CKR 2016.
For us, it is important that this algorithm is implemented and we can simply use it in [inaudible] that way. I’m not sure if it’s also implemented in H2O, but for [inaudible], I’m not sure. So if you are interested in this kind of training, I recommend looking at the GitHub. There are very simple examples, and we actually used [inaudible].
Okay, so now we have everything we need. We know how to evaluate a policy in an offline way, and we also know how to train the model, or at least we are able to use the software to train it. So let’s join the first and second parts together. We define two setups. The first one is the per-position setup, where basically the context is the query and the corresponding features, and as I said, it’s per-position, so the position is fixed. So the context also includes the actions that were selected at the positions above. This is the context; the reward is only the click.
And the action is which vertical is selected at the given position, and the propensity is directly logged, as I showed on the first slide. Now, since there is a low number of actions, we can expect a reasonable range of propensities, so we can expect that the estimate, as I said before, should be valid.
But the problem with this setup is that the interpretation is not clear, except for the first position, where we can just say that the expected reward is the number of clicks on the first position. But for, say, the fourth position, it says, “What would be the reward if the logging policy were applied on the first three positions and the alternative policy on the fourth position?” So it tells us something, but it’s not what [inaudible] would like to hear. So we also consider another setup, and it’s the complete SERP. Here the context is just the features of the [inaudible], meaning the query and so on. The reward function is any feedback measure on the complete SERP, and the action is now the complete SERP composition. The propensities can be calculated according to the chain rule, since we said that the result on the second position depends only on the first position, the result on the third position depends only on the results for the second position, and so on. So it’s possible to apply this chain rule to express the propensity.
Well, here the interpretation is quite clear, because we basically report the final reward function. The problem is that if we multiply many small numbers, we get an extremely small number, which causes the problem of small propensities as before. So we consider a parameter K, which is the SERP length. It means how many positions of the SERP we consider. We can consider only the first position, and then this setup is equivalent to the previous one for the first position. But with this parameter we can balance, on the one hand, expressiveness, meaning what the result actually tells us, and on the other hand, the complexity, which results in [inaudible] the validity of the reward estimation.
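The chain-rule propensity of a SERP prefix of length K can be sketched in Python as a product of the logged per-position propensities. The function name and the numbers are hypothetical; the sketch assumes that each logged per-position propensity is already conditional on the positions above it.

```python
import math
from typing import Sequence


def serp_propensity(position_propensities: Sequence[float], k: int) -> float:
    """Propensity of the first k positions of a SERP under the chain rule.

    position_propensities[j] is the logged propensity of the vertical shown
    at position j + 1, given the verticals shown at the positions above it.
    """
    if not 1 <= k <= len(position_propensities):
        raise ValueError("k must be between 1 and the number of logged positions")
    return math.prod(position_propensities[:k])


# Hypothetical logged propensities for the first four positions of one SERP.
logged = [0.5, 0.4, 0.2, 0.1]

for k in range(1, len(logged) + 1):
    # The product shrinks quickly as more positions are included.
    print(k, serp_propensity(logged, k))
```

For K = 1 the result is just the logged propensity of the first position, which matches the per-position setup; for larger K the product shrinks quickly, which is the small-propensity problem mentioned above.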
So now, this is just a summary of the previous two slides. In the per-position setup, the action is the result at one position; in the second case it is the complete SERP. The propensity is just a single number, and it’s directly logged. In the second case, it’s a product, which is a little bit more complex, and so on. Well, the way we use this is that we use the fixed-position setup for training the models, and then we evaluate them on the complete SERP, which actually gives us the result management would like to see, the complete results.

So, time is running out. There is one more sanity check that is induced by our data set, and it’s based on the non-decreasing property of the click-through rate. I mean, if I do the evaluation on the first position and sum up all the [inaudible], then I get some number, and if I do it for the first two positions, then the number must be at least the same or probably higher.
So if the estimates show me that this function of S is not [inaudible], then there is a problem. So it’s another sanity check we will use during our evaluation. Now just a summary of the evaluation setup. We trained models with all of the methods I talked about using [inaudible], and evaluated the [inaudible] IPS estimates. We are also looking at random policies. We [inaudible] the sanity checks I talked about, and the evaluated metric. And our goal is to find the parameter K such that K is as large as possible, but the result is still reasonable, meaning that it does not violate the sanity checks. So these are some results on our data. We applied both sanity checks. The IPS estimator does not pass the first sanity check, with the normalization constant, which should be one. And we found that it is reasonable to consider only the first two positions, because on the third position the sanity check is violated.
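The selection of the largest K that still passes both sanity checks can be sketched in Python as follows. The estimates, the normalization constants and the tolerance are hypothetical numbers made up for illustration, not the results shown in the talk, and the function name is an assumption.

```python
from typing import Sequence


def largest_valid_k(
    ctr_estimates: Sequence[float],
    norm_constants: Sequence[float],
    max_deviation: float = 0.1,
) -> int:
    """Largest SERP length K whose estimates pass both sanity checks.

    ctr_estimates[k - 1] is the estimated click-through rate on the first k
    positions and must not decrease with k; norm_constants[k - 1] is the
    normalization constant of that estimate and should stay close to one.
    Returns 0 when even K = 1 fails.
    """
    valid_k = 0
    previous = float("-inf")
    for k, (ctr, const) in enumerate(zip(ctr_estimates, norm_constants), start=1):
        if ctr < previous or abs(const - 1.0) > max_deviation:
            break  # a sanity check is violated at this K
        valid_k = k
        previous = ctr
    return valid_k


# Hypothetical estimates for K = 1..4 (NOT the results from the talk).
ctr_estimates = [0.30, 0.42, 0.39, 0.50]
norm_constants = [1.01, 0.97, 0.95, 0.60]

best_k = largest_valid_k(ctr_estimates, norm_constants)
print(best_k)  # the estimate for K = 3 decreases, so the search stops at 2
```

The search stops at the first K that violates either check, which mirrors the stated goal of making K as large as possible while the results remain reasonable.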
So our interpretation is that we can consider only the first two positions, which is not much. The reason is probably that our logging policy does not randomize as it should, because if it randomized better, or randomized more, we would expect the results to be better. So, some concluding remarks. We showed that off-policy evaluation has some good properties: it saves a lot of resources, so it can be used as either an alternative or a complement to A/B testing. Then, policy evaluation [inaudible] system is not for free, because you can either use A/B testing, where you spend resources on something that is possibly not optimal, or you can use offline evaluation, but then you need some randomization, and this randomization is also not for free. We talked about the validity issue and the importance of sanity checks, which actually bounded our [inaudible]. We outlined how to learn off-policy from the data, at least briefly.
So if someone is interested, there is a data set from our search engine that is available on GitHub. You can reproduce these results or play with it more. The problem with the data set, as I said, is that it does not randomize too much, so it’s not well suited for counterfactual analysis, but there’s still a lot of data from real traffic that could be interesting for everyone.
Okay, so these are the references. Maybe I would like to point out the tutorial from [inaudible]; it basically covers the second part of this talk. And the third part is covered in our CKR publication, where we describe the data set. So if someone is interested in this problem, you can check it out, or if you’d like to join us. And that’s all.
So thank you for your attention. I hope that you did not get lost too early. So if you have questions, or if I can make something clear.
Speaker 7: Okay, if I understand, you try to [inaudible], yes, using this technique of offline evaluation.
Pavel Procházka: Yes.
Speaker 7: So, did you try to make that A/B testing [inaudible]
Pavel Procházka: Good question. We actually did not so far, because it has no priority.
Speaker 7: Oh, okay.
Pavel Procházka: So it’s not in progress, it’s waiting. But we are in touch with some other groups working on this approach, and I have heard that in some places it was okay, the results were as expected, and in some places it doesn’t work. If you use it, you are still not confident that it will work fine in an A/B test, or that it will match the final result. So we are still waiting for this.
Okay, other questions? Okay, so if there are no questions, thank you. If you have any interesting [inaudible], feel free to contact me and we can discuss.