Introduction to H2O Driverless AI by Marios Michailidis
This session was held at Dive into H2O: London on June 17, 2019 with Marios Michailidis from H2O.ai. He is also a Kaggle Grandmaster. Marios explains the machine learning workflow and process of H2O's Driverless AI. He also talks about the various features included with the platform.
- ML Workflow
- Driverless AI Process
- Automatic Visualization (AutoViz)
- Categorical Features
- Numerical Features
- Feature Engineering: Interactions
- Feature Engineering: Text
- Feature Engineering: Time Series
- Common Open Source Packages
- Hyperparameters
- Validation Approach
- Genetic Algorithm Approach
- Feature Ranking Based on Permutations
- Machine Learning Interpretability
- Interpretability and Accuracy Trade-Off
- Machine Learning Interpretability Examples
Marios Michailidis, Competitive Data Scientist, Kaggle Grandmaster, H2O.ai
Read the Full Transcript
Just a few words about me. I work as a competitive data scientist for H2O. H2O is a company that primarily creates software in the predictive analytics and machine learning space. So my main job is to make our products as predictive as possible. I did my PhD in machine learning at University College London. My focus was on ensemble methods, as in how we can combine all the different algorithms available out there in order to get a stronger, better prediction. And I'm not sure if people have heard about Kaggle. Kaggle is the world's biggest predictive modeling platform, where different companies post different data challenges and ask data scientists to solve them. And there is a league, a ranking, for those who are able to solve these best. At some point I had won multiple competitions.
I was able to get to the top spot there out of half a million data scientists, but what I take away from this is that I have been able to participate in a lot of different challenges, see all the different problems companies have in this machine learning space, and be able to incorporate at least some of it back into our products to make them more efficient and more predictive. As was mentioned before, H2O's goal is to democratize AI, to make certain that people can use and leverage the benefits of AI without thinking that it is so difficult to enter the field. We are very proud that we have a very big community, primarily empowered by our open source tools.
We have around 200,000 data scientists using our products, and some major organizations and banks are using them. In general, I would say we have two main products. We have the open source suite of products, where we have the main H2O library, which is available in different programming languages like Python, R, and Java, and it contains many machine learning algorithms and applications in a distributed manner. Then we've taken this and made it more efficient to work in a Spark environment, and we call that Sparkling Water. We also have a version which is a bit more optimized for GPUs. But what I will be focusing on today is this tool we have called Driverless AI, which tries to automate many steps in the machine learning process, trying to give you a good result fairly quickly.
And I will go deeper into what this specific product does. Using this tool, we have had some success in the competitive environment I mentioned before, Kaggle. For example, there was this competition hosted by BNP Paribas, where Driverless AI was able to get top 10 out of 3,000 teams within two hours. And I know that was super hard, because I also participated in it; it took me around two or three weeks to get near where Driverless AI was able to get within two hours. So that gives you an idea of how much predictive power you can get by using a tool like Driverless AI. Generally, the typical workflow when you work on a data science problem quite often looks like the one on the screen, where you normally have a data integration phase, which is essentially where you try to gather and collect all your data from different data sources, maybe different tables, different SQL databases.
And you try to put it together, after doing multiple joins, into one tabular file, where, let's say, every row in the dataset is one customer. Then normally starts a very, very iterative process from a machine learning or data science point of view, where you do multiple experimentations, playing with different algorithms after defining different validation strategies, where you keep seeing what kind of results you get and keep reiterating until you improve on this problem and get the best results possible. And this is where Driverless AI primarily operates. Once you have that dataset, with some properties that you would like to predict and build algorithms on, this is where Driverless AI takes over. It will use multiple machine learning applications in order to get the best results possible, given some constraints. To be a bit more specific, the way it works is that once you have that tabular dataset, you can imagine something in an Excel format, then you normally have a target variable, something that you try to predict out of this dataset. For example, can I predict somebody's age based on some characteristics, which is a regression problem, or can I predict if someone will default on his or her loan given some past credit history data, which is essentially a binary classification problem.
Driverless AI Process
So Driverless AI can handle multiple different types of what we call supervised problems. The next thing that you need to do is to define an objective, a measure of success. Do I want to maximize a form of accuracy, or do I want to minimize a form of error? There are various different objectives you can specify to make your model focus on specific areas, and then you allocate some resources. Obviously you are bound to the hardware you're running Driverless AI on, but you also have the ability to control how much intensity the software puts on maximizing accuracy and how much time is spent doing that. So you can always make accuracy a function of time: if you don't have much time, you can essentially tell Driverless AI to do the best it can quickly, or if you have a lot of time, it can normally get higher accuracy.
So by controlling accuracy, and given the hardware limitations and how much time you have available, Driverless AI will use all this mix in order to start giving you some outputs. Those outputs come in multiple forms. It could be some insight, generally insight and visualizations. It will be what we call feature engineering: when you put in some data, most machine learning algorithms prefer the data in a certain format, and I will explain more about this later, but you also have the option through Driverless AI to extract this transformed view of the data that can maximize your accuracy and, for example, try your own algorithms on it. You can also get the predictions for your problem based on the algorithms that we use. And there is also a module called machine learning interpretability, which becomes increasingly more important as it promotes accountability; essentially it is the process of trying to explain how a model makes predictions in humanly readable terms.
Automatic Visualization (AutoViz)
Can I understand how my black box model works in simple terms? It is a very important area that gets a lot of focus in order to make certain that businesses can trust AI in today's world. Behind our visualization we have a great guy, Leland Wilkinson. He has written The Grammar of Graphics, he was one of the first people within Tableau, and he has built a really clever automated visualization process which consists of several algorithms that scan through your data and try to find interesting patterns. And as I said before, I highlight again, this process is purely automated. It will actually search through everything, but it will not show you everything. It will show you only patterns from within the data which are important.
For example, we have some graphs that focus on outliers, so it will focus on showcasing and highlighting features within your data which contain outliers, and pinpoint them so that you're able to see them and determine whether you want them in or not, whether they're mistakes or just extreme cases. Other graphs focus on correlations and clusterings within your data. Others could be heat maps. Generally it is a comprehensive and detailed process which is completely automated and aims to give you a quick insight about your data automatically. So if you don't know what to search for, this visualization process can find some quick, good patterns in order to give you a general insight about what your data is about. And now I'm moving on to the other phase that Driverless AI actually spends quite some time on, because from my experience too, most of the predictive power is within your features, and how you transform them is very important and critical in order to be able to get good results.
Let me give you an example. You have a feature in your data called animal that takes different distinct values like dog, cat, and fish. Let's say you have a target where you try to predict cost, and this is just one of your input features. There are multiple ways to represent this variable. Most algorithms tend to understand numbers; they don't understand letters, and even if some applications let you use letters, in reality they use a numerical representation under the hood. So one way to transform it could be to use something called frequency encoding, where you count how many times dog or cat appears in your data, and you replace the category with this count. Then you have a variable that says how popular each animal is. Or you could use something quicker called label encoding, where you sort all the unique values of animal and incrementally assign a unique index. Or you could use something called dummy coding, or one-hot encoding, where you treat each one of the distinct categories of animal as a binary feature.
So, is it a dog? Yes or no. Is it a cat? Yes or no? Something else you can do, as I mentioned before: since cost is essentially your target variable, what you could do is estimate the average cost per category, in this case per animal, and create a feature that maps this. So you now have a feature that maps to the target. And quite often, especially if you have lots of categories, this kind of representation can really help algorithms converge to a good result more quickly. There are many different flavors of all these transformations; I'm just showing you on a high level what the different transformations are that we might consider. And we always search for the best ones. The answer is not always clear: sometimes you really need to go through all of them in order to find which one works best.
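The four encodings just described can be sketched in a few lines of plain Python. This is a minimal illustration with made-up toy data, not Driverless AI's actual implementation:

```python
from collections import Counter

# Toy data: an "animal" feature and a numeric "cost" target.
animals = ["dog", "cat", "dog", "fish", "dog", "cat"]
cost = [10.0, 7.0, 12.0, 3.0, 11.0, 9.0]

# Frequency encoding: replace each category with how often it appears.
counts = Counter(animals)
freq_encoded = [counts[a] for a in animals]

# Label encoding: assign each distinct category an integer index.
labels = {a: i for i, a in enumerate(sorted(set(animals)))}
label_encoded = [labels[a] for a in animals]

# One-hot (dummy) encoding: one binary column per category.
categories = sorted(set(animals))
one_hot = [[1 if a == c else 0 for c in categories] for a in animals]

# Target encoding: replace each category with the mean of the target.
sums, n = Counter(), Counter()
for a, y in zip(animals, cost):
    sums[a] += y
    n[a] += 1
target_encoded = [sums[a] / n[a] for a in animals]

print(freq_encoded)       # [3, 2, 3, 1, 3, 2]
print(label_encoded)      # [1, 0, 1, 2, 1, 0]
print(target_encoded[0])  # mean cost for "dog" = (10+12+11)/3 = 11.0
```

In practice, target encoding is usually computed out-of-fold to avoid leaking the target into the feature, a detail this sketch omits.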
Another type of transformation we might consider: imagine you have a continuous variable, age, and let's say you try to predict income. This is quite often not a very straightforward relationship. By this I mean, when you are young, your income is low, and then it increases at quite a fast pace. When you reach middle age, the income still slowly increases, but at a lower pace. And then once you go towards retirement, income starts decreasing. So there are shifts in the relationship between your input feature and what you try to predict, which is income. Being able to spot this and create some features that specifically point to these changes in relationship, for example through binning, by transforming the numerical feature to a categorical one, so that instead of a numerical feature you say, this falls in the band from this age to this age.
This can normally really help some algorithms drive better performance. Other forms of transformations could be how you replace missing values. You could use the mean, the mode, or the median, or you could just treat it as a different category in a categorical context. And there are other transformations you can consider too, like taking the log or square root of a numerical feature. Sometimes these forms of scaling can help to minimize the impact of extreme values and help some algorithms converge faster and give you better results. Another type of feature we consider is interactions among features. Can we create one more focused and more powerful feature by combining two together? For example, we could multiply or add two features, or apply other forms of mathematical operations. If we have two categorical features, we might just concatenate them into one single string, or if we have a numerical and a categorical feature, we can explore interactions in the form of a group-by statement and estimate, say, the average age of a dog or cat in this case.
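A minimal sketch of the binning, log transform, and group-by aggregation ideas above, again in plain Python with toy data (the age bands chosen here are arbitrary cut-offs for illustration):

```python
import math
from collections import defaultdict

# Toy rows: (age, animal, cost)
rows = [(23, "dog", 10.0), (41, "cat", 7.0), (35, "dog", 12.0),
        (67, "fish", 3.0), (52, "dog", 11.0), (19, "cat", 9.0)]

# Binning: map a numeric age into a coarse categorical band.
def age_band(age):
    if age < 30:
        return "young"
    if age < 55:
        return "middle"
    return "senior"

bands = [age_band(age) for age, _, _ in rows]

# Log transform: dampens the influence of extreme values.
log_cost = [math.log1p(c) for _, _, c in rows]

# Group-by aggregation: average age per animal, used as a new feature.
totals = defaultdict(lambda: [0.0, 0])
for age, animal, _ in rows:
    totals[animal][0] += age
    totals[animal][1] += 1
avg_age_by_animal = {a: s / k for a, (s, k) in totals.items()}
group_feature = [avg_age_by_animal[animal] for _, animal, _ in rows]

print(bands)          # ['young', 'middle', 'middle', 'senior', 'middle', 'young']
print(group_feature)  # dog rows get (23+35+52)/3, cat rows (41+19)/2, etc.
```

The same group-by pattern works with any descriptive statistic, maximum, standard deviation, and so on, just by swapping the aggregation.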
Feature Engineering: Interactions
And you just create a variable that showcases this. You don't need to limit yourself to only averages; it can be the maximum value, standard deviations, any form of descriptive statistic can go here. Or you could even bin it through the technique I showed you before, create those bands, convert it to categorical, and then use the concatenation technique in order to make it one bigger string. Similarly, text has its own way of being represented in order to get the most out of it with some machine learning algorithms. Something that we quite often do is, out of all the possible words that you have in your data, so in all your rows you might have a field called description, we tokenize: we break down each word into a single feature, a single variable, essentially.
Feature Engineering: Text
And then we say how many times each word appears in each row, out of all the possible words. We call that the term frequency matrix, which can come in different versions and flavors, but that's the basic idea: some words are very indicative about the context of what the sentence tries to say. And there are other techniques as well. Obviously there are ways to pre-process the text, for example applying stemming, which is removing the suffixes from words: you might have 'playing', but the core of the word is 'play', so you can just use that in your analysis. Obviously spell checking, trying different combinations of words, removing stop words that get repeated quite often but do not add much value, like 'he', 'she', 'you', 'me'. There are other techniques that can help you compress this huge matrix of all the possible words down to fewer, explainable dimensions; you can use something like singular value decomposition to do this.
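The tokenization and term-frequency idea can be sketched with a tiny made-up corpus and a hand-picked stop-word list (real pipelines use much larger stop-word lists and normalizations):

```python
from collections import Counter

# Toy corpus: each row has a free-text "description" field.
docs = ["the dog plays with the ball",
        "the cat sleeps",
        "dog and cat play together"]

stop_words = {"the", "with", "and", "a"}

# Tokenize, drop stop words, and build the vocabulary.
tokenized = [[w for w in d.lower().split() if w not in stop_words]
             for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Term-frequency matrix: one row per document, one column per word.
tf = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

print(vocab)
print(tf[0])  # counts of each vocabulary word in the first document
```

Each row of `tf` is the numeric representation of one document; variants like TF-IDF just reweight these counts.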
We could use something called word2vec; it's based on deep learning, and it tries to represent each word with a series of numbers in a way that lets you do mathematical operations between words. So if from the word 'king' I subtract the word 'man' and add the word 'woman', the closest result that comes out is the word 'queen'. And I have seen it working this way; it doesn't always work so interestingly, but this representation can definitely give you very good insight about what a word is really about. Therefore, features derived from this representation can help you a lot in NLP problems.
Feature Engineering: Time Series
Other feature engineering is applied to time series data. In the most simple form, you may just try to decompose the date: which day of the month it is, which year, weekday, week number, whether it is a holiday. But quite often the features get derived from the actual target variable versus time. Say I want to predict sales today: can I use the sales yesterday, or the sales two days ago, as my features in order to predict sales today? So essentially lag one and lag two. Or I could even take aggregated measures or windows based on these lag values, for example create moving averages for the same periods. This is extremely basic; I'm only touching the surface here of what the software does, but this is just to give you a high-level idea of the different features that the software will explore when trying to make better predictions for different problems.
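The lag and moving-average features just described can be sketched like this, on a made-up daily sales series. Note the moving average only looks at values strictly before each day, so no information from the day being predicted leaks in:

```python
# Toy daily sales series, oldest first.
sales = [100, 120, 130, 125, 140, 150]

# Lag features: sales 1 and 2 days before each observation.
lag1 = [None] + sales[:-1]
lag2 = [None, None] + sales[:-2]

# 3-day moving average of the *previous* values (no leakage from today).
window = 3
moving_avg = [None if i < window else sum(sales[i - window:i]) / window
              for i in range(len(sales))]

print(lag1)        # [None, 100, 120, 130, 125, 140]
print(moving_avg)  # [None, None, None, 116.67, 125.0, 131.67]
```

The `None` entries at the start mark rows where the feature is undefined; in a real pipeline these would be imputed or dropped.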
Common Open Source Packages
These are some of the packages that we use. They are not the only ones, but some of the most well-known ones. The key point here is that we obviously capitalize on our open source heritage, and we use many of our own libraries, but at the same time we also use other open source tools which have done extremely well. They have won multiple awards and have also done extremely well in the competitive context, like LightGBM from Microsoft and XGBoost for gradient boosting applications, random forests from scikit-learn, and Keras with a TensorFlow backend for our deep learning implementations. A lot of our data handling happens using SciPy, NumPy, and pandas, but we are also slowly transferring to datatable. It's an open source tool that H2O develops, and while we think it doesn't have the depth of pandas yet in terms of functionality, it's extremely efficient, very quick, and handles memory extremely well.
And it supports most of the major operations. I advise you to have a look if you haven't tried it; it's available in R as well, so both Python and R. Obviously, just picking a machine learning algorithm out of the box is not going to give you the best results. All these algorithms are heavily parameterized; they contain a lot of hyperparameters that you need to tune in order to make them perform well for a specific problem. Consider the XGBoost algorithm, which is essentially a form of a weighted random forest, where in each tree you can control how deep the tree should go through the tree depth, what different loss functions you can use to expand your trees, what the learning rate should be, how much each tree should rely on the previous one when it gives you predictions, and how many trees you should put in that ensemble.
And these are just some basic parameters; there are a lot more. But in order to be able to get good results, you need to find some good parameters for these algorithms, and this is something that Driverless AI also does automatically, as well as for any of the feature transformations that you've seen. In order to make good decisions, Driverless AI tries to internally create a good testing environment, so that we can try a lot of different things, a lot of different transformations and algorithms, and have the confidence that they will work well on some unobserved data. So for example, in a time series approach, where time is very important, we can use different variants, but the basic idea is that we always train on past data and we validate our models on future data.
There can be various flavors of it. One we like to use a lot is validation with many moving or rolling windows, where we will build multiple models on different periods, always shifting that validation window, that test window, towards the past, building models with any data you have before it, as a way to make certain you have a model that can generalize well in any period. When the data is essentially random with respect to time, we will most probably use a form of K-fold cross-validation, where what this says is: I'm going to separate my dataset into K parts. They don't need to be sequential like in my example. Then, K times, you're going to take a part of the data out, and on the rest you're going to fit an algorithm or try different hyperparameters.
Then you're going to make predictions on that held-out part of the data and save the results, so how well you've done, for example in terms of accuracy. And you will repeat this process, having a different part of the data now act as the test or hold-out set, multiple times, until essentially every part of your data has been part of the hold-out at some point. Then you can get an aggregated metric for how well you've done, and you can gauge how good the algorithm you used was, how good the hyperparameters you selected for it were, and whether the feature transformations you tried were good enough. So how do we decide on all of these things? Because theoretically, the combinations you can use of different algorithms, different features, and different hyperparameters...
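The K-fold splitting procedure just described can be sketched as a small helper. This is a simplified sequential split for illustration (real implementations typically shuffle the indices first):

```python
# Toy K-fold split: each index serves as hold-out exactly once.
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) pairs for k folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
print(folds[0])  # first fold: indices 0-1 held out, 2-9 used for training
```

For each fold you would fit a candidate model on the training indices, score it on the test indices, and average the K scores to compare algorithms, hyperparameters, or feature transformations.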
Genetic Algorithm Approach
You know, that space is really, really huge. So we have found a way to optimize this in an evolutionary way, in order to get some good results fairly quickly. I will go inside one Driverless AI iteration and show you what it does in order to come up with good models, good features, and good parameters. So imagine you have a very simple dataset in this format: it has four numerical features and one target you try to predict. What Driverless AI will initially do is take these four features, decide on a cross-validation strategy, normally based on the accuracy setting you chose in the beginning, how much accuracy you want, and then it will pick an algorithm semi-randomly and put in some initial parameters for this algorithm.
It will tune those parameters a little bit based on cross-validation, and then you will get some percentage of accuracy based on this test framework, for example this K-fold cross-validation, and it will come back with a ranking stating which features are the most important. Now we can use this ranking in order to reinforce or make better decisions once we start the next iteration. For example, from this ranking, maybe I can infer that the X1 feature doesn't seem to be so important, so going forward I'm not going to spend so much time on it. However, X2 and X4 seem to be a little bit more promising. So once I start the second iteration, I'm going to capitalize more on the features that seem to have more promise, by either trying better individual transformations on them or even exploiting their interaction.
But at the same time, I'm going to allow some room for random experimentation. I don't want to get trapped in this very directed approach of looking into the data; I want to always allow some room for searching, in case I find some other interesting pattern. And the process continues: I will pick an algorithm, which could be the same or a different one; I will pick some parameters for this algorithm, which again could be similar to the ones before or different; I will slightly tune those parameters based on the validation strategy we have selected; we will get a new percentage of accuracy; and this will come back with a new ranking as to which features are the most important. And this is not only limited to features: this ranking extends to algorithms and to hyperparameters.
So after a few runs, we have a good idea about what's working and what isn't, and we always keep optimizing where we see there is essentially more juice, again always allowing some room for random experimentation. So it is an exploration-exploitation optimization approach, which has its roots in reinforcement learning. Briefly, I wanted to mention that we obviously put a lot of work into determining which features are the most important ones in your data. Maybe I can quickly mention it. As you saw, our process always comes back with a ranking, and the way we can understand how good a feature is and create this ranking is this: assuming I have a dataset, I can split it into training and validation, and I can fit an algorithm on my training data.
Feature Ranking Based on Permutations
And with this fitted algorithm, I can make predictions, trying to predict the target on the validation data. That will give me some percentage of accuracy, let's say 80%. What I can do next is take that first column, that first feature, in the validation data and randomly shuffle it. So now I have one feature which is wrong in my data, and everything else is correct. If I now repeat the scoring with the same algorithm, I'm expecting that the accuracy will drop; how much the accuracy dropped is essentially how important that feature was. Normally this ranking is very intuitive and very powerful for understanding which features are really the most important to include in your algorithms in order to get the best results. Then you essentially repeat this process for every other feature. So it's a good, quick way to understand which are the main key drivers in your dataset.
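Permutation importance is easy to demonstrate end to end. In this sketch the "fitted model" is a hand-written rule that depends only on feature 0, so shuffling feature 0 should hurt accuracy a lot and shuffling feature 1 should not hurt it at all:

```python
import random

# A toy fitted "model": the target depends strongly on feature 0
# and not at all on feature 1.
def model(row):
    return 1 if row[0] > 0.5 else 0

random.seed(0)
X = [[random.random(), random.random()] for _ in range(200)]
y = [1 if x[0] > 0.5 else 0 for x in X]

def accuracy(data, targets):
    return sum(model(x) == t for x, t in zip(data, targets)) / len(targets)

baseline = accuracy(X, y)  # 1.0 by construction

# Permutation importance: shuffle one column, re-score, measure the drop.
def permutation_importance(col):
    shuffled = [row[:] for row in X]
    values = [row[col] for row in shuffled]
    random.shuffle(values)
    for row, v in zip(shuffled, values):
        row[col] = v
    return baseline - accuracy(shuffled, y)

drop0 = permutation_importance(0)
drop1 = permutation_importance(1)
print(drop0)  # large drop: feature 0 matters
print(drop1)  # zero drop: feature 1 is irrelevant
```

Ranking features by their accuracy drop gives exactly the kind of importance list described above.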
Then we use a process called stacking, because as all this process runs iteratively, we come up with various models and various transformations which could work well. So Driverless AI has a process that tries to find the best way to combine all of these in order to get the best result possible. In simple terms, imagine I have three datasets: A, B, and C. A could be my training dataset, B is my validation dataset, and C is the dataset where I eventually want to make predictions, the test dataset. What I can do is take an algorithm, fit it on the training dataset, and then make predictions for dataset B and dataset C, and save these predictions into new datasets. And I can continue this with another algorithm as well.
I can pick a different algorithm, again fit it on dataset A, make predictions on B and C, and at the same time stack these predictions onto the newly created datasets. I can keep doing that until essentially I have a dataset which consists of the predictions of multiple different algorithms. Now I can use the target of the validation dataset with another algorithm to find the best way to combine all these models, all these different algorithms I used. So we can essentially pick one new algorithm to fit on this dataset of stacked predictions, and find the best way to combine all the different algorithms in order to give a final prediction for the test dataset. This is normally a good approach called stacking, or stacked generalization, introduced by Wolpert in 1992, and normally it can give your predictions a good boost.
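The stacking idea can be sketched with stand-in base models. Here the two "fitted algorithms" are fixed functions with opposite biases, and the meta-model is a plain average; in practice each base model would be trained on dataset A, and the blender would itself be a learner fitted on the stacked predictions against the validation targets:

```python
# Toy stacking: combine two base "models" with a second-level blender.
valid = [(0.15, 0.15), (0.55, 0.55), (0.95, 0.95)]  # (feature, target)

def model_a(x):   # stand-in base model, slightly biased high
    return x + 0.1

def model_b(x):   # stand-in base model, slightly biased low
    return x - 0.1

# Level-1 dataset: each row is the base models' predictions on validation data.
p1 = [(model_a(x), model_b(x)) for x, _ in valid]
targets = [y for _, y in valid]

# A trivial "meta-model": average the base predictions.
blend = [(a + b) / 2 for a, b in p1]

print(blend)  # the opposite biases cancel out, recovering the targets
```

Even this trivial blender beats either base model alone on this toy data, which is the point of stacking: the second level learns how the base models' errors complement each other.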
Machine Learning Interpretability
And the last part, before I pass over to my colleague, or maybe before the break, is machine learning interpretability. This is a very important process for us, because it can promote accountability and bridge the gap between black box models and something people can feel comfortable with and understand. I think there are two main approaches colliding, if I can use this term. On one side, I want to have something which is 100% interpretable. For example, I look at my data, and everybody who's less than 30 years old has a 30% chance to default on his or her credit card payment, but everyone who's more than 30 years old has less chance to default, maybe 20%. That is my model; this is the model I want to put in production. I've measured these values based on historical data.
Interpretability and Accuracy Trade-Off
I'm 100% certain of how it works, so there is clear accountability. But I can probably get much better accuracy if I combine more features and make something a little bit more complicated that can search for deeper patterns within the data. Obviously, though, I cannot have an exact explanation of how it works. So what we do is use approximate explanation. The idea is that you take the predictions of your complicated model, in this case the Driverless AI model, and you try to predict them with a simpler model. So you use a simpler model in order to understand the complicated model, and that simpler model could be a regression model or a decision tree, which can give you an approximate understanding of how the complicated model works. And you can build different reason codes and representations that can help you understand, not only on a global level, but even on a per-row, per-sample basis.
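The surrogate idea can be sketched by fitting a least-squares line to a black box's predictions. The `black_box` function here is a made-up stand-in for a complex model; the point is that the surrogate is fitted to the model's outputs, not to the true labels:

```python
# Toy surrogate model: approximate a "black box" with a simple linear fit
# so its behavior can be read off from one slope and one intercept.

def black_box(x):
    # Stand-in for a complex model's prediction (e.g. default probability).
    return 0.4 * x + 0.05 * x * x + 0.1

xs = [i / 10.0 for i in range(11)]
preds = [black_box(x) for x in xs]   # the model's predictions, not labels

# Closed-form least-squares line fitted to the black box's predictions.
n = len(xs)
mean_x = sum(xs) / n
mean_p = sum(preds) / n
slope = (sum(x * p for x, p in zip(xs, preds)) - n * mean_x * mean_p) / \
        (sum(x * x for x in xs) - n * mean_x * mean_x)
intercept = mean_p - slope * mean_x

print(round(slope, 4), round(intercept, 4))
# The slope acts as a global "reason code": how much the model's output
# moves per unit of the feature, on average.
```

A decision-tree surrogate works the same way, except the readable artifact is a set of if-then rules instead of a slope and intercept.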
Machine Learning Interpretability Examples
Why has a case been scored like this? For example, this case had a 70% chance to default: 30% because he or she missed a payment last month, add 20% more because he or she missed a payment two months ago, add a little bit more because they are too young, et cetera. So using these approaches, which are essentially called surrogate models, you can get to an understanding of how the complicated models work and get very good insight about how the predictions are made, on a global as well as a local level. And the nice thing about Driverless AI is that once it has built this whole pipeline of transforming the features, building the different algorithms, and combining them, you can get different artifacts: one is based on Python, another is based on Java, called MOJO, and you can put them in production and do the scoring through them. And yeah, this is basically what I wanted to say. Happy to take any questions; if you'd like to connect, these are my details. And thank you for the opportunity you gave me to present to you.
There's another question there: will this tool negate, or at least reduce, the need for Kaggle competitions? I don't think so, because once you raise the bar, people can get to the next stage, and I also think they can push it even more, which is good. But at the same time, there are various elements of these competitions where a tool like Driverless AI would have disadvantages. And I'm saying this because this is a tool which is made to be production ready. For example, it does not look at the test data in order to improve the model, because in a real-world situation you might only have training data; you never know when your test data might come in the future. This is something that Kagglers use to their advantage.
So they will see the structure of the test data, which they already have in advance, in order to be able to get a better score. What I'm trying to say is that Kaggle is in a little bit of a different world. It's amazing that we have been able to do so well even given these disadvantages against competitors. But no, as I said, I think you raise the bar and people can push it even further, which is good. Next question: if you need to carefully format data before passing it to H2O, is there any help available as to the best transforms to apply to improve accuracy? In principle, we like the data in raw format, as long as it's in tabular format. And that's because we iterate through different transformations and try to find the best ones.
For example, you have a categorical feature and you decide to put it in as multiple dummy variables, but there might have been another transformation which could have worked better. So actually we prefer people not to do much cleaning. Now, there might be some special cases, particularly in time series, for example, where a certain pre-processing of the data might actually help, but I think that's quite a bigger discussion. In principle, we like the data in raw format. We are comfortable working with missing values and unstructured data, such as text, and finding good representations for them.
I've got a question here that I can answer; out of all those, there's that top one, I could say, to give Marios a rest on his voice. Is it flexible with different cloud solutions? Driverless AI is available on all the major cloud environments, so Microsoft Azure, Google Cloud Platform, and Amazon Web Services; it is available there within the marketplace, so feel free to go there. There's also a learning environment that we encourage you to utilize, which we will be using for this particular training, something called Aquarium. The great thing about Driverless AI technology is that you can get up and running with it in a matter of minutes because of those cloud formations that we are installed on. Do you want to take the next one, Marios?
Okay, no more rest for me. So, does Driverless provide an opportunity to manually set or restrict a list of models we want to go with for the experiment? Absolutely. Driverless can give you full control of the parameters you want to set, the models you want to try to fit, and the different feature transformations you might want to block or allow. So if you want to have some control, you can still have it. And with our next version, which is coming soon this month, you should also have the option to add your own models, your own feature transformations, and your own metrics through Python. You have the option to do all of these things, and yeah, if you want that level of control, we are happy to expose it.
The next question is on how proactively we accommodate open source updates. As described, Driverless AI sits on top of a number of open source packages, and when Driverless AI gets installed, those packages automatically get updated. So every iteration of the product automatically has the current versions of those open source packages available to it. One other point that I want to really highlight, which Marios talked about, is that H2O, when building products, utilizes the open source community. But what we also want to do as an organization is give back to the open source community. A really great example of that is what Marios said about data.table, which is now available within Python. That was a package that was available in R. We felt that to accelerate Driverless AI's data preparation capabilities, we needed something faster than pandas in the backend, and we felt that the data.table package, if it were in Python, would help accelerate Driverless AI. But rather than just take it out of the R world and turn it into something proprietary for H2O and Driverless, we said, well, let's put that package back into the open source world; and there's no point putting it back into the R open source world, because it already exists there.
So what we've done is create it, utilize it as part of Driverless AI, but also give it back to the community. So you can utilize data.table in your Python workflows. As Marios said, it hasn't got all the bells and whistles of other data manipulation frameworks within Python at the moment, but we are continually developing it and continually adding to it in the open source community. I just wanted to add that point to really emphasize that first phrase Marios talked about: H2O's mission is to democratize AI. It's to create tools, whether they be open source or commercial software, to accelerate that process, but also to give back to the community.
So, in case we have one more question: what maturity level are driverless car models at currently? I have to say, although we are called Driverless, driverless cars are not necessarily our specialty. We have worked on this problem, but based on reports I've seen, performance was actually very good. It seems that we can already achieve better-than-human performance, at least in terms of fewer accidents. The problem, and I think this is where the process is stuck at the moment, is that when there is an accident, there is the question of accountability: who is at fault, and why the accident happened. This is something we need to work on a little bit to be able to fully integrate such an AI within society. And that's why I also highlighted the importance of interpretability, something we as a company have obviously taken very seriously.
Okay. So if there are no more questions from you, that takes us nicely onto a break. What we'll do is break for 15 to 20 minutes to refuel. There are some pizzas downstairs, I believe, so feel free to grab some. Don't eat too much, because I don't want that dreaded graveyard shift where everyone comes back and feels very sleepy. But go down, have some pizza, and we'll come back and start to explore the product and get hands-on, looking at all the concepts Marios has talked about and how those are integrated into the tool.