5 Key Considerations in Picking an AutoML Platform
AutoML platforms and solutions are quickly becoming the dominant way for enterprises to implement and scale their ML and AI projects. As Forrester pointed out, these tools are trying to automate the end-to-end life cycle of developing and deploying predictive models, from data prep through feature engineering, model training, validation and deployment.
Adopting one often involves evaluating numerous platforms and identifying the best fit for the organization. The decision process is based on multiple considerations, including accuracy, ease-of-use, performance, integration with existing tools, economics, competitive differentiation, solution maturity, risk tolerance, regulatory compliance considerations and more.
Tune into this webinar to learn about the top 5 considerations in selecting an AutoML platform. Vinod is joined by one of H2O.ai’s Kaggle Grandmasters, Bojan Tunguz, for the discussion.
Presenters: Vinod Iyengar, H2O.ai & Bojan Tunguz, H2O.ai
Read the Full Transcript
Hello and welcome everybody. Thank you for joining us today. My name is Patrick Moran. I’m on the marketing team here at H2O.ai. I’d love to start off by introducing our speakers. Vinod Iyengar comes with over seven years of marketing and data science experience in multiple startups. He brings a strong analytical side, and a metrics-driven approach to marketing. And when he’s not busy hacking, Vinod loves painting and reading. And he’s a huge foodie and will eat anything that doesn’t crawl, swim, or move.
Now, Bojan Tunguz was born in Sarajevo, Bosnia and Herzegovina, and having fled to Croatia during the war, Bojan came to the US as a high school exchange student to realize his dream of studying physics. A few years ago, he stumbled upon the wonderful world of data science and machine learning, and it feels like he discovered a second vocation in life.
And some of you may know Bojan through his Kaggle competitions, and his Grandmaster title. Now, before I hand it over to Bojan and Vinod, I’d like to go over a few webinar logistics. Please feel free to send us your questions throughout the session via the questions tab, and we’ll be happy to answer them towards the end of the webinar. And secondly, this webinar is being recorded. A copy of the webinar recording and slide deck will be available after the presentation is over. Now without further ado, I’d like to hand it over to Vinod.
Thanks Patrick. Thank you everyone for joining us today for this really fun discussion, I hope. And I’m really excited to do this webinar with Bojan who’s one of our amazing Kaggle Grandmasters, and actually, as the title slide points out, he’s a double Grandmaster, which is a very, very unique thing. I’ll let Bojan explain what that means in a bit.
But, before we get started, a quick intro about who we are. For folks who don’t know about H2O, we are an open source machine learning company. Been in business for about seven years now. We have a really large data science community. Nearly 200,000 data scientists, close to 15,000 companies using us on a regular basis, and close to half of the Fortune 500 companies are also using H2O on a regular basis.
We have a huge meetup community, 100,000 plus meetup members meeting regularly in different cities around the world. I think pretty much every week there is an H2O meetup in some part of the world. So, if you’re interested, do feel free to join the community and learn more about data science.
From a product perspective, these are the products that most of the community knows us for. On the left you have our open source product H2O open core, which is 100% open source, and provides in-memory distributed machine learning, with all the popular algorithms that are very commonly used by data scientists. Sparkling Water, that’s H2O running on top of Apache Spark, again is very popular, probably the best machine learning on Spark, as we like to think of it. And our customers validate it.
So, nearly a third of our open source community uses us through Sparkling Water. And then we ported some of these algorithms to be accelerated on GPUs, and created a product called H2O4GPU, that gives you algorithms like [inaudible] GLM, and Random Forest, PCA, et cetera. Fully integrated on GPUs, so that you can take advantage of the latest and greatest hardware over there.
And finally, Driverless AI is our commercial automatic machine learning platform. That’s been our fastest growing platform in the space right now. We automate the entire machine learning workflow from data ingest, all the way to production. We’ll talk a bit more about that later in the session. But that’s an extremely popular product, built for data scientists, built by the grandmasters including Bojan, and it does some really cool stuff.
With that, let’s jump into today’s topic quick. So, why AutoML? This is a quote from Gartner, one of the reports, and this is no news to anyone who’s in the industry. There’s a deep shortage of data scientists, and ML experts. And it’s not likely to improve in the short term. Short to medium term, because the schools, the colleges where folks are coming in, they’re only beginning to adapt now to the latest techniques.
Another challenge with that is, that the space is evolving so fast that even when a technique, or a set of frameworks are popular today, they may not be popular in a year or two years down the line. So, you need to constantly adapt and that makes it really challenging for creating a pool of experienced data scientists, who can keep coming in.
And that’s a big challenge for enterprises too. So, the goal is indeed, can we use AI to build models, and help increase the productivity of employees, in different enterprises, to do this? I just want to put this cartoon up. One of my colleagues showed it, but the challenge of course with AutoML is, that when you show something like AutoML to data scientists, who are busy coding and cracking away, they’re too tired, too busy to try out something new.
And that’s why we want to first spend a little bit of time understanding what AutoML is, what the state of the art in the space is, and Bojan has a really good framework for looking at where AutoML is as a space. And then look at what the top considerations are: if you’re an enterprise, or a data scientist even, and you want to pick an AutoML platform for your company, what should you be looking at?
With that, I’m going to hand it over to Bojan over here, to let him take control, and talk about the data science workflow. And the six levels of AutoML. Over to you Bojan.
Thank you Vinod. Good afternoon, or good morning everyone, depending on which timezone you are in. As they’ve mentioned, my name is Bojan Tunguz. I am a Kaggle Grandmaster, and a Senior Data Scientist at H2O. This presentation has been adapted from a presentation I gave a few months ago at Kaggle Days. And there I wanted to take more of a bird’s eye view of what machine learning is, what data science is, and how we can automate machine learning, and take it from there.
And understand what different degrees of automation, and machine learning may mean. Here we have a general data science workflow, that goes from formulating the problem, acquiring data, to data processing, modeling, deployment, and then monitoring. Many of these stages are actually included in our Driverless AI tools, but for purposes of this bird’s eye presentation, I will just concentrate on the middle part, on this modeling part.
So, modeling is the part where you actually already have all the data, in more or less the shape that you want it to be. And you just have to create the best, most effective model that you can. Now, what the best model is, will depend on different situations, and different domains. Many times, it means just getting the most accurate model, but in many other situations, it means most robust, or some other thing that needs to be optimized.
But, to actually go beyond that, I want to start, why would anyone want to have AutoML? There are many reasons, as Vinod has already mentioned, there’s increased demand for data scientists, and machine learning applications. A demand that’s not always being met. There’s a relative shortage of people with the relevant skills. And the number of positions is far outstripping the number of degrees, or any kind of certificates that’s being offered in the field.
Sometimes you just want to try ML on some simple use case, before committing to actually having a data scientist. Yet, you want to have ML that’s as good as, or at least close to being as good as something that a data scientist would produce. So, you want to try the waters, before you swim in. Then, various non-machine learning practitioners, analysts, marketers, IT staff, they want to have some part of their workflow, that includes machine learning. But, they don’t truly necessarily need to have a full time data scientist.
So, a tool that would do most of the things that a data scientist who creates machine learning models would do, would be useful for them. And then, if the tool is good enough and does everything that you really need it to do, then you can save a lot of money. Instead of hiring a data scientist for $150,000 a year, you can get a tool that’s much cheaper than that, and then use it on an as-needed basis, only when you actually really need it. And get the most out of the investment.
Another one is faster iteration. If you have a tool that actually can do automation, and machine learning, it allows you to iterate faster over development. Instead of having to code something, and take a few days to actually implement it in the code, you can actually just take the data, put it inside a pipeline, and within a few hours, you can actually have an answer to your question. Or, see whether the data that you have, can actually answer those kinds of questions.
This is one of my favorites. If you perform more and more different experiments, you’re getting closer to actually really formulating the problem, and approaching machine learning problems as a scientist. Meaning, running experiments, looking at the outcomes, and making your decisions in a future iteration, based on those outcomes. This is one of my mantras, “Putting science back into data science.”
And then, one of the things is that the number of people who are entering the data science field is increasing. And, if there’s competition for these jobs… This is actually my son a few months ago. He picked up a book on neural networks. And I don’t know how much he’s retained, but it’s the direction where we’re headed, with data science.
So, the six levels of AutoML, I was inspired by… You might be familiar with six levels of autonomy in driverless cars? So, there are different levels. We are pretty much at level three, or four with driverless cars, depending who you ask. And fully autonomous cars would actually just pick you up from spot A, take you to spot B, without actually really needing to give any additional guidance.
So, I wanted to come up with something similar for automated ML, and I’ve come up with six different levels. Naturally, the first level would be level zero, which is no automation. You just code stuff from scratch, in probably one of the relatively low level programming languages like C++, and one person who does that to this day is the Austrian Grandmaster Michael Jahrer, who I had the privilege of working with.
And looking at his code, it’s just breathtakingly detailed and sophisticated. But, obviously most people cannot fully implement something like that. Level one would be using some high level algorithmic APIs like Sklearn, Keras, Pandas, H2O, XGBoost, and similar. So, that’s where most people who are participating on Kaggle these days, that’s where they are.
We all rely on some of these tools, and probably the reason that Kaggle has expanded, and become so prevalent and popular over the last few years, can be easily traced to the promotion of some of these tools. Some of which were actually first introduced on Kaggle. Keras, and XGBoost were actually specifically introduced for Kaggle competitions. Now, they’ve become standards for a lot of machine learning workflows.
Now, level two is where you automatically tune hyperparameters, and do some ensembling, and some basic model selection. There are several packages that are similarly high level, like Sklearn, that help you optimize hyperparameters. There’s Bayesian Optimization, which is a very popular package. It’s the one that I like the most. Hyperopt is another one. So, there are several different strategies, and many of them are getting automated nowadays, for tuning the hyperparameters for some of these algorithms.
And ensembling is the gold standard these days, for making the best and most predictive models. We all know that no single model can really outperform an ensemble of different models, that each have their own strengths, and peculiarities. So, some of the level two automatic machine learning tools, can do this ensembling by themselves. And the H2O package AutoML, is one of them, for instance. It can build several different models, XGBoost, generalized linear models and a few others, and then ensemble them into a very strong predictive model.
Now, level three is more or less where we are at right now, in the system, or maybe a little bit of a level four. This is where automatic technical feature engineering comes into play. And by that I mean, feature engineering that can be done just using technical aspects of the features, without fully understanding the domain where these features are coming from. That would be like label encoding, for categorical features. Target encoding in some cases. Binning of different features, and things like that.
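As a rough illustration of the technical feature engineering just described, here is a minimal pandas sketch of label encoding, target encoding, and binning. The toy data and column names (`city`, `income`, `target`) are assumptions made up for the example, and this is not a description of how any particular AutoML product implements these steps.

```python
import pandas as pd

# Toy data; the column names are hypothetical and only for illustration.
df = pd.DataFrame({
    "city": ["A", "B", "A", "C", "B", "A"],
    "income": [30.0, 52.0, 75.0, 41.0, 98.0, 60.0],
    "target": [0, 1, 1, 0, 1, 0],
})

# Label encoding: map each category to an integer code.
df["city_label"] = df["city"].astype("category").cat.codes

# Target encoding: replace each category with the mean target of that category.
# In practice this is usually computed out-of-fold to avoid leaking the target.
city_means = df.groupby("city")["target"].mean()
df["city_target_enc"] = df["city"].map(city_means)

# Binning: cut a numeric feature into three equal-width buckets (0, 1, 2).
df["income_bin"] = pd.cut(df["income"], bins=3, labels=False)
```

None of these steps needs any knowledge of what `city` or `income` mean, which is what makes them candidates for automation.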
So, these are technical things that don’t fully depend, by and large, on expertise in a particular domain. They can be automated, and that’s where we are right now. Another one is the introduction of graphical user interface, which I think really liberates machine learning from being just a toolkit of software engineers, and data scientists, to becoming much more accessible to a really wide spectrum of people, who want to use it for their daily work.
Now, level four: if you have some kind of very specific domain, certain feature engineering would only make sense there. For instance, in credit risk, loan per income would be a very good feature, that otherwise you may not be able to figure out if you’re just looking at anonymous features. So, this would be an example of some domain specific feature engineering.
Data augmentation is again, using different… This is more the domain specific data augmentation. An interesting example of that was a presentation that I recently came across, where in order to classify images from different eras, artifacts would be added to those images, that would only be relevant to that era. So like, if you have a 1950s image, and you add a radio from that era, that would be okay. But, if you add an iPhone to that, that would actually be a really bad idea, so that would be some domain specific data augmentation.
Now, for level five, which is the sixth level in this taxonomy, there will be full machine learning automation. It’s the ability to come up with superhuman strategies for solving hard machine learning problems, without any input or guidance. And then, possibly having fully conversational interaction with the human users. So, something that instead of talking to a data scientist, you could talk to an automation machine learning tool, and come up with a strategy of how best to formulate a problem, and what kind of model to create.
Now, we are still pretty far away from even thinking about this stuff. Many people have told me that sounds more like science fiction, than science reality. So, the big question is, is full AutoML even possible? According to the no free lunch theorem, there’s no single approach to any machine learning problem, that will outperform all the others. However, we’re not truly trying to solve every problem in the universe of possible machine learning problems, but only the ones that are very relevant to the real world.
Which greatly simplifies, and restricts the number of problems that we can solve. And for real world problems, we do have people who have expertise in different fields, who can actually come up with a strategy, that we can learn from them. I came up with this term, “Kaggle Optimal Solution,” and that will be the best solution that could be obtained through Kaggle competition, provided there are no leaks, special circumstances, and other exogenous limitations.
Kaggle has been proven to be able to outperform a lot of times, even the best domain experts in particular fields in coming up with solutions. So, this will be some kind of superhuman possibility. So, if you have enough people working on a problem for an extended period of time, who are familiar with machine learning tools, they can come up with optimal solutions that in many ways, no single human could possibly come up with.
Now, we know that these solutions do exist, because there are Kaggle competitions. And I’m claiming that if we can capture this, that would be something that a fully automated machine learning environment could do. And superhuman AutoML would beat the best Kagglers almost every time. And that would be something that’s still far ahead, and it’s not really clear how we can get to that point.
So, I’ll just briefly again, go over some of these levels. So, no automation, you implement machine learning algorithms from scratch. It requires a very high level of software engineering, and it’s not easy to actually do, for most practitioners of data science. Now, at this level in the old days, most of the people who were doing machine learning, would actually be writing the tools from scratch, which was obviously not optimal use of their time. And it’s very, very hard to scale.
I’m just giving a logistic regression here, as implemented in C++, and I’m not sure if you can actually really see it, but it’s a really bare-bones implementation. And this is my very crude depiction of what making those tools looked like back in the days, when we were doing it from scratch. Now, there are times where you do want to do something from scratch, namely when you really want to understand some of these algorithms. And there are some good resources out there, including this book that I highly recommend, where you do implement some of these algorithms from scratch.
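The C++ slide is not reproduced in this transcript. To give a flavor of what “from scratch” means, here is a minimal sketch of logistic regression trained with batch gradient descent, written in Python with NumPy rather than C++. The function names, learning rate, iteration count, and synthetic data are all assumptions for illustration, not the code shown in the talk.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iter=2000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)              # predicted probabilities
        grad_w = X.T @ (p - y) / n_samples  # gradient of mean log loss w.r.t. w
        grad_b = np.mean(p - y)             # gradient w.r.t. the bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    return (sigmoid(X @ w + b) >= 0.5).astype(int)

# Synthetic, linearly separable data, used only to exercise the code.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

w, b = fit_logistic_regression(X, y)
accuracy = np.mean(predict(X, w, b) == y)
```

Even this short version leaves out regularization, convergence checks, and numerical safeguards that a library implementation would handle for you.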
If you’ve been a data scientist for a few years, and already have some familiarity with all these algorithms, it would behoove you to take a look, and really try to understand, and even try and implement some of them from scratch. Obviously, you can’t do fully some of these more complicated residuals, and neural networks from scratch, that would be impossible. But, some of the simpler algorithms will definitely be worth your time, from an educational perspective to try to do.
We live in days where there’s a real plethora of different APIs to use for building machine learning algorithms and data science pipelines. And if you’re into that stuff, then this is really a great place to be. But, the problem with this is several times a day, I hear about a new great tool that’s being implemented, that really does part of your data science workflow, and it’s very hard to keep track of all of these things.
So, it’s great from the standpoint that many of these can do a lot of things, but it’s still very hard to keep track of all the tools that are out there. So, high-level APIs, as I’ve mentioned Sklearn, Pandas, XGBoost, Keras, H2O, they allow novices who still have some coding skills, to actually go from building very simple models, to being very proficient in a short amount of time. It’s pretty standardized. For instance, the Sklearn API is becoming the default these days.
And here for instance, is Sklearn’s implementation of logistic regression. If you remember that slide, from a few slides back where it was implemented in C++, you can immediately see why having these APIs would really make life so much easier for most practicing data scientists. For level two, we have automatic tuning. You could consider it the first real AutoML. It’s where you start taking several different models, take a data set, specify the target, and let it create the best algorithms, out of some subset of algorithms that you could think of.
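The slide itself is not part of the transcript, so the following is only a sketch of what fitting a logistic regression through the scikit-learn API typically looks like. The synthetic dataset and the parameter values are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fitting and scoring take one line each; the optimizer is hidden by the API.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
```

The same `fit` / `predict` / `score` pattern is shared by most scikit-learn estimators, which is a large part of why this API style has become a common default.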
It selects a validation strategy, cross validation versus a time-based validation split. Now, in most cases this automatic cross validation works, but the really hard cases are the ones where there are some peculiarities of the data, and simple out of time validation, or CV validation can actually really burn you. So, this is one of those things where you really need to know that your data is such, that some of these cross validation strategies can work out of the box.
And then it optimizes hyperparameters. It chooses the best learning rate, for instance, the number of trees, sub-sampling of your dataset. So, these are all hyperparameters that many of these APIs let you pick. But, it’s very hard to understand which ones are optimal for a given problem.
And then it performs basic ensembling. For instance, if you have two algorithms and they give you predictions, you can take the average of those two predictions, and that’s very simple ensembling. But, if one is performing much better than the other one, yet the other one is not completely useless, finding the right weight can be tricky. And some of these automatic tools do that for you.
So, for instance, hyperparameter optimization at level two, there are several approaches to it. There’s grid search, where you choose pretty much all the available hyperparameter values in some space, and then try every combination of them. And that’s very computationally expensive. Random search, where you just try a random subset of those hyperparameter combinations. And this is the comparison between the two. And then there’s Bayesian search, which uses Bayes’ theorem to actually do something very smart about where to look for the next potential hyperparameters, given the ones that you already looked at.
And it uses a Gaussian process to actually look for different potential hyperparameters. When ensembling, some of these “level one” algorithms are already considered ensembles, like Random Forest, or XGBoost, but for all practical purposes, as practicing data scientists, we treat them as fundamental algorithms that we want to ensemble with some other ones. So, we want to take a look for instance, at different approaches to ensembling like blending, which is what I mentioned earlier, finding a weighted average of weak models.
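To make the difference between the two validation strategies concrete, here is a small sketch using scikit-learn’s `KFold` and `TimeSeriesSplit` on twelve rows that are assumed to be in time order. The variable names are made up for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 rows, assumed to be ordered by time

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
time_split = TimeSeriesSplit(n_splits=3)

# A split "leaks the future" if any training row comes after a test row.
kfold_leaks_future = any(
    train.max() > test.min() for train, test in kfold.split(X)
)
time_split_leaks_future = any(
    train.max() > test.min() for train, test in time_split.split(X)
)
```

Shuffled K-fold trains on rows that come after some of the rows it is tested on, which is harmless for independent rows but can give overly optimistic scores on temporal data. The time-based split always trains on the past and validates on the future.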
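Here is a minimal sketch of searching for a blend weight between a stronger and a weaker model. The two prediction vectors are simulated (true value plus independent noise of different sizes), so the setup is an assumption for illustration rather than output from real models.

```python
import numpy as np

rng = np.random.default_rng(0)
y_val = rng.normal(size=5000)  # validation targets

# Hypothetical predictions: both are noisy, the second one is noisier.
pred_strong = y_val + rng.normal(scale=1.0, size=5000)
pred_weak = y_val + rng.normal(scale=2.0, size=5000)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Try blend weights 0.00, 0.05, ..., 1.00 for the strong model.
weights = np.linspace(0.0, 1.0, 21)
errors = [rmse(y_val, w * pred_strong + (1 - w) * pred_weak) for w in weights]
best_weight = float(weights[int(np.argmin(errors))])
```

With independent errors like these, the best weight lands well above one half but below one, so the weaker model still earns a small share of the blend. In practice the weight should be chosen on held-out data, not on the data used to report the final score.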
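The grid search and random search described above can be sketched with scikit-learn’s `GridSearchCV` and `RandomizedSearchCV`. The model, the parameter values, and the synthetic data are assumptions chosen to keep the example small; Bayesian search is not shown here because it needs an additional package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Learning rate, number of trees, and row sub-sampling, as mentioned in the talk.
param_space = {
    "learning_rate": [0.03, 0.1, 0.3],
    "n_estimators": [25, 50],
    "subsample": [0.7, 1.0],
}

# Grid search: evaluate every combination (3 * 2 * 2 = 12 candidates).
grid = GridSearchCV(GradientBoostingClassifier(random_state=0), param_space, cv=3)
grid.fit(X, y)

# Random search: evaluate only a random sample of 5 combinations.
random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_space,
    n_iter=5,
    cv=3,
    random_state=0,
)
random_search.fit(X, y)

n_grid_candidates = len(grid.cv_results_["params"])
n_random_candidates = len(random_search.cv_results_["params"])
```

The grid grows multiplicatively with every hyperparameter added, which is where the computational cost comes from. Random search caps the budget at `n_iter` at the price of possibly missing the best combination.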
Boosting, which is iteratively improving blending. And there’s stacking, where you create K-fold predictions of base models, and use those predictions as meta-features for your next level models. So, these are some of the basic ensembling approaches. And most of the level two algorithms, and level two AutoML solutions can do a pretty good job with these.
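Stacking as described above can be sketched with scikit-learn’s `cross_val_predict`, which produces the out-of-fold predictions used as meta-features. The choice of base models, the meta-model, and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
]

# K-fold (out-of-fold) predictions on the training set become meta-features,
# so the meta-model never sees a prediction made on a row the base model trained on.
meta_train = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit each base model on all training rows to build test-set meta-features.
meta_test = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models
])

# The next-level model learns how to combine the base predictions.
meta_model = LogisticRegression()
meta_model.fit(meta_train, y_train)
stacked_accuracy = meta_model.score(meta_test, y_test)
```

Multi-level ensembles repeat this pattern, feeding the outputs of one layer of models into the next.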
For instance, this is an example of a very complicated ensemble, where you ensemble things at several different levels for a final prediction. And this particular example is from my solution for distinguishing between cats and dogs. Now, you would think that this is one of the simplest possible, [inaudible] visualization… That you can do some very fancy, and very complicated involved ensembling. And some of the more advanced machine learning tools can do this for you.
Now, level three, which is where most of the good solutions on the market are right now, we have automatic technical feature engineering, and feature selection. So, feature engineering refers to the fact that you create new features, or you do something to existing features to make them yield more information. There are many ways of doing it. I’ve mentioned some of them: binning, finding feature interactions, target encoding.
Many of these are implemented in good AutoML solutions. Technical feature selection is a little bit harder to do, and it’s not very well done even by most experienced machine learning practitioners. Once you create many of these new features, which ones to use? Even the ones that you already have, many of them may not be optimal for your problem. So, you have some tool that automatically decides which features of the ones that you created to use, would be good for the model.
And then, technical data augmentation is where, for instance, for images you can flip, rotate, add some noise, and do other things. And then finally, we have the graphical user interface, which makes nontechnical people much more effective at creating good machine learning models, that they can use for their own workflow. And an analogy I would like to make is, if you think of word processing versus typesetting everything in LaTeX, this makes more people able to write good looking and effective documents, even if they don’t have low-level technical skills.
So, here’s automatic feature engineering. You can use different encodings for categorical data. You can use different encodings of numerical data. Aggregations, and feature interactions. These are all things that you can do [inaudible] data. Now, word embedding is when you have textual data, and you actually turn text into some kind of vector in some vector space.
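The flip, rotate, and add-noise augmentations mentioned above can be sketched with plain NumPy. The `augment` helper, the noise level, and the random stand-in image are assumptions for the example.

```python
import numpy as np

def augment(image, rng, noise_std=0.05):
    """Return simple variants of an image array with values in [0, 1]."""
    flipped = np.fliplr(image)   # mirror left to right
    rotated = np.rot90(image)    # rotate by 90 degrees
    noisy = np.clip(image + rng.normal(0.0, noise_std, image.shape), 0.0, 1.0)
    return [flipped, rotated, noisy]

rng = np.random.default_rng(0)
image = rng.random((32, 48))  # stand-in for a grayscale image (height 32, width 48)
variants = augment(image, rng)
```

Each variant keeps the original label, so one labeled image turns into several training examples without any domain knowledge about what the image shows.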
And then, for images, you can have pre-trained, neural networks that can actually turn image into again, an array that you can then use for other machine learning algorithms. And now, for technical feature selection, we would have things like selecting features based on feature importance of some test model. You train a model, see which features are most relevant, and then just keep the top five of those.
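Selecting features by the importance reported by a test model can be sketched as follows. The random forest probe model, the synthetic data, and the choice of five features follow the “top five” wording above but are otherwise assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Train a quick test model only to rank the features.
probe_model = RandomForestClassifier(n_estimators=100, random_state=0)
probe_model.fit(X, y)

# Keep the indices of the five features with the highest importance.
top_k = 5
top_features = np.argsort(probe_model.feature_importances_)[::-1][:top_k]
X_selected = X[:, top_features]
```

This is cheap because it needs a single model fit, but the ranking depends on the probe model and can be misleading when features are strongly correlated.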
You have things like forward feature selection, where you select a feature, see how it performs. Add another feature, see if the model improves, then keep that feature. And if it doesn’t, throw it away. So, go one by one through all the features, until you find a set that works well with your given model. The opposite is recursive feature elimination, where you actually start with all the features you have, and then eliminate them one by one.
Now, all of these are very computationally intensive, and maybe not optimal for most problems, especially if the number of features can easily go into the 1,000s, or 10s of 1,000s, after some of the feature engineering. There’s also permutation impact, where you just take one feature out at a time, and see how it works without that feature. So, as I’ve mentioned before, generating new features results in a combinatorial explosion of possible features. And we need some more sophisticated strategies, to actually select features that would be useful for our model. And this is what some of the best AutoML tools that are on the market right now can do for you.
For instance, one of the approaches for this is genetic programming, where you “evolve” features, and then look at which ones survive, and create the new subset of features based on those evolutionary algorithms. And that’s something that’s, for instance, implemented in H2O’s Driverless AI. Now, when it comes to technical data augmentation, there are many different things that you can do. For instance, adding stock prices to temporal data.
If you want to look at how the economy’s performing, or if some other financial indicator’s performing well, you can add stock prices over time, to see if there’s some correlation between stock markets and default rates for some kind of loan. You can add geographical information. This is something that’s very informative. But, with this one you have to be very careful, that you don’t introduce some kind of regional biases in your models, which for regulatory purposes can be a questionable practice.
And for instance, FICO scores can be another important additional piece of information. If you’re running a loan business, FICO scores obviously would be a great piece of information to have. Then there’s this trick that some of our teams on Kaggle competitions have discovered, where you have textual data and you do some kind of automated translation of the text into another language, and then translate it back into the original language. That introduces some noise, and the hope is that this noise would actually help you with the ensemble of the model that you’re building.
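The one-pass forward selection described above (try a feature, keep it only if the score improves) can be sketched like this. The `cv_score` helper, the logistic regression model, and the synthetic data are assumptions for the example; library versions of forward selection usually re-evaluate every remaining feature at each step instead of making a single pass.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

def cv_score(columns):
    """Mean 3-fold cross-validated accuracy using only the given columns."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, columns], y, cv=3).mean()

selected, best_score = [], 0.0
for feature in range(X.shape[1]):
    score = cv_score(selected + [feature])
    if score > best_score:   # keep the feature only if the model improves
        selected.append(feature)
        best_score = score
```

Even this simple variant needs one cross-validated fit per candidate feature, which is why the cost becomes a problem once feature engineering has produced thousands of columns.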
And again, this is very technically straightforward. There’s no human in the loop in this process, no understanding of either one of the languages that needs to be involved. And injecting noise is a tried and tested way of dealing with data augmentation. And then you can do various math transformation on sound and images and other things.
Then there is image specific transformation, blurring and brightening color, saturation et cetera. So there are a lot of different ways that technical data augmentation can be done. There are libraries out there that do it for you, but there are good automated solutions, that also do it in the background. GUI is again, one of the things that I think every level three AutoML needs to have. It facilitates interaction with software, allows for many non technical people to use it. And further facilitates iteration and development.
Now, level four is beyond where we are right now. It requires automating specific feature engineering. It requires the ability to combine several different data sources into a single one, suitable for ML exploration. Again, I like going back to the loan business problem, if you have different tables that come from different aspects of the loan process, how you combine these tables into a single one that can be suitable for machine learning, is nontrivial.
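To show what combining tables involves when a person does it by hand today, here is a small pandas sketch for a loan example. All table names, column names, and values are hypothetical, and the choice of aggregations is exactly the kind of decision that still depends on understanding the domain.

```python
import pandas as pd

# Hypothetical loan tables; names and values are made up for illustration.
applications = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "loan_amount": [10000.0, 25000.0, 5000.0],
    "income": [50000.0, 60000.0, 20000.0],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [120.0, 80.0, 300.0, 50.0, 250.0],
})

# Aggregate the transaction history down to one row per customer.
txn_features = (
    transactions.groupby("customer_id")["amount"]
    .agg(txn_count="count", txn_mean="mean", txn_total="sum")
    .reset_index()
)

# Left join keeps customers that have no transaction history at all.
features = applications.merge(txn_features, on="customer_id", how="left")
features["txn_count"] = features["txn_count"].fillna(0).astype(int)

# A domain-specific ratio feature, like the loan per income example.
features["loan_per_income"] = features["loan_amount"] / features["income"]
```

Which key to join on, which aggregations to compute, and how to treat customers with no history are all choices made by the person writing this code, which is the part that level four automation would have to take over.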
How do you aggregate data from a transaction history or whatnot? None of these things are easy to do. And we don’t have an automated way of doing this, to this day. Some domain specific cases may have it, but in general, we don’t have automatic feature generation. You can do some advanced hyperparameter tuning, where you go beyond Gaussian optimization, and some of those things that we’ve mentioned before.
Automated domain problem specific feature engineering, where you do aggregations according to what makes sense, for instance, for your problem. Adding a particular kind of noise to images, that really makes sense for those kinds of images, and no others. So these, to this day, require a lot of human interaction, and human understanding of the problem.
Then, the ability to combine several different data sources, joining tables. Understanding which merges make sense, and then executing them. Understanding which aggregations make sense. All these things still require a lot of human interaction. Advanced hyperparameter tuning: manual hyperparameter tuning is still my number one, go to approach for tuning hyperparameters. I’ve tried many of these packages, and by using my own intuition, and experience with tuning some of these algorithms, XGBoost for instance, I can still come up with better hyperparameters than many of these auto solutions out there.
So there’s still a lot of intuition and human understanding that’s involved, that we still are not able to completely capture. This is a deep understanding of the data. It may require some transfer learning. So, building hyperparameters based on a previous experience, with different hyperparameters.
For specific feature engineering, domain understanding will be crucial to create different features. The ability to get additional data, based on a problem in domain, and integrating into the ML pipeline. That’s still very much where real world data scientists come into play. And now we’re going to full ML automation. This is a little bit beyond what even we are able to foresee right now, but this is I feel, what fully automated machine learning solution would look like.
It requires the ability to come up with superhuman strategies for solving hard ML problems, without any input or guidance. Fully conversational interaction with a human user. So, this would be essentially having a Kaggle Grandmaster sitting in front of you, and [inaudible] solutions to your problem. Up to level four, all the automation is essentially hard coded. It’s still something that you have to come up with prior to deploying the solution, and then having the solution run itself.
But now, for full automation we need to use a machine learning approach to build it. And what that means is, use machine learning to teach an AutoML system how to do machine learning. So, machine learning is an approach to building software and products that requires a lot of data. Now, we would really require a lot of data, and use cases of how to build machine learning pipelines, and then train the model on all of them, to actually come up with a superhuman approach.
We might need some unsupervised approach. So, this will be machine learning, from machine learning, from machine learning. The idea in principle is simple, give the ML system a large collection of ML problems, and then solutions, and then let it learn how to build ML systems. Now, the idea is simple, execution is hard, because we still have relatively few machine learning problems to work with.
It’s very daunting. Even the simplest ML problem requires thousands of instances to train on, for decent performance. However, we probably don’t need to build all this from scratch. We might be able to bootstrap on top of the previous levels of automation. So, if you have all these previous levels, and it’s working fine, then you can do something maybe like reinforcement learning. Or you can use unsupervised techniques.
If we could parametrize our problems, and parametrize possible solutions to them, then we can come up with a space, or universe, of human relevant ML problems, and we might be able to find some patterns in there. So, unsupervised methods are much better suited for situations where you don’t have too much data, but you still want to understand something about it, and come up with solutions. And then there’s reinforcement learning. Building ML solutions, and based on how well they perform, adjusting the architecture.
So this would mean having an environment where machine learning tools can actually learn from the experience of trying to solve machine learning problems. This would be adversarial AutoML. Have AutoML systems compete against each other, make a Kaggle competition that’s only open to AutoML systems, and iterate. So, this is something that could possibly come in the future, where you have different AutoML systems compete with each other, and learn from the experience.
And I would argue that fully conversational interaction with a human user would be another thing that you would need from such a system. So again, this would be an AutoML system that doesn’t necessarily pass the full Turing test, but has enough of a domain specific understanding of machine learning to pass a Turing test for machine learning problems. And be able to interact with you, like you would with any other data scientist. It would democratize machine learning, and make it even more accessible than it is right now.
And a lot of times, even formulating a machine learning problem is a very iterative process, where you interact with people, domain experts, other data scientists, analysts, to actually come up with something that can be really useful for everyone. And actually, there are a few downsides, but I’m not going to spend too much time on these. I’m just going to click through them, because I want to hand over to Vinod, who will introduce you a little bit more to our Driverless AI system. All right.
Thank you Bojan. First, it’s extremely useful to understand where the state of the space is. So, in a nutshell, to recap what Bojan said. I think of them almost as Gen One, which is basically a lot of the opensource frameworks, which do hyperparameter tuning, ensembling, and a leaderboard-style approach. Then you have Gen Two, which Bojan talked about. It’s coming up to level three, and level four, where you’re getting to feature engineering, and evolutionary algorithm techniques to do a lot of the work.
And then going forward to Gen Three, which is basically getting to full AutoML, where AI is available at your fingertips, as you’re looking at data. So, it’s using new interfaces too, in different places. Now, with that said, if you’re an organization, or data scientist looking to pick a platform to do AutoML, what should you be looking at today? Based on what the state of the industry is, here are the top considerations in my mind.
To begin with, think of how you can automate the entire workflow. Bojan mentioned this earlier, when he specifically talked about the feature engineering and modeling, but upstream of that you typically have data prep and data ingest, and downstream you have model deployment and monitoring. So, how can your platform automate as much as possible, so that you can spend more time thinking about the problem framing, and actually evaluating if it’s doing the right thing?
Portability and flexibility become important. We’ll spend a little bit more time thinking about it, but what that means is you don’t want to be locked into one vendor, or one environment. So, there are considerations on where the data is. Is it on cloud, on-prem, where is the compute running? What about running it on different hardware, like using GPUs, or CPUs, or the latest processors for that matter? And then running them in different configurations as well.
Along with that also comes the idea of extensibility. Until now, we’ve been thinking about automation, but as Bojan mentioned, you’re never going to get to full automation really. So, in the interim, can you extend it, can you customize a platform? Add your own flavor to it? So, this could be newer algorithms that the platform may not have today, but are there ways to add those? So, that’s an important consideration.
And finally, explainability, trust and transparency. I think that’s really important. With automation, it becomes even more critical, because essentially now you have a massive black box. So, if you think the algorithms are black boxes themselves, now you have a massive black box, which is AutoML. You give it data, and it gives you a model and a prediction. So, if you can’t explain and validate the model, then that becomes a big challenge.
So, thinking about what tools are available to achieve that is a big consideration when you’re picking a platform. And then you also want future proofing. Part of the reason why you’re doing AutoML is that one person cannot learn everything, or be a master in everything, and it goes to the no free lunch theorem. I have my own take on it: there’s a no free lunch theorem for data scientists as well. No one data scientist is an expert in every single field.
Even in Kaggle, we have some folks who are deep learning experts, some folks who are GBM experts, some folks who focus on feature engineering. So, you want the platform to do all of it for you, and find the latest and greatest in every single field. So, think about that, as it becomes critical.
Just to extend that a little bit further, when you look at the challenges in the AI model development workflow, there is feature engineering, model building, and model deployment at the high level. But even within those there are sections. Looking at things like simple encoding, advanced encoding, feature engineering, feature generation, which is now looking at interaction effects and transformations.
And then within the model building part of it itself, you have the algorithm selection, what framework to use, tuning the parameters, and then ensembling. I know that Bojan touched on that quite a bit. And then when it comes to deployment, you are looking at, can you generate a pipeline? Can you easily deploy it, can you monitor it? Can you explain the result and then document the whole workflow?
And all of these are time consuming tasks, because each of them requires a lot of work. They’re often iterative, and [inaudible]. So, look at tools that can automate the entire workflow. At H2O for example, we have two AutoML offerings. One is the H2O opensource AutoML. That’s very widely used. And that basically focuses on the model building part, I’d say. It automates the model part properly, so it does all the algorithm selection for you, the parameter tuning and then the ensembling.
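The algorithm selection step can be pictured as fitting several candidates and ranking them on held-out data. The following is only a toy sketch of that idea in plain Python, not how H2O AutoML is implemented; the data, the three candidate learners, and the `leaderboard` helper are all made up for illustration, and parameter tuning and ensembling are left out.

```python
from statistics import mean

# Hypothetical toy regression data where y is roughly 2 * x.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
valid = [(5, 10.1), (6, 11.9)]


def fit_mean(data):
    """Baseline: always predict the mean of the training targets."""
    m = mean(y for _, y in data)
    return lambda x: m


def fit_linear(data):
    """Ordinary least squares with a single feature."""
    xs, ys = zip(*data)
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in data) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return lambda x: y_bar + slope * (x - x_bar)


def fit_nearest(data):
    """One-nearest-neighbour: copy the target of the closest training point."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]


def mse(model, data):
    return mean((model(x) - y) ** 2 for x, y in data)


def leaderboard(candidates, train, valid):
    """Fit every candidate and rank by validation error, best first."""
    return sorted((mse(fit(train), valid), name) for name, fit in candidates.items())


board = leaderboard(
    {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}, train, valid
)
for error, name in board:
    print(f"{name:8s} {error:.4f}")
```

A real platform would add cross-validation, a tuning loop per algorithm, and an ensemble built from the top of the board.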
But, it also does some simple encoding, and generates a pipeline, at least for the model building portion. And then when you jump forward to H2O Driverless AI, what we set out to do is automate as fully as possible the entire workflow. So, we took the remaining pieces as well, and automated them. So, that becomes useful for you to see when you’re picking a platform.
Now, coming to the portability and flexibility question. As I’ve mentioned earlier, you’re really looking at saying, “Hey, can your platform run on cloud? Can your platform run in a hybrid fashion? On-prem?” And what are the considerations when you’re doing that in a different way?
So, for example, as your data size increases, you want the ability to run it in a distributed fashion, the ability to handle larger data sets, and handle varied data sets for example. And those become critical. For example, data sources could be coming from a whole lot of different places. So, you cannot restrict yourself to one single data source. It could be coming from on-prem, or cloud.
And then integration. Because, there is a whole bunch of tools in the arena, that you probably are going to be needing to use, along with your AutoML platform. So, think about picking platforms which can integrate as much as possible with other tools, so if you already have a certain set of tools for data managing for example, or data prep, or even for infrastructure management, the ability to integrate with existing frameworks and platforms is critical.
In the same breath, also think about all the different data sources that your data might be coming from. So, you cannot just have the simple CSVs, or the TSVs, but there are big data formats. There are data frameworks, data frames which are coming from different platforms. You want a breadth of connectors, and integration services.
So, what does this look like? These are some examples that we published ourselves with some of our partners. So, when you talk about flexibility, this is a cloud deployment that a lot of our customers use. Data coming in from Snowflake, and other data warehouses in the cloud. Pulled in through Alteryx, to do the data prep, and then Driverless AI for the feature engineering, and model building. And all of this running on either AWS, or Azure, or GCP. And this is just one deployment architecture that’s very popular with the customers.
But, very similar to that, we can do a hybrid one. And again, instead of running through cloud based databases, you’re running it on something like BlueData, which in turn connects to on-prem HDFS, but also bulk storage. Again, using some data prep tool in between, and then running the feature engineering, and machine learning in Driverless AI. So, the environments you run in are going to be very different.
And if I flip this to a completely on-prem based solution, now you are looking at data integration from a whole bunch of data sources, like SQL data warehouses, or HDFS, or just file systems like MinIO. And pulling that in to do the data quality and transformation in something like Spark for example. And then running the feature engineering, and model building in Driverless AI. So, you want the platform to be flexible across these different types of deployments.
And as you’re considering that, think about these things. Data gravity becomes critical. As your data sizes get larger and larger, you don’t really want to ship it across. So, find a platform that can run close to your data. That way you can avoid shipping the data over the network. It also makes for a very secure connection, in the sense that often some of this data might include private data, PII, or HIPAA compliant data, in which case you don’t want it to be sent to different places without a lot of careful vetting.
So, if everything can run within your own secure firewall or VPC, then it’s perfect. Similarly, look at frameworks. There’s no single tool or solution. And this space is rapidly evolving too. So, you need to know what you don’t know. Having an awareness of what you’re not good at is important, so that you can have the platform find the best ideas for you. So, the latest technologies, latest techniques, new networks, newer architectures.
Can you be the first to market? Pick a platform that can give you the velocity of innovation, and implement that for your company. And finally, similarly in the same breath, you have hardware choices. There’s a lot of improvement and progress happening on the hardware front. The latest CPUs and GPUs are much, much faster than even a couple of years ago. And especially when it comes to AutoML, it is a very, very compute intensive, and resource intensive operation.
So, you do want to take advantage of the latest innovations there, so find platforms that can take advantage of the latest [inaudible] GPUs, or [inaudible] for that matter, whatever you can. And this is just an example of how with Driverless AI, we are able to run this in completely different configurations, using something like Optane DC persistent memory from Intel, but doing that with a very heavily overloaded persistent memory, and the latest CPU infrastructure.
And conversely on the other side, with the latest DGX-1, and DGX-2, we can run on GPUs as well, so up to 16 GPUs scaled on a single machine, with 30 Gigs of GPU memory each, can give you some phenomenal performance as well. So, find platforms that can take advantage of the latest hardware. These are a couple of good examples over here.
Let’s talk about extensibility for a second. What I mean by that is what I mentioned earlier. At least, we think of it in three different ways. So, first is, even if your platform does some automatic feature engineering, what about other stuff that you know as a domain expert, can you bring those in? So, custom feature engineering is important. Doing your own transformations, your own domain specific interactions, for example. Bringing them into the platform, can the platform take advantage of them?
Custom algorithms. Obviously there’s a whole bunch of different frameworks out there, which are really good at a bunch of different problems, but there’s new stuff coming all the time. So, you want to be able to try new stuff, and see quickly if that makes sense for you. So, having the ability to bring custom ML algorithms is critical.
And finally, custom loss functions. So, this is a part which is going to be very tough to automate anyways. This is a part that you know very well. You know what your customer’s lifetime value is. So, sure enough a highly valued customer, as compared to a not so valuable customer is very different. So, you want your loss function to be optimized for that. You want to optimize for business metrics that are important for you. So, find a platform that allows you to do that.
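One way to picture a business-aware loss is an ordinary log loss where each row is weighted by what that customer is worth. This is a sketch under assumptions, not a Driverless AI API: the function name, the lifetime-value numbers, and the two hypothetical models are all invented for illustration.

```python
import math


def value_weighted_log_loss(y_true, y_prob, customer_value):
    """Binary log loss where each row counts in proportion to its business value."""
    eps = 1e-15  # keep probabilities away from 0 and 1 so log() stays finite
    loss = 0.0
    for y, p, w in zip(y_true, y_prob, customer_value):
        p = min(max(p, eps), 1 - eps)
        loss += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / sum(customer_value)


# Two customers who both churned; the first is worth far more (assumed values).
y_true = [1, 1]
lifetime_value = [1000.0, 10.0]

# Hypothetical model A is confident on the valuable customer, model B on the other.
probs_a = [0.9, 0.4]
probs_b = [0.4, 0.9]

loss_a = value_weighted_log_loss(y_true, probs_a, lifetime_value)
loss_b = value_weighted_log_loss(y_true, probs_b, lifetime_value)
plain_a = value_weighted_log_loss(y_true, probs_a, [1.0, 1.0])
plain_b = value_weighted_log_loss(y_true, probs_b, [1.0, 1.0])
print(loss_a, loss_b, plain_a, plain_b)
```

With equal weights the two models look identical, while the value-weighted version prefers the model that gets the high-value customer right, which is the kind of distinction a generic metric cannot make for you.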
In Driverless AI for example, we have now the ability to add custom transformers. Literally you can bring in your simple Python recipe that can do this for you. And this is just one way of doing it, using a very simple scikit-learn style API, so any data scientist can implement their own transformer, or their own custom model. And then Driverless AI’s engine will use them, just as they are needed, so that’s important.
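For readers who have not seen the fit/transform convention, here is a small sketch of what a transformer in that style can look like. This is not the Driverless AI recipe interface; the class name, the column names, and the median fill rule are assumptions chosen only to show the pattern of learning something in `fit` and applying it in `transform`.

```python
from statistics import median


class RatioTransformer:
    """Hypothetical domain feature: numerator / denominator, e.g. debt-to-income."""

    def __init__(self, numerator, denominator):
        self.numerator = numerator
        self.denominator = denominator
        self.fill_value_ = None  # learned in fit()

    def fit(self, rows):
        """Learn a fallback value from rows where the ratio is defined."""
        ratios = [
            r[self.numerator] / r[self.denominator]
            for r in rows
            if r[self.denominator]
        ]
        self.fill_value_ = median(ratios)
        return self

    def transform(self, rows):
        """Compute the ratio, using the learned fallback when dividing by zero."""
        return [
            r[self.numerator] / r[self.denominator]
            if r[self.denominator]
            else self.fill_value_
            for r in rows
        ]

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)


rows = [
    {"debt": 50, "income": 100},
    {"debt": 30, "income": 0},
    {"debt": 20, "income": 80},
]
ratios = RatioTransformer("debt", "income").fit_transform(rows)
print(ratios)
```

Keeping the learned state in `fit` means the same fallback is reused at scoring time, which is the main reason this convention works well inside an automated pipeline.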
Along with that, look for platforms that give you a whole community of opensource recipes. Often times you’ll find that you’ll want to use recipes that are already prebuilt, so find a platform that has a prebuilt community, because then you can reuse, and repurpose, and there’s a lot of collaboration that happens, and things get better. So, instead of just picking a platform that does all the work themselves…
Obviously no company can have so many people, the community’s obviously larger. And especially if you can take advantage of the Kaggle community, which has some phenomenal scripts and recipes, can you bring them into your platform? The final topic I want to touch on very quickly, in only a few minutes, is explainability. So trust, transparency, and explainability, they’re critical. Why do they matter? So, obviously AutoML gets pushed into the black box realm a little bit, because you’re letting the machine make all the choices.
The parameters, the features, the tuning. And it’s all algorithms, so everything is being done by the machine. So, you essentially have a bigger black box now, and you have to figure out what’s the trade off between interpretability and performance. So, there’s obviously a lot of techniques, like using surrogate models for example, or even getting approximate explanations. Those are important. Can your platform show you those? Look for those as you’re making decisions over here.
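The surrogate model idea is to fit a simple, readable model to the predictions of the opaque one and then check how faithfully it mimics them. Here is a minimal sketch in plain Python; the `black_box` function is a made-up stand-in for a complex model, and a one-feature linear fit is used as the surrogate purely for illustration.

```python
from statistics import mean


def black_box(x):
    """Stand-in for an opaque model's prediction function (hypothetical)."""
    return 3.0 * x + (0.5 if x > 5 else 0.0)


def fit_linear_surrogate(xs, preds):
    """Least-squares line fitted to the black box's predictions, not to labels."""
    x_bar, p_bar = mean(xs), mean(preds)
    slope = sum((x - x_bar) * (p - p_bar) for x, p in zip(xs, preds)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, p_bar - slope * x_bar


def r_squared(xs, preds, slope, intercept):
    """Fidelity: share of the black box's variation the surrogate reproduces."""
    p_bar = mean(preds)
    ss_res = sum((p - (intercept + slope * x)) ** 2 for x, p in zip(xs, preds))
    ss_tot = sum((p - p_bar) ** 2 for p in preds)
    return 1 - ss_res / ss_tot


xs = list(range(11))
preds = [black_box(x) for x in xs]
slope, intercept = fit_linear_surrogate(xs, preds)
fidelity = r_squared(xs, preds, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} fidelity={fidelity:.5f}")
```

The explanation is only as trustworthy as the fidelity score, so a low value is a signal that the simple story does not match what the complex model is doing.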
Similarly, there’s obviously a multiplicity of good models. Again it goes back to the no free lunch theorem. If there’s a lot of different models, which ones do you pick? So, for the same objective metrics, can you pick a model which is simpler to understand, and more interpretable, but at the same time gives you performance which is similar to, or close enough to, the best model out there? That’s important.
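That selection rule, take the simplest model whose score is close enough to the best, is easy to write down. The sketch below assumes each candidate comes with a validation score (higher is better) and some complexity number; the model names, scores, complexity values, and the 0.01 tolerance are all hypothetical.

```python
def pick_interpretable(models, tolerance=0.01):
    """Return the least complex model whose score is within tolerance of the best.

    models: list of (name, validation_score, complexity) tuples.
    """
    best_score = max(score for _, score, _ in models)
    close_enough = [m for m in models if best_score - m[1] <= tolerance]
    return min(close_enough, key=lambda m: m[2])


# Hypothetical leaderboard; complexity could be a parameter or feature count.
candidates = [
    ("stacked_ensemble", 0.912, 100),
    ("gbm", 0.908, 50),
    ("glm", 0.905, 5),
    ("shallow_tree", 0.850, 2),
]

chosen = pick_interpretable(candidates)
print(chosen[0])
```

How much performance you are willing to give up for interpretability is a business decision, so the tolerance is the part a human still has to set.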
Fairness and social aspects, these are becoming more and more critical. Especially as ML models are becoming very prominent in things like credit scoring for example, or even healthcare use cases, it’s critical to make sure the model doesn’t bring human bias in, or discriminate on the basis of gender, age, ethnicity, et cetera.
So, how do you do that? You want techniques that can help identify disparate impact, and help remediate it. So, find tools that can help you do that, and see if your platform supports that. Trust is of course critical. You want a whole bunch of model debugging tools. Think about ways to debug the models. There’s obviously the classic techniques, data mining techniques like PDP, how it’s really, really important to extrapolate, et cetera.
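A common first check for the disparate impact mentioned above is to compare favorable-outcome rates between groups. The sketch below computes that ratio and flags it against the widely used four-fifths rule of thumb; the decisions, the group labels, and the 0.8 cutoff as applied here are illustrative assumptions, and a real review would need far more than this one number.

```python
def selection_rate(decisions):
    """Share of favorable decisions (1 = approved, 0 = denied)."""
    return sum(decisions) / len(decisions)


def disparate_impact_ratio(decisions, groups, protected, reference):
    """Favorable-outcome rate of the protected group divided by the reference group's."""
    protected_rows = [d for d, g in zip(decisions, groups) if g == protected]
    reference_rows = [d for d, g in zip(decisions, groups) if g == reference]
    return selection_rate(protected_rows) / selection_rate(reference_rows)


# Hypothetical model decisions for eight applicants in two groups.
decisions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

ratio = disparate_impact_ratio(decisions, groups, protected="b", reference="a")
flagged = ratio < 0.8  # four-fifths rule of thumb
print(f"ratio={ratio:.3f} flagged={flagged}")
```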
But, going beyond that, what are the techniques that are available to help debug them, and understand how well it’s performing in the real world? So, giving that level of granularity on each individual prediction is important. Finally, the last pieces are security and hacking. This is a very, very new topic. And I highly recommend watching a webinar by Patrick Hall, who has a full webinar on this topic, and there’s a blog as well.
But, can you use these techniques to understand if your model is vulnerable in certain regions of the space? Can your model be hacked by some influences which are designed to fool the model? So, how do you do that? Using different techniques to identify those things, and solve all those. Building adversarial models, adversarial data sets, to actually find those weak spots for your model. So, think about those considerations as well.
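A very simple way to look for weak spots is to nudge one input at a time and see how small a change flips the decision. This sketch is a toy version of that probing idea; the scoring rule in `model`, the feature names, and the step sizes are all invented, and real adversarial testing uses far more systematic searches.

```python
def model(features):
    """Hypothetical scoring model: approve when the weighted score reaches 0.5."""
    score = 0.6 * features["income"] + 0.4 * features["tenure"]
    return 1 if score >= 0.5 else 0


def find_flip(features, name, step=0.01, max_steps=100):
    """Return the smallest tested change to one feature that flips the prediction."""
    original = model(features)
    for i in range(1, max_steps + 1):
        for delta in (i * step, -i * step):
            probe = {**features, name: features[name] + delta}
            if model(probe) != original:
                return delta
    return None  # no flip found within the tested range


applicant = {"income": 0.50, "tenure": 0.45}
delta = find_flip(applicant, "income")
print(delta)
```

Rows where a tiny change flips the outcome are the regions worth a closer look, whether the cause is a fragile model or someone gaming the inputs.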
Especially when you’re using AutoML, because you don’t know what actually went into it, you may not know fully. And then regulatory and controlled environments, this becomes very critical. This is something that we tackle all the time. And some of the largest banks, healthcare companies, insurance companies use our models in production, so there are legal requirements, to be able to explain every single prediction.
For example, if you deny credit to someone, you have to explain exactly why that happened, why the decision was made. Similarly, if you are going to make a decision on insurance payout, or denying the claim, you’ve got to explain as well. So, fairness and bias reduction are important considerations as well. So, think about all those things when you’re making a decision on what platform to buy.
Ask these questions of your vendor. And obviously we at H2O are very focused on these things, that’s why we are talking about this. But, we truly believe that this is important for the space. So, as an enterprise, and often times we’re hearing from our customers, who ask us these questions, we tell them the same thing.
With that, I think we are at the end of the deck. I know we ran out of time. We probably don’t have any time for Q and A, but I recommend posting the questions. We’ll try to get back with answers on those. I want to thank Bojan here for jumping in last minute for this webinar.
He gave a really wonderful overview of the space of AutoML. We hope that this was fruitful and productive, and informative. The presentation team basically wants to thank everyone who joined us today, and we’ll send the slides and recording over shortly. Thank you.