Jon Farland, Senior Solutions Engineer and Data Scientist at H2O.ai, walks through machine learning concepts and worked-out examples with the H2O AI Cloud.
Hi, and welcome to this video on machine learning concepts as well as some worked-out examples using the H2O AI Cloud. My name is Jon Farland. I'm a Senior Solutions Engineer and data scientist here at H2O.ai, and I spend a lot of my time focused on using machine learning to solve a wide range of problems across industries, ranging from simply getting a model going and quickly into production, to tackling those very complex problems that machine learning really excels at solving.
Today we'll cover three topics. The first is an introduction to some fundamental machine learning concepts and the common problems machine learning is often applied to. We'll also take a look at an overview of the H2O platform by looking at both our open source origins and our end-to-end machine learning solution for the enterprise. And last but not least, we'll get a demonstration of that end-to-end solution for the enterprise with one of our key AI engines, called Driverless AI.
Let's focus on the machine learning concepts.
What is machine learning? Given its recent attention in research and industry, machine learning has been referred to in many ways. For example, Tony Tether, the former director of DARPA, has called machine learning the next internet. Sebastian Thrun, founder of Google X and Google's self-driving car team and a professor at Stanford, defined machine learning as the science of getting computers to learn without being explicitly programmed. For our purposes, we can define machine learning as the process of feeding data to algorithms that learn the best mathematical rules in order to make accurate predictions from new data. ML is also a subset of the wider field of artificial intelligence, or AI.
A visual representation of a typical machine learning process breaks it apart into two stages. In the first stage, we take historical data, typically with an outcome we're interested in predicting, and feed it into one or more ML algorithms. At this point, a data scientist is interested in understanding how well the model fits the data, but also how well it is able to generalize to data it hasn't seen before. Techniques such as k-fold cross-validation, as well as splitting the historical data into training and testing partitions, are key to measuring and validating model performance. Tools like open source H2O-3 AutoML and Driverless AI are highly automated in fitting ML models and generally provide high-performance models in a very short amount of time.
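To make the idea of training and testing partitions and k-fold cross-validation a bit more concrete, here is a minimal sketch in plain Python. The function names and the 80/20 ratio are just illustrative assumptions; tools like H2O-3 and Driverless AI handle this for you.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows and split them into training and testing partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold_indices(n_rows, k=5, seed=42):
    """Yield (train_idx, valid_idx) pairs; every row is held out exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, valid_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, valid_idx

# Stand-in for 100 rows of historical data.
rows = list(range(100))
train, test = train_test_split(rows)
folds = list(k_fold_indices(len(train), k=5))
```

The test partition is set aside until the very end, while the folds are used to estimate how well a model generalizes during training.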
In the second stage, we can place our fitted machine learning model in a software system to start making predictions based on live data. Another term for this would be to deploy our model to either a development or production environment.
Now that we've seen a high-level visualization of the machine learning workflow, we can take a step back and take a look at the life cycle of a full data science project. The first step in any data science project is, well, to gather data. Of course, this isn't always the most glamorous job and often takes the most time, but it's arguably the most important, as it sets the foundation for model performance and ultimately a successful outcome. It's important to consider enforcing strong requirements on data quality and completeness. The age-old adage of garbage in, garbage out definitely applies here.
The second step in our project might be to undergo feature engineering. Variables that are useful in predicting our target variable are sometimes called features, and engineering the right set of features is often what can separate a good machine learning model from a great one.
It's worth understanding how certain algorithms work with certain features and data types. For example, some algorithms have no issue with missing values and can account for them inherently, while others require features to be in a specific, standardized input. Example features might be the day of week or hour of the day we get from a timestamp column, or mathematical operations, such as taking the natural log or square root of a predictor variable.
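As a small illustration of those example features, here is a sketch that derives the day of week, the hour of day, and log and square-root transforms from a single record. The record layout and the column names `timestamp` and `amount` are made up for this example.

```python
import math
from datetime import datetime

def engineer_features(record):
    """Derive calendar and transformed numeric features from one raw record."""
    ts = datetime.fromisoformat(record["timestamp"])
    return {
        "day_of_week": ts.weekday(),                 # Monday = 0 ... Sunday = 6
        "hour_of_day": ts.hour,
        "log_amount": math.log(record["amount"]),    # natural log; needs amount > 0
        "sqrt_amount": math.sqrt(record["amount"]),  # needs amount >= 0
    }

# Hypothetical record: one transaction with a timestamp and a dollar amount.
features = engineer_features({"timestamp": "2022-03-14T09:30:00", "amount": 100.0})
```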
The third step is training our model. It's useful to understand which families of models have worked well for similar problems in the past. In order to prevent overfitting, a key component of machine learning is to tune the models, and specifically their hyperparameters. It's also typically quite useful to combine or blend multiple well-performing models into what's called an ensemble.
The fourth step is to evaluate our model. This task often includes the involvement of project stakeholders so that there's alignment on how success is measured. And once a measure of success is agreed upon techniques such as cross validation can be used to measure performance on data that the model hasn't seen before. This gives us an indication of how the model might perform in a real world forecasting scenario.
The fifth and last step is to deploy your model. In this step your final machine learning model is typically placed in a hosted environment and used to make live predictions. As this model is used over time, it's important to monitor performance. If we see a degradation in predictive performance, then we might also want to test for possible drift between the original data the model was trained on and the data we're using to make predictions.
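One way to test for drift, sketched here under my own assumptions, is the population stability index (PSI), which compares how one numeric feature is distributed in the training data versus the live data. The talk doesn't name a specific drift test, so treat PSI, the equal-width bins, and the function name as illustrative choices rather than what any H2O tool does internally.

```python
import math

def population_stability_index(train_values, live_values, n_bins=10):
    """Compare two samples of one numeric feature using equal-width bins."""
    lo, hi = min(train_values), max(train_values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the feature is constant

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so live values outside the training range land in the end bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    expected, actual = bin_shares(train_values), bin_shares(live_values)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

train_values = [i / 10 for i in range(100)]
same = population_stability_index(train_values, train_values)
shifted = population_stability_index(train_values, [v + 5 for v in train_values])
```

A value of zero means the two samples fall into the bins in identical proportions, and the value grows as the live data moves away from the training data.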
While this process diagram might appear relatively linear in nature, data science is often an iterative process that has several feedback cycles and is continuously improving. There's always another project to improve or build upon our model.
So what kind of problems can we solve during a data science project? The three most common problems that machine learning models are used for are supervised learning, unsupervised learning and reinforcement learning.
As its name implies, supervised learning problems are defined by having a response or target variable that effectively supervises the way the algorithm learns. Many ML algorithms aim at optimizing an objective or loss function that's expressed in terms of the target variable itself. An example of this might be forecasting the future demand for a particular product at your company.
Unsupervised learning problems, on the other hand, don't have the luxury of a defined target variable to guide the algorithm's trajectory. In this situation, the algorithm's goal is to take the data as input and to identify the underlying patterns or structure within the data itself. An example of this might be to cluster customers based on their observed buying habits. And finally, reinforcement learning is neither truly supervised nor unsupervised, and aims to develop strategies as it experiences various rewards from its interactions with a simulated environment. An example of this might be to teach a robot to walk or to play chess. As a clarifying note, there is a certain degree of overlap between these problems and terms like computer vision, deep learning, and transfer learning. While they may be referred to separately, they're often considered subsets or special classes of the problems that we've described here.
Let's take a look at supervised learning. A more concrete example of a supervised learning problem might be the goal of forecasting our expected sales dollars as a function of how much we spend on various advertising channels, such as TV versus radio. With historical data, we can fit a regression model and make predictions about what our future sales might look like depending on our advertising spend. With any supervised regression problem, the target variable is continuous in nature.
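Here is a minimal sketch of that advertising example with a single channel, fit by ordinary least squares. The spend and sales figures are made up so that the arithmetic is easy to follow; a real problem would have several channels and noisy data.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up numbers: TV advertising spend and resulting sales, both in $1,000s.
tv_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
sales = [25.0, 45.0, 65.0, 85.0, 105.0]  # constructed as 5 + 2 * spend

intercept, slope = fit_simple_regression(tv_spend, sales)
forecast = intercept + slope * 60.0  # predicted sales at a new spend level
```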
There are many algorithms that perform well here, the simplest of which might be a generalized linear regression model. As a next effort, if we see a nonlinear relationship in our data, we may want to try an approach with fewer assumptions on the specific functional form of our data, and so we might want to try a generalized additive model. Moving down the list, we see a gradient boosting machine algorithm. This algorithm leverages the technique of boosting to learn from its errors in previous iterations, thus making it a very powerful predictive framework and oftentimes resulting in state-of-the-art performance. In contrast to boosting, a random forest model takes advantage of a technique called bootstrap aggregation, or bagging, which provides predictions by averaging across many small decision trees, each fit to random subsets of the data. This bagging approach is a statistical technique that often results in a random forest's superior ability to generalize to data that it hasn't seen before.
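To show what "learning from its errors in previous iterations" means, here is a toy sketch of boosting for a regression problem with one predictor. Each round fits a one-split tree (a stump) to the residuals left by the rounds before it. This is a simplified illustration under my own assumptions, not the GBM implementation in H2O-3, and it doesn't cover bagging.

```python
def fit_stump(x, residuals):
    """Pick the single threshold on x that best splits the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the errors left by the previous rounds."""
    baseline = sum(y) / len(y)
    predictions = [baseline] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            pi + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, pi in zip(x, predictions)
        ]
    return baseline, stumps, predictions

def mse(y, predictions):
    """Mean squared error between actual and predicted values."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)

x = [float(i) for i in range(20)]
y = [xi ** 2 for xi in x]  # a deliberately nonlinear relationship
baseline, stumps, boosted = boost(x, y)
```

Comparing `mse(y, boosted)` with the error of simply predicting the mean shows how much the sequence of small corrections has improved the fit on the training data.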
Finally, artificial neural networks are another very powerful class of machine learning models. While they have been around for quite some time, they have seen a resurgence in the past few decades with research advances and the advent of deep learning models. Typical error metrics for a regression problem might be the root mean squared error or mean absolute error, which are expressed in the same units as the target variable, or the mean absolute percent error, which is expressed as a percentage.
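These three error metrics are short enough to write out directly. The following is a plain Python sketch with made-up actual and predicted values.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the units of the target variable."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error, in the units of the target variable."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percent error; undefined when any actual value is zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
scores = {
    "rmse": rmse(actual, predicted),
    "mae": mae(actual, predicted),
    "mape": mape(actual, predicted),
}
```

Because RMSE squares the errors before averaging, it penalizes the one larger miss more heavily than MAE does.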
If our target variable is categorical in nature, then the ML problem is called a classification problem. In a classification problem, there are typically two or more classes that can be identified and predicted across the data. These problems are typically relevant when you'd like to predict which class a data point belongs to, or otherwise seek a propensity score for that class.
In the graphic here, we can see simulated data drawn from three different distributions. Our algorithm would seek to define the best decision boundaries between each of these three classes. Many, if not all, of the algorithms from a regression context can be used in a classification problem as well, as long as it's set up correctly. For example, you'd want to use a binomial distribution for a GLM in the context of a classification problem, as opposed to the Gaussian distribution from a regression context. While many of the same regression algorithms can be applied to a classification problem, the appropriate error metrics for a classification problem are very different. Measuring the performance of a classification model often involves tabulating successful and failed predictions into what's called the confusion matrix, as well as using common error metrics such as AUC, F1 score, and accuracy.
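For a two-class problem, the confusion matrix and two of those metrics can be sketched as follows. The labels are made up, and AUC is left out because it needs predicted probabilities rather than hard class labels.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn

def accuracy_and_f1(actual, predicted, positive=1):
    """Accuracy and F1 score derived from the confusion matrix counts."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(actual, predicted)
accuracy, f1 = accuracy_and_f1(actual, predicted)
```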
Thus far, we've seen both regression and classification types of supervised problems. But here we have an example of an unsupervised learning problem. As we said before, an unsupervised problem is characterized by not having any target variable to guide the algorithm's convergence. A common use case for unsupervised learning is clustering our data. Here we see the same data points clustered into k equals two, three, and four different clusters using the k-means clustering algorithm. Other use cases involve anomaly detection, where the goal is to identify outliers or otherwise anomalous observations, and dimensionality reduction by identifying combinations of features across the data.
An isolation forest algorithm might be useful for anomaly detection, while an autoencoder employs a deep learning framework and can effectively reduce the dimensionality of a dataset by modeling combinations of its features.
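As a rough sketch of the k-means clustering idea, here is the basic loop in plain Python: assign every point to its nearest centroid, then move each centroid to the mean of its points. The points and the starting centroids are made up, and the starting centroids are passed in explicitly to keep the example deterministic; real implementations choose them more carefully.

```python
def k_means(points, centroids, n_iterations=10):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute."""
    for _ in range(n_iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
```

Notice that no target variable appears anywhere: the grouping comes purely from the distances between the points.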
So now that we've taken a solid look at machine learning fundamentals and the type of problems we can solve, let's take a look at the tools H2O.ai provides to solve a wide range of machine learning problems.
First of all, what exactly is H2O?
H2O.ai is an industry-leading artificial intelligence company that has evolved from widely adopted open source software and a vision to democratize AI. Since 2012, its advanced and open source tools have led H2O to be used at over half of the top Fortune 500 organizations and over 20,000 organizations worldwide. The open source library built by H2O is called H2O-3, and while written in high-performance Java code, it has flexible APIs for both R and Python, as well as a fully functional web GUI called Flow. The library enables data scientists and machine learning practitioners to leverage advanced in-memory and parallelized algorithms that are useful in many ML problems. Everything from importing a data file to running automatic machine learning is designed to be fast and scalable.
If we take a deeper look at what that means, we can see that H2O-3 provides support for a wide range of machine learning problems. Some key features include data preprocessing such as imputation or encodings, auto-tuning and early stopping, several cross-validation techniques, as well as variable importance and performance metrics. A key component of the H2O-3 library is to be able to provide model explainability and interpretability. For example, H2O-3 provides visuals such as partial dependence plots, and diagnostic tools such as Shapley values and residual analysis.
Looking past H2O-3, we see that H2O has also built an extensible framework of other open source technologies for the ML ecosystem. From making your own ML app in Wave, to decreasing time to value with an AutoML model, to integrating with other powerful open source technologies like Apache Spark, H2O's open source software provides a rich ecosystem of tools for any data scientist, regardless of skill level.
As an example, we can see how H2O can be initialized in both the R and Python APIs. Here we fit an AutoML model, limiting the algorithm runtime to 10 minutes, and we review the resulting leaderboard.
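The code shown on screen isn't captured in this transcript, so the following is only a sketch of what the Python version might look like, not the presenter's actual code. It assumes the `h2o` package and a Java runtime are installed; the file path, the target column, the 80/20 split, and the seed are hypothetical placeholders.

```python
def run_automl(csv_path, target, max_runtime_secs=600):
    """Fit H2O AutoML for up to max_runtime_secs (10 minutes) and return the leaderboard."""
    # Imported here so this file still loads where h2o is not installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()                         # start or connect to a local H2O-3 cluster
    frame = h2o.import_file(csv_path)  # parse the file into an H2OFrame
    train, test = frame.split_frame(ratios=[0.8], seed=42)

    aml = H2OAutoML(max_runtime_secs=max_runtime_secs, seed=42)
    aml.train(y=target, training_frame=train, leaderboard_frame=test)
    return aml.leaderboard             # models ranked by the default metric
```

A call such as `run_automl("sales.csv", "sales")`, with a hypothetical file and column name, would return the leaderboard as an H2OFrame.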
Now, at this point you might be asking yourself, "So if that's H2O-3, what exactly is the H2O AI Cloud you speak of?" In addition to open source software, H2O has developed a state-of-the-art machine learning platform for the enterprise called H2O AI Cloud.
AI Cloud is built with three principles in mind: make, operate, and innovate.
This platform is designed to provide organizations with the agility needed to extensively experiment with their data and finally put AI to work. In terms of these three principles, the AI Cloud is designed to: one, simplify the making of models you can trust without having to sacrifice accuracy, scale, performance, or transparency; two, decentralize model operations and accelerate the deployment of applications regardless of the environment; and three, support innovation with a large marketplace environment that maps AI solutions to potential users and use cases.
We can further break these three actions into their key components. Like any machine learning problem, we start with a purpose or an idea in mind. When we go to make with the AI Cloud, we are generating feature transformations, designing machine learning experiments, visualizing model explanations, and ultimately making apps for stakeholders to fully leverage their own machine learning models. When it comes time to put that model into production, the AI Cloud has a powerful MLOps tool to host models, expose endpoints for extremely fast scoring, and ultimately track and monitor model performance. In addition to this end-to-end framework, AI Cloud has its own app store to use in your experiments and to share your own ML apps with those who can truly benefit from them. Underpinning all this is a flexible architecture that is fully scalable, secure, and containerized.
So let's try to put all the pieces together and apply a bit more structure with the following platform visual. In addition to the platform's UI itself, users can engage with and collaborate across the platform in many different ways. These include Wave apps and Jupyter notebooks, for example. Within the platform sit several advanced AI engines, including open source H2O-3, of course; Driverless AI for automatic machine learning across time series, natural language processing, and tabular problems; Hydrogen Torch for text, image, and video use cases; and finally Document AI for document use cases. In addition to these AI engines is H2O's MLOps tool, which can fully handle hosting and managing models, even those which are not H2O models, such as PyTorch, scikit-learn, MLflow, etc.
A relatively new addition to the AI Cloud is the feature store, enabling data science teams to share and leverage powerful features in each other's ML use cases. From an architecture standpoint, containers are orchestrated across a cluster with a powerful open source technology called Kubernetes. Very recently, H2O.ai announced a fully managed version of this AI Cloud, where H2O handles setting everything up and manages the cloud infrastructure for you. What this means is that setting up an advanced AI cloud platform, which used to take weeks to months, now takes just a few hours.
Okay, so let's see this tool in action.
So here we have H2O AI Cloud.
The landing page for H2O AI Cloud is the App Store. Here users can explore and launch Wave apps that can range from expansive dashboards to data engineering tools to custom solutions that leverage and sit on top of H2O's state-of-the-art machine learning tools. These Wave apps are available to users regardless of their intended use case. For example, a data scientist primarily using Driverless AI to generate accurate sales forecasts also has access to something like the H2O customer churn app to additionally predict when a particular customer may churn. This ecosystem not only facilitates collaboration across use cases and teams, but also fosters innovation and customization of advanced analytical solutions for virtually any use case.
And let's take a look at how to launch one of those AI engines. At the top here we see my AI engines. I'll click on it.
This is Enterprise Steam. Enterprise Steam is our provisioning tool to launch compute instances, so that you can experiment with your data in both H2O-3 open source clusters as well as Driverless AI clusters. For now we'll just take a look at Driverless AI.
Here we see that I already have a Driverless AI instance running, but, just for educational purposes, if we click on Launch Instance, we can see that each instance is fully configurable. You can name your instance, and you can choose which version of the software you're using. You can also configure the compute capacity. For example, you can configure the number of CPUs or GPUs available to the instance, and you can scale up memory or storage if you're dealing with very large datasets. And, maybe most importantly, you can also configure the maximum idle time and maximum uptime. Both of these enable you to make sure that you're getting the most efficient use of your instance. For example, you don't want your instance running if it's not really doing anything, and these configurations enable you to automatically shut it down after, for example, two hours if it hasn't done any significant work.
For now, we'll take a look at the instance I already have running.
The homepage for Driverless AI is the datasets overview. This tab allows us to import, download, and manage the data within our Driverless AI instance.
Clicking on Add Dataset (we can also drag and drop as well), we see that getting data into Driverless AI is very easy. We can either upload a file from our computer or enable the provided data connectors in order to access cloud storage such as AWS S3, Google BigQuery, or Azure Blob Storage, or access additional external data sources such as Snowflake, Hive, or a SQL database. Users can even upload custom data recipes to further customize their datasets within Driverless AI.
For now, we can see that I've already uploaded a dataset called CLV. We can see that it is 8000 rows with 24 columns, and the total size is about two megabytes. If we click on this, we get some options, one of which is to look at some details.
Once a dataset is in Driverless AI, we can also get a handle on its schema and properties. In this view, users can view summary statistics of their data, visualize the distributions of each column, as well as check and, if necessary, update the schema that Driverless AI inferred when importing the dataset. So you can see here it says auto detect, but I can go in here and change it to whatever I would require: numerical, text, date, etc. The dataset that we're using for today's demo focuses on predicting a customer's lifetime value from the perspective of an auto insurance provider. The dataset contains demographic data such as education and gender, financial information such as income and employment status, which is over here, as well as insurance policy information such as policy type and claims amount.
From this view, it's clear to see we have a mix of both categorical and numerical columns in our data.
But I think our focus now should be on the response column. We're going to focus on a supervised regression problem, and our response column here is going to be called customer lifetime value. We can see that we get the mean, min, max, and standard deviation, we can see the number of unique values, and we can see that it's a real-valued number. Notice we get a nice kind of distribution up here of what that data looks like.
We can also modify our dataset right here by developing a custom recipe through live code. By clicking on Modify by Recipe, we can see that we can upload a data recipe that's already been built to modify our dataset, or we can use live code to write a little bit of Python code here to modify the dataset as we see fit. Before applying it to the dataset, we can get a preview.
But when we do apply it, it will not overwrite the existing dataset that it came from; it'll simply create a new one. So there's no risk of losing your data here. It's simply a pipeline mechanism to modify and create new data.
We can also leverage powerful automatic exploratory data analysis, or EDA, tools from the datasets tab by again clicking on the dataset and selecting Visualize. Driverless AI will automatically generate an array of visualizations to explore the dataset through the AutoViz tool. This tool was designed by Dr. Leland Wilkinson and is based on his seminal work, The Grammar of Graphics. Some folks might be familiar with the R implementation called ggplot.
In the top left we see a series of graphs that represent recommended transformations. If we click on that, we can see that one of the suggestions here is to take the inverse transformation of monthly premium auto, a column that exists in our data, as well as a log transformation of total claim amount, another column that exists in our data. If we'd like to use these transformations in our experiment, we have the option to do so in the top right. We can also click on Help, which gives us a detailed explanation and documentation for exactly how these recommendations were produced and how the visualization was rendered.
By focusing on categorical columns, AutoViz will also provide distributional box plots that might be relevant for the problem at hand.
So in this case, we're seeing policy on the x axis and customer lifetime value on the y axis we get a distribution of customer lifetime value broken out by policy.
We can also see total claim amount broken out by coverage type: premium, extended, or basic.
And finally, we can also look at total claim amount broken out by gender.
Additional visualizations include outlier detection plots, correlation graphs, and heat maps.
Again, once within a particular visualization, clicking on Help provides more detail on the specifics of what's being plotted here.
Lastly, custom visualizations are supported here as well. We can go up and click Add Graph, and we can pick the type of graph that we'd like to look at. For example, I might just want to look at a scatterplot. For right now, maybe I'm interested in income on the x axis and customer lifetime value on the y axis. We can also pick either point or square for how we visualize it. And we can see that we get a new visualization that plots customer lifetime value against income.
Now, one last feature on the datasets tab to focus on before diving into modeling is the ability to split our data. So again, if we click on CLV and we say Split, we can split our data into training and testing partitions. We might name one train and the other test. We can pick the ratio by which to split our data; a common one would be 80% training and 20% holdout, so we can just put 0.8 in there.
Additionally, if there are more advanced use cases, if you're dealing with a time series problem, you can elect to identify the Time column here, and you will be doing time series cross validation. Or if you have a classification problem, especially one that's particularly imbalanced, you might want to make sure that the training and testing data have an equal representation of that response column. So you can call that out here.
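For the imbalanced classification case, the idea is usually called a stratified split: each class is split separately so that both partitions keep the original class balance. Here is a minimal sketch of that idea in plain Python; it isn't how Driverless AI implements its splitter, and the row layout and function name are made up.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, train_fraction=0.8, seed=42):
    """Split each class separately so both partitions keep the class balance."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# Hypothetical imbalanced data: (row id, label) with 10% positives.
rows = [(i, 1) for i in range(10)] + [(i, 0) for i in range(10, 100)]
train, test = stratified_split(rows, label_of=lambda row: row[1])
```

With a purely random split, a small class could by chance end up almost entirely in one partition, which is what this approach avoids.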
As you can see, I've already split our CLV data into training and testing. We see about 80% of the rows in train and about 20% in test, give or take. So now let's get into some modeling.
If we click on our train dataset, we also have the option to select predict.
Again, for this example, we're going to be focusing on a supervised regression problem. Here the goal is to be able to predict the lifetime value of a customer based on their attributes and other data. The first step in setting up a supervised learning experiment is identifying the target column.
So we click on Target Column and select customer lifetime value.
Once that target column is established, Driverless AI already starts learning about the problem we're trying to solve here. By selecting the customer lifetime value column, it's learned that we're trying to solve a regression problem down here, and has provided some default settings for the experiment based on the information it's already gathered. It's also provided an initial plan of attack, so to speak, on the left-hand side. Let's take a deeper look at these default settings and the subsequent modeling plan.
Generally speaking, a higher value on the accuracy dial will lead to a more precise model. Higher values of time allow Driverless AI to continue working on the problem until it reaches convergence. And finally, interpretability generally controls the search space for families of models; a higher value will lead to more interpretable models. The default values that Driverless AI uses tend to work quite well in many situations and are based on years of experience from some of the world's top Kaggle Grandmaster data scientists. While these default values tend to work quite well, a key principle behind Driverless AI is the user's ability to fully configure their modeling efforts. By clicking on Expert Settings, the user can fully customize how Driverless AI runs the experiment.
Users can even bring their own modeling recipes to Driverless AI in the form of Python code snippets, and there are a large number of publicly available recipes hosted on GitHub. What this means is that if you'd like to additionally include a popular algorithm like CatBoost in the space of models that Driverless AI explores, or write your own algorithm, Driverless AI will use it to optimize its predictive performance. In addition to using your own model recipes, Driverless AI allows users to bring their own objective functions or scorers. Many common error metrics are provided as well. Here we see mean absolute error and mean absolute percent error, for example; our default is root mean squared error.
Sometimes a problem might call for a particular measure of success. Driverless will then optimize model training based on your own custom objective function. For now we'll choose a default of RMSE.
We can also drop particular columns that we want to exclude from our modeling, identify a column to be used in custom cross-validation, or even identify a column to be used as weights in our experiment. Finally, let's make sure to provide the dataset used for model evaluation and give our experiment a name. So here I'll pick our test dataset, and the name will be John's model.
Let's click Launch and see what happens.
Now, as the experiment progresses, we will see our performance with each iteration in the bottom left. As time goes on, this metric will be improving more and more.
Immediately, we start to get notifications that alert us to possible issues with our problem and the way that we've configured it, or possible issues with the data itself. In this case, we see a possible shift detected in the training dataset for a particular feature, customer.
It lets us know that it's already dropped this feature.
So we have our first initial model here, and on the left we see our initial results in terms of the scorer that we selected, root mean squared error, for the validation data.
In the bottom middle column, we start to see variable importance measures for the current best model. This helps us to understand what the key drivers are for predicting our response variable. In the bottom right, we can also see a breakdown of performance for the current best model. Since it's a regression problem, we are mostly interested in understanding any patterns in our errors or residuals here, as well as seeing how our predicted values might match up to their actual data points. A high-performance forecasting model here should have its actual and predicted values track closely with this diagonal line.
In the top right, we can check any notifications from Driverless AI and interim results from the experiment, as well as track our CPU and memory resources.
Finally, in the top middle pane, we can track our experiment's overall progress, as well as check our experiment's relative use of CPU or memory.
Let's let this run for a while and check out an experiment that's already completed.
So I've already completed a supervised regression problem using that train dataset.
If we take a look at this, we can see that the experiment has completed.
When an experiment is completed, the top middle pane provides a set of actions we can engage with to understand the results of the experiment. There are many actions the user can take to interact with and understand their model better, but we'll just take a look at a few key ones. From this screen, we can click on Visualize Scoring Pipeline. This is something that I like to do a lot. It gives us a visualization of the best model that Driverless AI ultimately came up with. We can see the process from the feature engineering to the modeling. In this case, a stacked ensemble of several models was used to eventually output our predicted customer lifetime value. We can also dig in a little more by clicking on one of these models. Within each one of these models is its own modeling pipeline, and within this one, we can look at the model ultimately used: in this case, an XGBoost decision tree model.
Notably, we can also download predictions for both the training and test datasets. When we click on this, we're given the option to pick which columns we'd also like to include in our output.
And lastly, I'd like to draw your attention to the Download AutoDoc button. I've found this to be really useful. When clicking this, you're going to download a Word document, a very thorough Word document that really breaks down every single thing that the experiment did: the data it saw, its methodology, its performance on cross-validation as well as on the testing dataset if you provided one, and the results of feature engineering.
The idea here is that when you run an experiment, you need to be able to explain it to project stakeholders. So this is where that model evaluation phase comes in. And this is really, really important to provide that documentation to show exactly what happened during the process.
And with that, that's a light touch of how to use H2O AI for supervised regression problems.