
H2O-3 Open Source, Distributed Machine Learning for Everyone at H2O World Sydney 2022





H2O is a fully open source, distributed, in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical and machine learning algorithms, including gradient boosted machines, generalized linear models, deep learning, and more. H2O also has industry-leading AutoML functionality that automatically runs through the algorithms and their hyperparameters to produce a leaderboard of the best models. The H2O platform is used by over 18,000 organizations globally and is extremely popular in both the R and Python communities.






Read the Full Transcript

Rohan Rao:


Hope everyone can hear me. Awesome. I'll be talking about H2O-3, which is our open source machine learning toolkit. Very quick agenda. I'll do a quick introduction, followed by what H2O-3 is. What are the different components? What does the entire platform look like? How is data stored in our H2O Frame, and what are the different models and algorithms that we support? All the algorithms come together in our AutoML tool. Finally, I'll conclude with some final thoughts and maybe leave a few minutes for a quick Q&A.


Quick introduction on me. I've been in machine learning for close to nine years now, writing and coding ML algorithms. My background is primarily in statistics, so I picked up coding and programming after coming out of college. I've spent an insane amount of time on Kaggle, probably more than on any other hobby that I've had. I'm also a quadruple Kaggle Grandmaster. In case you were wondering why the quadruple is there: Kaggle has four categories, so you can be a Grandmaster in any of the four. I'm happy that I put in the time and managed to do it in all four. I also do data science mentorship and a couple of advisory gigs. In general, I'm a coder for life, so anytime I have free time or don't know what to do, I'll sort of code something. And I love reading and writing as well.




What is H2O-3? It's open source, distributed, and fast. It's a scalable machine learning platform. Yes, open source means it's free. Everybody loves free stuff. Everything that is in H2O-3 is completely available and free to use. It's on our GitHub. There are a lot of different ways in which our users and the community have used it. I'll touch upon some of the most popular ones.


Why three? In fact, recently someone asked me this question, so I thought I'd just add a quick slide on this. Nothing fancy here. It's our third incarnation of H2O. The legacy H2O was the first version, and then we made a lot of improvements. Then came H2O-2, and we again took in a lot of feedback from the user community. Now we have H2O-3. H2O-3 has been running for close to eight years now, and it's, I would say, our most stable version. So far, all the improvements and features that we plan to add are all going to be part of H2O-3. I don't know if a four is coming or not.


I would say the number one feature of H2O-3 is that it's distributed. It works multi-node as well as multi-cluster, so it's very easy to scale any pipeline that you build with H2O. All the core algorithms are written in Java.


Every time I say Java, data scientists come to me and say that they code in Python and want something in Jupyter Notebooks. We do have APIs in Python. We have it in R. We also have it in Scala. If you don't like coding, or if you want to try out H2O-3 but don't want to get really hands-on, there's also a Web GUI, where you can drag, drop, and click buttons to build models or build your pipelines using H2O. Once you do things with H2O, the natural next step is that you have a pipeline or a model you want to put in production. Any model that you build is available as an artifact; it's all written in Java, so it's very easy to productionize and deploy.


I'll touch upon each of these parts in a little bit of detail. Obviously, you can run H2O in pretty much any environment, whether it's on Hadoop or Spark. You can even launch a server on a cloud platform. A lot of folks even run H2O just on their laptop, including me. The first few times I used H2O, it was just starting H2O on my MacBook Air, which has hardly four GB of RAM. It's still capable of really manipulating and building models on much larger data sets.


H2O Platform


The H2O platform. What's the generic flow of using H2O-3? It all starts with data, and you can import, upload, or stream data from a lot of different sources. These are some of the most popular ones, whether it's HDFS, local CSV files, or a SQL database. H2O has a lot of different ways in which you can bring data into the H2O-3 server. Then the actual compute: once you have data in H2O-3, you can run transformations, you can manipulate the data, you can pivot it, you can group it, you can slice it and dice it. Basically, any transformation or any kind of aggregation that you want to do on the data sets can be done from inside the H2O server. Now, once you have your data sets ready, you can obviously run a lot of different models and algorithms on them. Those algorithms are then available, and you can code against them in the different language APIs that we have. Even if you don't want to use, let's say, a language API, you can interact with H2O-3 directly using the REST API.


We also have integrations with Tableau and KNIME, which are analytics tools. If you have your data sets, for example, in Tableau, where you want to visualize, create graphs, or do your analysis, and then, let's say, you want to build models, you can just click and build a model.


All the models are available as artifacts. POJO and MOJO are the two popular formats in which you get the model outputs. You can deploy them across a lot of different environments. The most popular ones are Spark and Snowflake. We also have our own H2O MLOps, which our customers now use to deploy the models that they build. Even if you have, let's say, your own production environment set up, with MLflow running for experiment tracking or any other setup, it's easy, because the model artifacts can be downloaded and exported, and you can put them in your production environment. Finally, how do you scale this? Since it's distributed, which means it's multi-node and multi-core, there is no limit to it. You can just keep adding nodes or clusters to scale as per the size of your data or the resources that the modeling requires.


H2O Distributed Computing


Quick summary of how the distributed computing works. For some of you, if you know Spark, the setup is similar to the way Spark is set up. It has a multi-node cluster with a shared memory model, and all the computations happen in memory. Since the data is split across multiple nodes, everything is available as key-value pairs to access or point to, whether it's the models or the actual data. You can always scale it. The natural way to do it is to start small: you try out experiments, push in your data, and as you start hitting limits, you start incrementally scaling up.
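A toy sketch of the key-value idea described above: a column is split into chunks, each chunk lives on a (simulated) node under a key, and a computation runs per chunk before the partial results are combined. This is an illustration of the concept only, not H2O's actual implementation; all names here are made up.

```python
def split_into_chunks(values, n_nodes):
    """Assign contiguous chunks of a column to node keys in a shared store."""
    store = {}
    chunk_size = -(-len(values) // n_nodes)  # ceiling division
    for node in range(n_nodes):
        chunk = values[node * chunk_size:(node + 1) * chunk_size]
        store[("col_a", node)] = chunk       # key -> this node's chunk
    return store

def distributed_sum(store, column):
    """Each node sums its own chunk; the partial sums are then combined."""
    partials = [sum(chunk) for (name, _), chunk in store.items() if name == column]
    return sum(partials)

store = split_into_chunks(list(range(10)), n_nodes=3)
total = distributed_sum(store, "col_a")  # same answer as summing the whole column
```

The per-chunk-then-combine shape is what makes operations like sums, counts, and histograms cheap to scale by just adding nodes.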


H2O Frame


The data that's stored in H2O is in what we call the H2O Frame. The different data sources that we saw, when you input the data into the H2O server, are all structured as H2O Frames. Essentially, they are just sets of column vectors, which are distributed across nodes. Now, when you think about a Frame, a lot of Python users would be very comfortable working with pandas, and R users would be comfortable working with data.frame or even data.table. This structure is very similar; a lot of the things that you do with pandas or data.table, you can also do with the H2O Frame. All the usual processing steps are there, whether it's row operations such as filtering or sorting, or column operations such as grouping, pivoting, or just slicing and dicing in different forms and fashions. We also have native support for certain string operations. Two examples: Word2vec is natively available in H2O, as well as target encoding. These are typically used if you want to dig deep into advanced feature engineering, to try them out and see how they work with different models.
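To make the "sets of column vectors" idea concrete, here is a minimal column-oriented frame sketch with one row operation (filter) and one column operation (group-by sum). The real H2OFrame is distributed and far more capable; the data and function names below are invented for illustration.

```python
# A tiny column store: each column is its own vector, all the same length.
frame = {
    "city": ["Sydney", "Melbourne", "Sydney", "Perth"],
    "sales": [10, 20, 30, 40],
}

def filter_rows(frame, column, predicate):
    """Row operation: keep only rows where predicate(column value) is true."""
    keep = [i for i, v in enumerate(frame[column]) if predicate(v)]
    return {col: [vals[i] for i in keep] for col, vals in frame.items()}

def group_sum(frame, key, value):
    """Column operation: sum the `value` column grouped by the `key` column."""
    out = {}
    for k, v in zip(frame[key], frame[value]):
        out[k] = out.get(k, 0) + v
    return out

sydney = filter_rows(frame, "city", lambda c: c == "Sydney")
totals = group_sum(frame, "city", "sales")
```

Because every operation touches whole column vectors, the same API maps naturally onto columns that are chunked across nodes.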


H2O Algorithms (Supervised and Unsupervised)


Coming to the models, there is a wide range that we support today. The most basic ones are the classic linear models, whether it's GLMs, GAMs, SVMs, or Naive Bayes, the most standard, popular ones. For tree-based models, we have boosted ones. Some of you might have used scikit-learn's Random Forest, or even R's Random Forest. In H2O, it's a very similar implementation, but it runs in a distributed fashion, because every tree that you build in a random forest is independent of the other trees.


You can train MLP models, which are part of the deep learning framework. We also have a stacked ensemble. Typically, whenever you run models or run training on a data set, the assumption is that no single model is going to be perfect. Even if you look at a lot of the top Kaggle solutions or the winning models, they are typically ensembles of different models that capture different information and trends. We have an automated stacked ensemble that combines these individual models into a level-two stacked model. There are also some simpler RuleFit and Isotonic Regression models that can be tried. All of this is also packaged in an automated fashion in our H2O AutoML. I'll touch upon that after the next slide.
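The stacking idea above can be sketched in a few lines: several base models each make a prediction, and a level-two step combines them. In this toy version the base "models" are plain functions and the combiner is a fixed weighted average; H2O's Stacked Ensemble actually learns the combination (typically with a GLM) on cross-validated predictions, which this sketch does not do.

```python
def model_a(x):
    """A base model that systematically underestimates."""
    return 0.8 * x

def model_b(x):
    """A base model that systematically overestimates."""
    return 1.2 * x

def stacked_predict(x, weights=(0.5, 0.5)):
    """Level-two step: combine the base predictions with fixed weights."""
    preds = (model_a(x), model_b(x))
    return sum(w * p for w, p in zip(weights, preds))

y_hat = stacked_predict(10.0)  # the two base models' errors cancel out here
```

The point of the example is why ensembles help: models with different error patterns can offset each other, which is exactly the behavior a learned level-two model exploits.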


We also have a lot of unsupervised model support. Clustering is probably the most common use case, and there's K-means, which you can use for clustering. For dimensionality reduction, there's the Aggregator, which reduces the number of rows, and there's PCA, which reduces the number of columns. They are useful for reducing the size of your data. There's also GLRM; GLRM is a form of matrix factorization, which again condenses your feature space into a set of latent vectors.


Anomaly detection is useful for a lot of use cases where you want to, let's say, find outliers in the data. We have an Isolation Forest. We also have a newer version of Isolation Forest based on a new research paper that came out; it's called Extended Isolation Forest. There are some NLP transformations which you can also natively run on your data; TF-IDF and Word2vec are both readily available. All of these techniques are unsupervised because they don't really depend on a target variable. For any data set that you have, if you want to just understand it better, or, let's say, visualize clusters or pick out outliers, you can run these pipelines.


H2O AutoML


Let's talk a bit about AutoML. Firstly, why do we really need AutoML? So many of us as data scientists, and I'm sure some of you here would also understand this, work on different projects, or work on one data set to solve different problems, and over time you realize that there are a lot of steps that are very repetitive, and they can be easily optimized and automated. We package all of it together in H2O AutoML. Take the processing of data. Even something as simple as having, let's say, a categorical variable in your data: different models work differently. Some models, for example XGBoost, work really well when categorical variables are label encoded. But let's say you want to run an elastic net regression or a GLM; in those scenarios, your categorical variable works much better if it's one-hot encoded.
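The two encodings just mentioned can be sketched with the standard library. Tree models in the XGBoost family can work with integer label codes, while linear models such as a GLM usually prefer one 0/1 indicator column per category; the helper names below are made up for this sketch.

```python
def label_encode(values):
    """Map each category to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]

def one_hot_encode(values):
    """Turn each value into a 0/1 indicator vector over the sorted categories."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "blue", "red", "green"]
labels = label_encode(colors)   # [0, 1, 0, 2]
onehot = one_hot_encode(colors) # one indicator column per category
```

Choosing per-algorithm between encodings like these is exactly the kind of repetitive decision AutoML automates.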


Another example is that models like XGBoost handle missing values automatically, whereas for Random Forest or many of the linear models, you are required to impute the missing values before plugging the data into the algorithm. There are a lot of these pre-processing steps that are required for each of these different models, and H2O AutoML does them automatically.
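A sketch of the kind of imputation step mentioned above, for models that cannot accept missing values. Mean imputation is just one simple strategy among many; the function here is invented for illustration.

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)  # assumes at least one observed value
    return [mean if v is None else v for v in column]

filled = impute_mean([1.0, None, 3.0, None, 5.0])  # Nones become 3.0
```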


It trains a lot of different models. When you're given a problem statement and a data set, there's no hard and fast rule that for this data set this algorithm is going to work, or that for that one another model is going to work better. You would never know until you really try. H2O AutoML tries out as many different models as it can. It also builds a stacked ensemble combining the top few best models that it finds. Then there's tuning of hyperparameters. Each model, again, has a different set of hyperparameters, and in any model the hyperparameters can be tuned, and should be tuned, in order to squeeze the maximum performance from the model. AutoML does that internally as well. The final models that you get from AutoML are again available as artifacts, which you can download and put into production. Finally, we also have explainability of models, which, time and again, comes up as a very important factor in deciding which model should go into production, whether it's for compliance purposes or a business requirement. It also helps in debugging errors or truly understanding, when a prediction is made, how the model came to that particular prediction.


I'm going to show you a couple of examples. These are very simple snippets of code showing how you would be running H2O in Python or R. Installing is super simple: in R, you just install the h2o package, and in Python, you can just do a pip install. You import the library, and you initialize the H2O cluster, the H2O server; h2o.init does that. You import your data sets. In this example, import_file imports a local train CSV file. You define your AutoML. This is an example where AutoML is given 600 seconds to run; you can define how long you want your AutoML to run. Obviously, the longer the time period you give, the more things AutoML can try, and typically the better the models are.
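The steps narrated above look roughly like the following in Python. This is a hedged sketch, not the exact slide code: it assumes `pip install h2o` plus a local Java runtime, and the file path and target column name are placeholders. The calls shown (`h2o.init`, `h2o.import_file`, `H2OAutoML`, `aml.train`, `aml.leaderboard`) are the standard h2o Python API.

```python
# Sketch of the AutoML workflow from the talk. Guarded import so the sketch
# can be read even where the h2o package (and its Java backend) is absent.
try:
    import h2o
    from h2o.automl import H2OAutoML
except ImportError:
    h2o, H2OAutoML = None, None  # h2o not installed; the calls below show the intent

def run_automl(path="train.csv", target="target", budget_secs=600):
    """Start H2O, load a CSV, run AutoML for a fixed time budget."""
    h2o.init()                     # start (or connect to) a local H2O server
    train = h2o.import_file(path)  # load the CSV into an H2OFrame
    aml = H2OAutoML(max_runtime_secs=budget_secs)  # e.g. 600 seconds to run
    aml.train(y=target, training_frame=train)      # x defaults to all other columns
    return aml.leaderboard         # ranked models with validation metrics
```

The returned leaderboard is exactly the ranked table of models discussed next.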


Finally, what you get is this nice leaderboard. You get something like this, which shows you the different models that were tried and, for each model, the validation metrics that were computed. This gives you a nice view of the different models; you can compare them just by looking at this. Also, for some models certain metrics might look better, and for others, other metrics might. You can choose, based on the leaderboard, which model you would like to finally use or productionize.


Coming to explainability. Once your AutoML is trained, you can just run a one-line command, model.explain, and it will give you a wide range of explainability graphs and outputs. These are just some examples to give you a flavor of what you can expect. For example, you have here the variable importance heatmap. There's also a correlation heatmap that gets generated. The one on the top right is the PDP, the partial dependence plot. In this example, it takes the top feature, which is light, and it shows you the average model output for different values of light. The one on the bottom left is the SHAP values, and the one on the bottom right is the permutation variable importance. These are all different ways of explaining the model. Different companies, different projects, and different people will use different components to explain or define how a particular model is performing.
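The partial dependence idea described above ("average model output for different values of light") can be computed in a few lines: force one feature to each value on a grid, average the model's output over the data, and plot the resulting curve. The model and data below are made up purely to illustrate the mechanic.

```python
def model(row):
    """A stand-in model whose output depends on two features."""
    return 2.0 * row["light"] + 0.5 * row["humidity"]

def partial_dependence(data, feature, grid):
    """For each grid value, average model output with `feature` forced to it."""
    curve = []
    for value in grid:
        outputs = [model({**row, feature: value}) for row in data]
        curve.append(sum(outputs) / len(outputs))
    return curve

data = [{"light": 1.0, "humidity": 10.0}, {"light": 3.0, "humidity": 30.0}]
pdp = partial_dependence(data, "light", grid=[0.0, 1.0, 2.0])
```

The resulting curve shows how the prediction moves, on average, as the one feature changes while everything else stays at its observed values.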


These are examples of explainability at a global level, which is, overall, what this model does for this particular data set. There is also individual-level, or, as we sometimes call it, local explainability, which is when you run the model on a particular row of the data to get a prediction. These are a couple of examples of explanations for that particular row. The one on the left is the ICE plot, and the one on the right is the SHAP values for a particular chosen row. That's why, if you look at the SHAP plot on the right, it's really small, but it has the feature name as well as the feature value; it's exactly for that particular row. Remember, all of this is automatically generated. It's just a one-line command that you need to run.


H2O Resources


These are some of the resources if you want to learn and understand more. We have comprehensive documentation. Our entire GitHub is open source and available. There's a YouTube channel, which has a few nice videos. Stack Overflow has a lot of resources from people trying things out, facing issues, and getting them solved.


We recently also launched our new H2O community, so feel free to jump in there and discuss any questions or any projects that you're working on. I'm going to conclude here with one final thought. It's something I strongly believe in, and even personally, I've used it quite often. I've been on the other side many times, hearing folks talk about libraries or research papers or new pre-trained models or new approaches to solving something.


I'm going to sort of leave you with this. It's always great, when you want to understand or really learn more about any library or any package, to just try it out hands-on. There are always good things that people tell you, but when you really experience it, you get a true understanding of what the library and the platform have to offer. It's as easy as spinning it up on your laptop: one line to install the library, one line to import the package or module, and just a couple of lines to start using the code.


Q&A With Rohan


I'll stop here. I'll be happy to take any quick questions, if any. This is my Twitter handle. I'm super active on Twitter. I share a lot of educational content, a lot of tutorials, and a lot of examples of not just H2O-3 but also Driverless AI. Recently, I've been posting a lot of example apps of H2O Wave. H2O Wave is also one of our other open source offerings, for writing AI apps using a low-code library.


A couple of quick questions. Does feature engineering get bundled inside the MOJO too? If you're talking about feature engineering like the pre-processing that happens before the data goes into the model, yes, that is included. But anything that you do externally, for example, there are certain types of preprocessing that we do not support in MOJOs yet. They are a work in progress, and eventually we will try to get as many things inside the MOJO as possible. But if you run anything with AutoML today, I think 95% of the time the MOJO should have everything that you need.


Does H2O-3 work in federated settings, where data is stored in different locations? As long as the server can connect to the set of nodes, it does not matter where the nodes really are. I don't know if that answers the exact question being asked, but from my understanding, it should.


Regarding models in Tableau, where does the processing happen? Is it the Tableau Server or H2O Cloud? H2O-3 does not really have a cloud; it's basically the server. In the instance or environment you are in, everything runs right there.


How does AutoML handle overfitting? AutoML uses most of the common techniques for handling overfitting, whether it's regularization or looking at the validation scores over time. In fact, we have tested AutoML on a lot of benchmarks, including the ones in the AutoML paper that was published in 2018, which covers and touches upon most of these aspects.


Are models exported from H2O-3 locked behind a license? No, the H2O-3 models are free to use; they're open to use. That being said, if you really want to put this in production, we do offer enterprise support for H2O-3, and a lot of our customers prefer deploying their models along with the support that we offer.


That's it from me today. Thank you so much, and have a good day.