March 6th, 2019

Machine Learning with H2O – the Benefits of VMware

Category: Cloud, Community, H2O Driverless AI

This blog was originally posted by Justin Murray of VMware and can be accessed here.


This brief article introduces a short 4.5-minute video that explains why VMware vSphere is a great base operating platform for data scientists and data engineers. The video then demonstrates an example, showing a data scientist conducting a modeling experiment on an input data set, using the Driverless AI tool from H2O.ai for the data analysis and model training, all in VMs. The key idea is that the world of machine learning and data science is changing rapidly, with new, powerful tools, platforms and versions appearing and being upgraded at a very fast pace. The tool vendors are racing to innovate and are producing new workbenches for both the expert and the novice in the field.

Data scientists and data engineers (who organize and cleanse the data first) want to be able to try out these new tools, and updated versions of them, while keeping a stable environment for their existing production deployments. The end goal is to produce a highly accurate trained ML model as quickly as they can for any input data set and predicted outcome. One measure of model quality you will see in use at the end of the tool demo is the ROC (Receiver Operating Characteristic) curve, which plots the true positive rate against the false positive rate as the decision threshold varies, but other measures of model accuracy are also available. That trained ML model will subsequently be used in production applications for the inference phase. The example model chosen here is XGBoost, a popular gradient boosting algorithm for certain kinds of data. The ML inference phase (or production deployment of the model in a pipeline) is often concerned with classification of something, such as a fraudulent transaction, or with prediction of what may happen in the future, such as the likelihood that someone will not pay their credit card bill or will choose a related book or movie. The ML practitioners (data scientists and data engineers) want to use the best tools and platforms they can get their hands on in order to build the best trained model for recognizing these patterns.
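To make the ROC idea concrete, here is a minimal sketch in plain Python of how ROC points and the area under the curve (AUC) can be computed from a model's predicted scores. It is not how Driverless AI or XGBoost computes the metric internally; the function names `roc_curve_points` and `auc`, and the sample labels and scores, are made up for illustration.

```python
def roc_curve_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    labels: 1 for the positive class, 0 for the negative class.
    scores: predicted scores, where higher means "more likely positive".
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Sweep the decision threshold from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Rows with the same score cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical example: 1 = fraudulent transaction, 0 = legitimate.
example_labels = [1, 1, 0, 1, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
example_auc = auc(roc_curve_points(example_labels, example_scores))
print(round(example_auc, 4))
```

An AUC of 1.0 means the model ranks every positive example above every negative one, while 0.5 is what random scoring would give.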

This rapid change of tooling places a significant demand on an IT department, which must keep up with the innovation and satisfy its customers, the data scientists and data engineers, by giving them what they want while maintaining some control. Deploying on VMware vSphere helps here: it gives IT the ability to create different sandboxes for the data scientists to work in, each contained in one or more virtual machines. This provides isolation, checkpointing and the ability for the data scientist to innovate in a safe environment.

While many well-known examples of machine learning focus on image recognition and classification problems, this particular H2O tool is being used in our demo to analyze tabular data, which happens to be contained in a CSV file in this example. This data is representative of many thousands of datasets that are found in enterprises, such as in database tables, spreadsheets and regular human-readable files. A term used frequently here is “independent and identically distributed” (IID) data, meaning that each row is treated as an independent sample drawn from the same distribution. This kind of data is structured into rows and columns (which is very different from the layout of pixels in an image), so the ML models that best analyze IID/tabular data may well differ from those that deal with images. Financial institutions, insurance companies, retail operations and dozens of other enterprises have lots of this “tabular” data, so there is a big opportunity to apply machine learning to this type of dataset and enhance these enterprises’ understanding of their customers. These types of data may also be somewhat sensitive and so will likely be modeled in-house for the foreseeable future.
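As a rough illustration of what "rows and columns" means in practice, the sketch below reads a small CSV with Python's standard `csv` module and splits it into feature columns and a target column. The column names (`balance`, `late_payments`, `defaulted`), the sample values and the `load_table` helper are all hypothetical; the dataset used in the demo is not shown in this article.

```python
import csv
import io

# Hypothetical tabular data: one row per customer, one column per attribute.
SAMPLE_CSV = """balance,late_payments,defaulted
2500.50,0,0
310.00,3,1
9800.75,1,0
"""


def load_table(text, target_column):
    """Split CSV text into feature names, feature rows and target values."""
    reader = csv.DictReader(io.StringIO(text))
    feature_names = [name for name in reader.fieldnames if name != target_column]
    features, targets = [], []
    for row in reader:
        # Each row is one sample; each named column is one feature.
        features.append([float(row[name]) for name in feature_names])
        targets.append(int(row[target_column]))
    return feature_names, features, targets


feature_names, X, y = load_table(SAMPLE_CSV, "defaulted")
print(feature_names)
print(X)
print(y)
```

Each row here is one sample and each column one attribute, which is the shape of input that tabular modeling tools typically expect.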


About the Author

Vinod Iyengar, VP of Products

Vinod is VP of Products at H2O.ai. He leads all product marketing efforts, new product development and integrations with partners. Vinod comes with over 10 years of Marketing & Data Science experience in multiple startups. He was the founding employee for his previous startup, Activehours (Earnin), where he helped build the product and bootstrap the user acquisition with growth hacking. He has worked to grow the user base for his companies from almost nothing to millions of customers. He’s built models to score leads, reduce churn, increase conversion, prevent fraud and many more use cases. He brings a strong analytical side and a metrics-driven approach to marketing. When he is not busy hacking, Vinod loves painting and reading. He is a huge foodie and will eat anything that doesn’t crawl, swim or move.
