March 6th, 2019

Machine Learning with H2O – the Benefits of VMware

Category: Cloud, Community, H2O Driverless AI

This blog was originally posted by Justin Murray of VMware and can be accessed here.

 

This brief article introduces a short 4.5-minute video that explains why VMware vSphere is a great base operating platform for data scientists and data engineers. The video then demonstrates an example of this, showing a data scientist conducting a modeling experiment on an input data set, using the Driverless AI tool from H2O.ai to do the data analysis and model training, all in VMs. The key idea here is that the world of machine learning and data science is changing rapidly, with new, powerful tools, platforms and versions appearing and being upgraded at a very fast pace. The tool vendors are racing to innovate and are producing new workbenches for both the expert and the novice in the field.

Data scientists and data engineers (who organize and cleanse the data first) want to be able to try out these new tools and updated versions of the tools while keeping a stable environment for their existing production deployments. The end goal is to produce a highly accurate trained ML model as quickly as they can for any input data set and predicted outcome. One measure of accuracy you will see in use at the end of the tool demo is the ROC (Receiver Operating Characteristic) curve, which plots the true positive rate against the false positive rate as the decision threshold varies, but other measures of model accuracy are also available. That trained ML model will subsequently be used in production applications for the inference phase. The example model chosen here is XGBoost, a popular algorithm for certain kinds of data. The ML inference phase (or production deployment of the model in a pipeline) is often concerned with classification of something, such as a fraudulent transaction, or with prediction of what may happen in the future, such as the likelihood that someone will not pay their credit card bill or will choose a related book or movie. The ML practitioners (data scientists and data engineers) want to use the best tools and platforms they can get their hands on in order to build the best trained model, one that will be able to recognize these patterns.
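To make the ROC measure a little more concrete, here is a minimal sketch in plain Python of how a ROC curve and the area under it (AUC) can be computed from true labels and model scores. This is not how Driverless AI computes it internally; the function names `roc_curve_points` and `auc` and the sample labels and scores are made up purely for illustration.

```python
def roc_curve_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    labels: 1 for the positive class, 0 for the negative class.
    scores: model scores, where higher means "more likely positive".
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    # Lower the decision threshold step by step, from the highest score down.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Rows with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under the curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Illustrative, made-up labels and scores (not taken from the demo).
example_labels = [1, 1, 0, 1, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
example_auc = auc(roc_curve_points(example_labels, example_scores))
print(f"AUC = {example_auc:.3f}")
```

An AUC of 1.0 means the model ranks every positive example above every negative one, while 0.5 is no better than random ordering.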

This rapid change of tooling places a significant demand on an IT department, which has to keep up with the innovation and satisfy its customers, the data scientists and data engineers, by giving them what they want while maintaining some control. Deploying on VMware vSphere gives the IT department the ability to create different sandboxes for the data scientists to work in, each contained in one or more virtual machines. This provides isolation, checkpointing and the ability for the data scientist to innovate in a safe environment.

While many well-known examples of machine learning focus on image recognition and classification, this particular H2O tool is used in our demo to analyze tabular data, which in this example is contained in a CSV file. This data is representative of many thousands of datasets found in enterprises, such as database tables, spreadsheets and regular human-readable files. A term that is used frequently here is “independent and identically distributed” data, or IID. Strictly speaking, IID is a statistical assumption (that each row is drawn independently from the same underlying distribution) rather than a description of layout, but it is commonly used as shorthand for this kind of tabular data. Tabular data is structured into rows and columns (which is very different from the layout of pixels in an image), so the ML models that best analyze IID/tabular data may well be different from those that deal with images. Financial institutions, insurance companies, retail operations and dozens of other enterprises have lots of this “tabular” data, so there is a big opportunity here for machine learning to be applied to this type of dataset to enhance these enterprises’ business understanding of their customers. These types of data may also be somewhat sensitive and so will likely be modeled in-house for the foreseeable future.
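As a rough illustration of what “rows and columns” means in practice, the sketch below uses only the Python standard library to read a small CSV and separate the input columns from the column to be predicted. The column names, the values and the helper `load_tabular` are all hypothetical; the dataset used in the video is not reproduced here.

```python
import csv
import io

# Made-up sample standing in for a CSV file; the columns are hypothetical.
SAMPLE_CSV = """credit_limit,months_late,defaulted
5000,0,0
12000,2,1
3000,1,0
"""


def load_tabular(text, target_column):
    """Split CSV text into feature rows (dicts) and a list of target values."""
    reader = csv.DictReader(io.StringIO(text))
    features, targets = [], []
    for row in reader:
        # The target is the outcome the model should learn to predict.
        targets.append(int(row.pop(target_column)))
        # Every remaining column is treated as a numeric input feature.
        features.append({name: float(value) for name, value in row.items()})
    return features, targets


features, targets = load_tabular(SAMPLE_CSV, "defaulted")
print(len(features), "rows,", len(features[0]), "feature columns")
```

Each row is one example and each column is one attribute, which is the shape that tools for tabular data expect as input.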

 

About the Author

Vinod Iyengar, VP of Products

Vinod is VP of Products at H2O.ai. He leads all product marketing efforts, new product development and integrations with partners. Vinod comes with over 10 years of Marketing & Data Science experience in multiple startups. He was the founding employee for his previous startup, Activehours (Earnin), where he helped build the product and bootstrap the user acquisition with growth hacking. He has worked to grow the user base for his companies from almost nothing to millions of customers. He’s built models to score leads, reduce churn, increase conversion, prevent fraud and many more use cases. He brings a strong analytical side and a metrics driven approach to marketing. When he is not busy hacking, Vinod loves painting and reading. He is a huge foodie and will eat anything that doesn’t crawl, swim or move.
