March 6th, 2019

Machine Learning with H2O – the Benefits of VMware

Category: Cloud, Community, H2O Driverless AI

This blog was originally posted by Justin Murray of VMware and can be accessed here.

 

This brief article introduces a short 4.5-minute video that explains why VMware vSphere is a great platform for data scientists and data engineers to use as their base operating platform. The video then demonstrates an example: a data scientist conducts a modeling experiment on an input data set, using the Driverless AI tool from H2O.ai for the data analysis and model training, all in VMs. The key idea is that the world of machine learning and data science is changing rapidly, with new, powerful tools, platforms and versions appearing and being upgraded at a very fast pace. Tool vendors are racing to innovate and are producing new workbenches for both the expert and the novice in the field.

Data scientists and data engineers (who organize and cleanse the data first) want to try out these new tools, and updated versions of them, while keeping a stable environment for their existing production deployments. The end goal is to produce a highly accurate trained ML model as quickly as they can for any input data set and predicted outcome. One measure of accuracy you will see in use at the end of the tool demo is the ROC (Receiver Operating Characteristic) curve, but other measures of model accuracy are also available. That trained ML model will subsequently be used in production applications for the inference phase. The example model chosen here is XGBoost, a popular algorithm for certain kinds of data. The ML inference phase (the production deployment of the model in a pipeline) is often concerned with classification, such as flagging a fraudulent transaction, or with prediction of what may happen in the future, such as the likelihood that someone will not pay their credit card bill or will choose a related book or movie. The ML practitioners (data scientists and data engineers) want to use the best tools and platforms they can get their hands on in order to build the best trained model for recognizing these patterns.
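To make the ROC measure a little more concrete, here is a minimal sketch in plain Python of how a ROC curve and the area under it (AUC) can be computed from a model's scores. It is not how Driverless AI computes the metric internally; the function names and the tiny set of labels and scores are made up for illustration.

```python
def roc_curve_points(labels, scores):
    """Return (false positive rate, true positive rate) points,
    sweeping the decision threshold from the highest score down."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    # Rank examples from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Illustrative data only: 1 = fraudulent, 0 = legitimate.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_curve_points(labels, scores)
print(f"AUC = {auc(points):.3f}")  # prints "AUC = 0.889"
```

An AUC of 1.0 means the model ranks every positive example above every negative one, while 0.5 is what random guessing would achieve on average.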

This rapid change of tooling places a significant demand on an IT department, which has to keep up with the innovation and satisfy its customers, the data scientists and data engineers, by giving them what they want while maintaining some control. Deploying on VMware vSphere lets the IT department create separate sandboxes for the data scientists to work in, each contained in one or more virtual machines. This provides isolation, checkpointing and the ability for the data scientist to innovate in a safe environment.

While many well-known examples of machine learning focus on image recognition and classification, this particular H2O tool is used in our demo to analyze tabular data, which happens to be contained in a CSV file in this example. This data is representative of many thousands of datasets found in enterprises, such as database tables, spreadsheets and regular human-readable files. A term that is used frequently here is “independent and identically distributed” data, or IID, meaning that each row is treated as an independent sample drawn from the same underlying distribution. This kind of data is structured into rows and columns (which is very different from the layout of pixels in an image), so the ML models that best analyze IID/tabular data may well be different from the models that deal with images. Financial institutions, insurance companies, retail operations and dozens of other enterprises have lots of this “tabular” data, so there is a big opportunity to apply machine learning to this type of dataset and enhance these enterprises’ understanding of their customers. These types of data may also be somewhat sensitive and so will likely be modeled in-house for the foreseeable future.
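As a rough illustration of what “rows and columns” means for a modeling tool, the sketch below reads a small CSV table with Python's standard `csv` module and separates the column to be predicted from the remaining feature columns. The column names and values are invented for this example and are not the dataset used in the video.

```python
import csv
import io

# Hypothetical stand-in for a CSV file on disk; the columns are illustrative.
CSV_TEXT = """customer_id,balance,late_payments,defaulted
1001,2500.00,0,0
1002,310.50,3,1
1003,12000.00,1,0
"""


def load_table(text, target_column):
    """Split CSV rows into feature dicts and a list of integer targets."""
    features, targets = [], []
    for row in csv.DictReader(io.StringIO(text)):
        # Remove the target from the row; what is left are the features.
        targets.append(int(row.pop(target_column)))
        features.append(row)
    return features, targets


features, targets = load_table(CSV_TEXT, "defaulted")
print(len(features), "rows; feature columns:", list(features[0]))
print("targets:", targets)
```

Each row is one sample and each remaining column is one feature, which is the shape that tabular learners such as XGBoost expect once the values have been converted to numbers.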

 

About the Author

Vinod Iyengar, VP of Products

Vinod is VP of Products at H2O.ai. He leads all product marketing efforts, new product development and integrations with partners. Vinod comes with over 10 years of marketing and data science experience in multiple startups. He was the founding employee of his previous startup, Activehours (Earnin), where he helped build the product and bootstrap user acquisition with growth hacking. He has worked to grow the user base for his companies from almost nothing to millions of customers. He has built models to score leads, reduce churn, increase conversion, prevent fraud and address many more use cases. He brings a strong analytical side and a metrics-driven approach to marketing. When he is not busy hacking, Vinod loves painting and reading. He is a huge foodie and will eat anything that doesn’t crawl, swim or move.
