June 9th, 2021

H2O Integrates with Snowflake Snowpark/Java UDFs: How to better leverage the Snowflake Data Marketplace and deploy In-Database

Category: H2O AI Cloud, H2O Driverless AI, Partners, Snowflake, Technical

One of the goals of machine learning is to find predictive features that were not previously apparent, even to subject matter experts, and to use them to increase the accuracy of the model. Third-party datasets are a common source of such features.

A traditional way of doing this was to scrape and scour scattered, stale data sources on the web and cross your fingers that model validation would show some performance gain. An example would be scraping stock prices or GDP figures to see whether the economy had an effect on the outcome. A potentially better solution was Snowflake’s Data Marketplace, which made it easier to share up-to-date datasets, but the requirement to work in SQL made it cumbersome.

Today, we are excited to announce integrations with Snowpark and Java UDFs, Snowflake’s new developer experience. This opens up Snowflake’s Data Marketplace to data scientists and developers: unique 3rd party datasets that can help improve model accuracy and deliver better business outcomes can now be included in their usual notebook-based programming workflows.

Snowpark is a client-side library that pushes client-side logic into Snowflake. Java User Defined Functions (UDFs), containing code of your choice, run inside a Virtual Warehouse.

Snowpark makes the data in Snowflake available as a dataframe in Scala, and the Java UDF feature allows Java code to be executed as a user defined function (UDF) within the Snowflake environment. This means that once we have trained a model on our own data plus the 3rd party data we chose from the Snowflake Data Marketplace, we can import the model to where the data lives for an easier path to deployment. For the data scientist, this is an opportunity to keep using familiar H2O.ai tools and models while relying on the power and scaling of the Snowflake platform.

As an example, we will use the Snowflake Data Marketplace to access demographic data shared by a company, Infutor Data Solutions, and combine it with our existing dataset in Snowflake (LendingClub data used to predict loan defaults).

The first step is to create a table based on the Infutor dataset. In the same statement, we average the calculated property value (PROP_VALCALC) and group it by state, so that it can be joined to our data at the state level.

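val DataMarketPlaceDF = session.sql("CREATE OR REPLACE TABLE DataMarketPlacePropertyAverage AS SELECT STATE, ROUND(AVG(PROP_VALCALC), 0) AS PROP_VALCALC FROM Infutor_AV10424_TOTAL_DEMOGRAPHIC_PROFILES_100 GROUP BY STATE")
// session.sql is lazy: the statement only runs once an action such as collect() is called
DataMarketPlaceDF.collect()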

The aggregated marketplace data can then be used as a reference table. We can use the Snowpark API to load this reference table as a dataframe:

val PropertyAverageDF = session.table("DataMarketPlacePropertyAverage")


println(s"Number of rows created at State level: ${PropertyAverageDF.count()}")

Snowpark makes it simple to join the two tables:

// Create Dataframe with Lending Club Data

val LendingDF = session.table("LENDINGCLUB")


// Join Dataframes based on State

val enrichedDF =  LendingDF.join(PropertyAverageDF,
  LendingDF.col("ADDR_STATE") === PropertyAverageDF.col("STATE"))

The joined dataframe now carries, on each loan row, the average property value for that loan’s state. All of these operations run inside Snowflake, so they scale with the warehouse and complete quickly.
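To make the effect of the aggregation and the join concrete, here is a minimal in-memory sketch in plain Scala. It does not connect to Snowflake: the sample rows, the case classes, and the object name EnrichmentSketch are all hypothetical, and the sketch assumes the join above behaves as an inner join, so a loan whose state has no reference row is dropped.

```scala
// Plain-Scala illustration only: no Snowflake session, hypothetical sample rows.
object EnrichmentSketch {
  // Hypothetical stand-ins for rows of the Infutor table and the LENDINGCLUB table.
  final case class Property(state: String, propValCalc: Double)
  final case class Loan(id: Int, addrState: String)
  final case class EnrichedLoan(id: Int, addrState: String, avgPropValCalc: Long)

  val properties: Seq[Property] = Seq(
    Property("CA", 500000.0),
    Property("CA", 700000.0),
    Property("TX", 250000.0),
    Property("TX", 300000.0)
  )

  val loans: Seq[Loan] =
    Seq(Loan(1, "CA"), Loan(2, "TX"), Loan(3, "CA"), Loan(4, "WY"))

  // Mirrors SELECT STATE, ROUND(AVG(PROP_VALCALC), 0) ... GROUP BY STATE.
  def averageByState(rows: Seq[Property]): Map[String, Long] =
    rows.groupBy(_.state).map { case (state, group) =>
      state -> math.round(group.map(_.propValCalc).sum / group.size)
    }

  // Mirrors an inner join on ADDR_STATE = STATE: loans whose state has
  // no entry in the reference data do not appear in the result.
  def enrich(loans: Seq[Loan], averages: Map[String, Long]): Seq[EnrichedLoan] =
    loans.flatMap { loan =>
      averages.get(loan.addrState).map { avg =>
        EnrichedLoan(loan.id, loan.addrState, avg)
      }
    }

  def main(args: Array[String]): Unit =
    enrich(loans, averageByState(properties)).foreach(println)
}
```

In this sample, loan 4 (state WY) has no matching reference row and is absent from the output. If every loan must be kept, a left join would be the usual alternative, at the cost of missing values in the new column.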

The next step is to save this new dataframe as a table (LENDINGCLUBwithPropertyAverage, the name passed to the training call) and invoke Driverless AI training:

// Save new Dataframe to table and start training


println("Calling Training with new Table")

val training = session.sql("select H2OTrain('Build --Table=LENDINGCLUBwithPropertyAverage --Target=BAD_LOAN --Clone=Yes --Modelname=riskmodelPropertyAverage.mojo')")
// session.sql is lazy: an action such as show() is what actually runs the query
training.show()


This operation uses another Snowflake feature, External Functions, which lets SQL running in Snowflake call Driverless AI and hand it the table to train on. The result is a model built using the advanced feature engineering and modeling capabilities offered by Driverless AI.
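When the table, target column, or model name varies between experiments, the SQL text for the training call can be assembled by a small helper rather than edited by hand. The following is a sketch only: the object TrainingCallSketch and the function buildTrainingSql are hypothetical names, and the helper uses only the flags that appear in the call above (--Table, --Target, --Clone=Yes, --Modelname).

```scala
// Hypothetical helper, not part of any H2O.ai or Snowflake library.
object TrainingCallSketch {
  // Reject values that would break out of the single-quoted SQL string literal.
  private def checked(name: String, value: String): String = {
    require(value.nonEmpty, s"$name must not be empty")
    require(!value.contains("'"), s"$name must not contain a single quote")
    value
  }

  // Builds the same SQL text as the literal used in the article,
  // with the three varying values passed as parameters.
  def buildTrainingSql(table: String, target: String, modelName: String): String = {
    val command =
      s"Build --Table=${checked("table", table)} " +
        s"--Target=${checked("target", target)} " +
        "--Clone=Yes " +
        s"--Modelname=${checked("modelName", modelName)}"
    s"select H2OTrain('$command')"
  }

  def main(args: Array[String]): Unit =
    println(
      buildTrainingSql(
        "LENDINGCLUBwithPropertyAverage",
        "BAD_LOAN",
        "riskmodelPropertyAverage.mojo"
      )
    )
}
```

The returned string can be passed to session.sql(...) in place of the hard-coded literal. The quote check is a basic safeguard, since the values are spliced directly into SQL text.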

Once the model is trained, it is automatically deployed. It is available within the Snowflake environment, and we are now able to use the model directly in Snowflake using the Snowpark and Java UDF features.

// call the model using SQL until Java UDF is auto deployed, then use session.table()



The model executes and makes predictions in Snowflake. The key point is that the data does not leave the Snowflake environment, and scoring automatically scales with the size of the warehouse you’ve selected.

Conveniently, the Snowpark Scala and Java UDF code is all auto-generated for the data scientist.

Using the Infutor data, we were able to improve the model’s accuracy and score at scale within the Snowflake environment.


About the Authors

Eric Gudgion

Eric is a Senior Principal Solutions Architect at H2O.ai and works with customers on deploying machine learning models into production environments.


Miles Adkins

Miles is a Partner Sales Engineer at Snowflake and leads the joint technical go-to-market for Snowflake’s machine learning and data science partners.
