Machine Learning with Sparkling Water: H2O + Spark

January 2023: Fifth Edition

Section	Title	Page
1	What is H2O?	6
2	Sparkling Water Introduction	8
2.1	Typical Use Cases	8
2.1.1	Model Building	8
2.1.2	Data Munging	9
2.1.3	Stream Processing	9
2.2	Features	11
2.3	Supported Data Sources	11
2.4	Supported Data Formats	11
2.5	Supported Spark Execution Environments	12
2.6	Sparkling Water Clients	12
2.7	Sparkling Water Requirements	13
3	Design	14
3.1	Data Sharing between H2O and Spark	15
3.2	H2OContext	15
4	Starting Sparkling Water	17
4.1	Setting Up The Environment	17
4.2	Starting Interactive Shell with Sparkling Water	17
4.3	Starting Sparkling Water with Internal Backend	18
4.4	External Backend	19
4.4.1	Automatic Mode of External Backend	19
4.4.2	Manual Mode of External Backend on Hadoop	21
4.4.3	Manual Mode of External Backend without Hadoop (standalone)	22
4.5	Memory Management	24
5	Data Manipulation	26
5.1	Creating H2O Frames	26
5.1.1	Convert from RDD, DataFrame or Dataset	26
5.1.2	Creating H2OFrame from an Existing Key	27
5.1.3	Create H2O Frame Directly	27
5.2	Converting H2O Frames to Spark Entities	28
5.2.1	Convert to RDD	28
5.2.2	Convert to DataFrame	28
5.3	Mapping between H2OFrame and DataFrame Types	29
5.4	Mapping between H2OFrame and RDD[T] Types	30
5.5	Using Spark Data Sources with H2OFrame	30
5.5.1	Reading from H2OFrame	30
5.5.2	Saving to H2OFrame	31
5.5.3	Specifying Saving Mode	32
6	Calling H2O Algorithms	33
7	Productionizing MOJOs from H2O-3	37
7.1	Loading the H2O-3 MOJOs	37
7.2	Exporting the loaded MOJO model using Sparkling Water	41
7.3	Importing the previously exported MOJO model from Sparkling Water	41
7.4	Accessing additional prediction details	41
7.5	Customizing the MOJO Settings	41
7.6	Methods available on MOJO Model	42
7.6.1	Obtaining Domain Values	42
7.6.2	Obtaining Model Category	42
7.6.3	Obtaining Feature Types	42
7.6.4	Obtaining Feature Importances	43
7.6.5	Obtaining Scoring History	43
7.6.6	Obtaining Training Params	43
7.6.7	Obtaining Metrics	43
7.6.8	Obtaining Leaf Node Assignments	44
7.6.9	Obtaining Stage Probabilities	44
8	Productionizing MOJOs from Driverless AI	44
8.1	Requirements	45
8.2	Loading and Scoring the MOJO	45
8.3	Predictions Format	48
8.4	Customizing the MOJO Settings	48
8.5	Troubleshooting	49
9	Deployment	50
9.1	Referencing Sparkling Water	50
9.1.1	Using Assembly Jar	50
9.1.2	Using PySparkling Zip	51
9.1.3	Using the Spark Package	51
9.2	Target Deployment Environments	52
9.2.1	Local Cluster	52
9.2.2	On a Standalone Cluster	52
9.2.3	On a YARN Cluster	53
9.3	Databricks Cloud	53
9.3.1	Creating a Cluster	54
9.3.2	Running Sparkling Water	54
9.3.3	Running PySparkling	55
9.3.4	Running RSparkling	56
10	Running Sparkling Water in Kubernetes	57
10.1	Internal Backend	57
10.1.1	Scala	58
10.1.2	Python	60
10.1.3	R	62
10.2	Manual Mode of External Backend	63
10.2.1	Scala	63
10.2.2	Python	66
10.2.3	R	68
10.3	Automatic Mode of External Backend	70
10.3.1	Scala	70
10.3.2	Python	72
10.3.3	R	75
11	Sparkling Water Configuration Properties	77
11.1	Configuration Properties Independent of Selected Backend	77
11.2	Internal Backend Configuration Properties	83
11.3	External Backend Configuration Properties	85
12	Building a Standalone Application	88
13	A Use Case Example	90
13.1	Predicting Arrival Delay in Minutes – Regression	90
14	FAQ	93
15	References	98

