Machine Learning with Sparkling Water: H2O + Spark
January 2023: Fifth Edition
Contents
Section | Title | Page |
---|---|---|
1 | What is H2O? | 6 |
2 | Sparkling Water Introduction | 8 |
2.1 | Typical Use Cases | 8 |
2.1.1 | Model Building | 8 |
2.1.2 | Data Munging | 9 |
2.1.3 | Stream Processing | 9 |
2.2 | Features | 11 |
2.3 | Supported Data Sources | 11 |
2.4 | Supported Data Formats | 11 |
2.5 | Supported Spark Execution Environments | 12 |
2.6 | Sparkling Water Clients | 12 |
2.7 | Sparkling Water Requirements | 13 |
3 | Design | 14 |
3.1 | Data Sharing between H2O and Spark | 15 |
3.2 | H2OContext | 15 |
4 | Starting Sparkling Water | 17 |
4.1 | Setting Up The Environment | 17 |
4.2 | Starting Interactive Shell with Sparkling Water | 17 |
4.3 | Starting Sparkling Water with Internal Backend | 18 |
4.4 | External Backend | 19 |
4.4.1 | Automatic Mode of External Backend | 19 |
4.4.2 | Manual Mode of External Backend on Hadoop | 21 |
4.4.3 | Manual Mode of External Backend without Hadoop (standalone) | 22 |
4.5 | Memory Management | 24 |
5 | Data Manipulation | 26 |
5.1 | Creating H2O Frames | 26 |
5.1.1 | Convert from RDD, DataFrame or Dataset | 26 |
5.1.2 | Creating H2OFrame from an Existing Key | 27 |
5.1.3 | Create H2O Frame Directly | 27 |
5.2 | Converting H2O Frames to Spark Entities | 28 |
5.2.1 | Convert to RDD | 28 |
5.2.2 | Convert to DataFrame | 28 |
5.3 | Mapping between H2OFrame and Data Frame Types | 29 |
5.4 | Mapping between H2OFrame and RDD[T] Types | 30 |
5.5 | Using Spark Data Sources with H2OFrame | 30 |
5.5.1 | Reading from H2OFrame | 30 |
5.5.2 | Saving to H2OFrame | 31 |
5.5.3 | Specifying Saving Mode | 32 |
6 | Calling H2O Algorithms | 33 |
7 | Productionizing MOJOs from H2O-3 | 37 |
7.1 | Loading the H2O-3 MOJOs | 37 |
7.2 | Exporting the loaded MOJO model using Sparkling Water | 41 |
7.3 | Importing the previously exported MOJO model from Sparkling Water | 41 |
7.4 | Accessing additional prediction details | 41 |
7.5 | Customizing the MOJO Settings | 41 |
7.6 | Methods available on MOJO Model | 42 |
7.6.1 | Obtaining Domain Values | 42 |
7.6.2 | Obtaining Model Category | 42 |
7.6.3 | Obtaining Feature Types | 42 |
7.6.4 | Obtaining Feature Importances | 43 |
7.6.5 | Obtaining Scoring History | 43 |
7.6.6 | Obtaining Training Params | 43 |
7.6.7 | Obtaining Metrics | 43 |
7.6.8 | Obtaining Leaf Node Assignments | 44 |
7.6.9 | Obtaining Stage Probabilities | 44 |
8 | Productionizing MOJOs from Driverless AI | 44 |
8.1 | Requirements | 45 |
8.2 | Loading and Scoring the MOJO | 45 |
8.3 | Predictions Format | 48 |
8.4 | Customizing the MOJO Settings | 48 |
8.5 | Troubleshooting | 49 |
9 | Deployment | 50 |
9.1 | Referencing Sparkling Water | 50 |
9.1.1 | Using Assembly Jar | 50 |
9.1.2 | Using PySparkling Zip | 51 |
9.1.3 | Using the Spark Package | 51 |
9.2 | Target Deployment Environments | 52 |
9.2.1 | Local Cluster | 52 |
9.2.2 | On a Standalone Cluster | 52 |
9.2.3 | On a YARN Cluster | 53 |
9.3 | Databricks Cloud | 53 |
9.3.1 | Creating a Cluster | 54 |
9.3.2 | Running Sparkling Water | 54 |
9.3.3 | Running PySparkling | 55 |
9.3.4 | Running RSparkling | 56 |
10 | Running Sparkling Water in Kubernetes | 57 |
10.1 | Internal Backend | 57 |
10.1.1 | Scala | 58 |
10.1.2 | Python | 60 |
10.1.3 | R | 62 |
10.2 | Manual Mode of External Backend | 63 |
10.2.1 | Scala | 63 |
10.2.2 | Python | 66 |
10.2.3 | R | 68 |
10.3 | Automatic Mode of External Backend | 70 |
10.3.1 | Scala | 70 |
10.3.2 | Python | 72 |
10.3.3 | R | 75 |
11 | Sparkling Water Configuration Properties | 77 |
11.1 | Configuration Properties Independent of Selected Backend | 77 |
11.2 | Internal Backend Configuration Properties | 83 |
11.3 | External Backend Configuration Properties | 85 |
12 | Building a Standalone Application | 88 |
13 | A Use Case Example | 90 |
13.1 | Predicting Arrival Delay in Minutes – Regression | 90 |
14 | FAQ | 93 |
15 | References | 98 |