Machine Learning with Sparkling Water: H2O + Spark
January 2023: Fifth Edition
Contents
| Section | Title | Page |
|---|---|---|
| 1 | What is H2O? | 6 |
| 2 | Sparkling Water Introduction | 8 |
| 2.1 | Typical Use Cases | 8 |
| 2.1.1 | Model Building | 8 |
| 2.1.2 | Data Munging | 9 |
| 2.1.3 | Stream Processing | 9 |
| 2.2 | Features | 11 |
| 2.3 | Supported Data Sources | 11 |
| 2.4 | Supported Data Formats | 11 |
| 2.5 | Supported Spark Execution Environments | 12 |
| 2.6 | Sparkling Water Clients | 12 |
| 2.7 | Sparkling Water Requirements | 13 |
| 3 | Design | 14 |
| 3.1 | Data Sharing between H2O and Spark | 15 |
| 3.2 | H2OContext | 15 |
| 4 | Starting Sparkling Water | 17 |
| 4.1 | Setting Up the Environment | 17 |
| 4.2 | Starting Interactive Shell with Sparkling Water | 17 |
| 4.3 | Starting Sparkling Water with Internal Backend | 18 |
| 4.4 | External Backend | 19 |
| 4.4.1 | Automatic Mode of External Backend | 19 |
| 4.4.2 | Manual Mode of External Backend on Hadoop | 21 |
| 4.4.3 | Manual Mode of External Backend without Hadoop (standalone) | 22 |
| 4.5 | Memory Management | 24 |
| 5 | Data Manipulation | 26 |
| 5.1 | Creating H2O Frames | 26 |
| 5.1.1 | Convert from RDD, DataFrame or Dataset | 26 |
| 5.1.2 | Creating H2OFrame from an Existing Key | 27 |
| 5.1.3 | Create H2O Frame Directly | 27 |
| 5.2 | Converting H2O Frames to Spark entities | 28 |
| 5.2.1 | Convert to RDD | 28 |
| 5.2.2 | Convert to DataFrame | 28 |
| 5.3 | Mapping between H2OFrame and DataFrame Types | 29 |
| 5.4 | Mapping between H2OFrame and RDD[T] Types | 30 |
| 5.5 | Using Spark Data Sources with H2OFrame | 30 |
| 5.5.1 | Reading from H2OFrame | 30 |
| 5.5.2 | Saving to H2OFrame | 31 |
| 5.5.3 | Specifying Saving Mode | 32 |
| 6 | Calling H2O Algorithms | 33 |
| 7 | Productionizing MOJOs from H2O-3 | 37 |
| 7.1 | Loading the H2O-3 MOJOs | 37 |
| 7.2 | Exporting the loaded MOJO model using Sparkling Water | 41 |
| 7.3 | Importing the previously exported MOJO model from Sparkling Water | 41 |
| 7.4 | Accessing additional prediction details | 41 |
| 7.5 | Customizing the MOJO Settings | 41 |
| 7.6 | Methods available on MOJO Model | 42 |
| 7.6.1 | Obtaining Domain Values | 42 |
| 7.6.2 | Obtaining Model Category | 42 |
| 7.6.3 | Obtaining Feature Types | 42 |
| 7.6.4 | Obtaining Feature Importances | 43 |
| 7.6.5 | Obtaining Scoring History | 43 |
| 7.6.6 | Obtaining Training Params | 43 |
| 7.6.7 | Obtaining Metrics | 43 |
| 7.6.8 | Obtaining Leaf Node Assignments | 44 |
| 7.6.9 | Obtaining Stage Probabilities | 44 |
| 8 | Productionizing MOJOs from Driverless AI | 44 |
| 8.1 | Requirements | 45 |
| 8.2 | Loading and Scoring the MOJO | 45 |
| 8.3 | Predictions Format | 48 |
| 8.4 | Customizing the MOJO Settings | 48 |
| 8.5 | Troubleshooting | 49 |
| 9 | Deployment | 50 |
| 9.1 | Referencing Sparkling Water | 50 |
| 9.1.1 | Using Assembly Jar | 50 |
| 9.1.2 | Using PySparkling Zip | 51 |
| 9.1.3 | Using the Spark Package | 51 |
| 9.2 | Target Deployment Environments | 52 |
| 9.2.1 | Local cluster | 52 |
| 9.2.2 | On a Standalone Cluster | 52 |
| 9.2.3 | On a YARN Cluster | 53 |
| 9.3 | Databricks Cloud | 53 |
| 9.3.1 | Creating a Cluster | 54 |
| 9.3.2 | Running Sparkling Water | 54 |
| 9.3.3 | Running PySparkling | 55 |
| 9.3.4 | Running RSparkling | 56 |
| 10 | Running Sparkling Water in Kubernetes | 57 |
| 10.1 | Internal Backend | 57 |
| 10.1.1 | Scala | 58 |
| 10.1.2 | Python | 60 |
| 10.1.3 | R | 62 |
| 10.2 | Manual Mode of External Backend | 63 |
| 10.2.1 | Scala | 63 |
| 10.2.2 | Python | 66 |
| 10.2.3 | R | 68 |
| 10.3 | Automatic Mode of External Backend | 70 |
| 10.3.1 | Scala | 70 |
| 10.3.2 | Python | 72 |
| 10.3.3 | R | 75 |
| 11 | Sparkling Water Configuration Properties | 77 |
| 11.1 | Configuration Properties Independent of Selected Backend | 77 |
| 11.2 | Internal Backend Configuration Properties | 83 |
| 11.3 | External Backend Configuration Properties | 85 |
| 12 | Building a Standalone Application | 88 |
| 13 | A Use Case Example | 90 |
| 13.1 | Predicting Arrival Delay in Minutes – Regression | 90 |
| 14 | FAQ | 93 |
| 15 | References | 98 |