Generalized Linear Modeling with H2O
May 2020: Seventh Edition
Contents
Section | Title | Page |
---|---|---|
1 | Introduction | 6 |
2 | What is H2O? | 6 |
3 | Installation | 7 |
3.1 | Installation in R | 7 |
3.2 | Installation in Python | 8 |
3.3 | Pointing to a Different H2O Cluster | 9 |
3.4 | Example Code | 9 |
3.5 | Citation | 10 |
4 | Generalized Linear Models | 10 |
4.1 | Model Components | 10 |
4.2 | GLM in H2O | 11 |
4.3 | Model Fitting | 13 |
4.4 | Model Validation | 13 |
4.5 | Regularization | 14 |
4.5.1 | Lasso and Ridge Regression | 14 |
4.5.2 | Elastic Net Penalty | 15 |
4.6 | GLM Model Families | 15 |
4.6.1 | Linear Regression (Gaussian Family) | 15 |
4.6.2 | Logistic Regression (Binomial Family) | 17 |
4.6.3 | Fractional Logit Model (Fraction Binomial) | 19 |
4.6.4 | Logistic Ordinal Regression (Ordinal Family) | 20 |
4.6.5 | Multi-class Classification (Multinomial Family) | 23 |
4.6.6 | Poisson Models | 24 |
4.6.7 | Gamma Models | 26 |
4.6.8 | Tweedie Models | 27 |
4.6.9 | Negative Binomial Models | 30 |
4.7 | Hierarchical GLM | 32 |
4.7.1 | Gaussian Family and Random Family in HGLM | 33 |
4.7.2 | H2O Implementation | 34 |
4.7.3 | Fixed and Random Coefficients Estimation | 35 |
4.7.4 | Estimation of Fixed Effect Dispersion Parameter/Variance | 35 |
4.7.5 | Estimation of Random Effect Dispersion Parameter/Variance | 35 |
4.7.6 | Fitting Algorithm Overview | 35 |
4.7.7 | Linear Mixed Model with Correlated Random Effect | 36 |
4.7.8 | HGLM Model Metrics | 37 |
4.7.9 | Mapping of Fitting Algorithm to the H2O-3 Implementation | 38 |
5 | Building GLM Models in H2O | 38 |
5.1 | Classification and Regression | 38 |
5.2 | Training and Validation Frames | 39 |
5.3 | Predictor and Response Variables | 39 |
5.3.1 | Categorical Variables | 39 |
5.4 | Family and Link | 40 |
5.5 | Regularization Parameters | 40 |
5.5.1 | Alpha and Lambda | 40 |
5.5.2 | Lambda Search | 40 |
5.6 | Solver Selection | 43 |
5.6.1 | Solver Details | 43 |
5.6.2 | Stopping Criteria | 44 |
5.7 | Advanced Features | 46 |
5.7.1 | Standardizing Data | 46 |
5.7.2 | Auto-remove Collinear Columns | 46 |
5.7.3 | P-Values | 47 |
5.7.4 | K-fold Cross-Validation | 47 |
5.7.5 | Grid Search Over Alpha | 49 |
5.7.6 | Grid Search Over Lambda | 50 |
5.7.7 | Offsets | 52 |
5.7.8 | Row Weights | 52 |
5.7.9 | Coefficient Constraints | 52 |
5.7.10 | Proximal Operators | 53 |
6 | GLM Model Output | 53 |
6.1 | Coefficients and Normalized Coefficients | 56 |
6.2 | Model Statistics | 57 |
6.3 | Confusion Matrix | 59 |
6.4 | Scoring History | 59 |
7 | Making Predictions | 60 |
7.1 | Batch In-H2O Predictions | 60 |
7.2 | Low-latency Predictions using POJOs | 63 |
8 | Best Practices | 64 |
8.1 | Verifying Model Results | 65 |
9 | Implementation Details | 66 |
9.1 | Categorical Variables | 67 |
9.1.1 | Largest Categorical Speed Optimization | 67 |
9.2 | Performance Characteristics | 67 |
9.2.1 | IRLSM Solver | 67 |
9.2.2 | L-BFGS Solver | 68 |
9.3 | FAQ | 69 |
10 | Appendix: Parameters | 69 |
11 | Acknowledgments | 73 |
12 | References | 73 |
13 | Authors | 74 |