July 9th, 2013

The MillionSongs Data Part 1: Bells and Whistles of GLM in H2O


Using the Million Songs Data Set, I want to go from beginning to end through H2O's GLM tool. Note that the original data are large, so downloading and fiddling with the full data set can be quite painful if you just do it from your desktop; that said, you can find it here. It's a good opportunity to take a really detailed look at H2O so that you can get the most bump from the trunk (so to speak).
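The walkthrough below uses the browser UI, but if you prefer to script the same steps, here is a minimal sketch using the h2o Python client (a later addition to the platform, not what this post describes); the file path is hypothetical and assumes you already have a local copy of the data.

import h2o

# A minimal sketch, assuming the h2o Python client and a local copy of the data
# at a hypothetical path; the post itself works through the browser UI.
h2o.init()  # connect to (or start) a local H2O instance

# Parsing produces an H2OFrame; its frame_id is the ".hex" key that the
# GLM form's "Key" field refers to.
songs = h2o.import_file("millionsongs_subset.csv")
print(songs.frame_id)
songs.describe()  # inspect the parsed data, much like clicking the key link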

To start, let’s assume you’ve decided that GLM is the method for you. You’ve launched H2O, parsed your data and chosen GLM from the drop down menu under “Model”.
Destination Key  – this is an automatically generated key for your model; it will allow you to recall this specific model and all of its details later in your analysis. While H2O will spit a key out for you, you can also specify a model name such that later you can identify which of many models you are interested in revisiting.
Key – this is the .hex key generated when you parsed your data into H2O. If you didn’t save it at the time, it’s no biggie. The .hex key is named after your original data file, save for the change in extension. If you begin typing the name of your original file, you will be given the option to tab auto-complete. If you want to find the key yourself, go to the “Admin” drop down menu, select “Jobs”, and find “Parse” under the description column. The key for your data of interest is given in the “Destination key” field, and is a clickable link that lets you inspect your data.
Y – Your dependent variable.
X – Once you identify your dependent variable (the value you would like to predict) in the Y field, the X field will auto-populate with all possible options (all of your other variables). Select the subset of variables you would like to use as predictors.
Family – Under family you will see a drop down menu with choices. Each of the four options differs in the assumptions you make about your dependent (Y) variable – the variable you would like to predict. They are explained in some detail below.
Link – Each family is associated with a default link function, the transformation that relates the linear combination of your chosen X variables to the expected value of Y.

Gaussian (default link: Identity) – Your dependent variable (Y) is quantitative, continuous (or continuous predicted values can be meaningfully interpreted), and expected to be normally distributed. EX: the average length of a song in seconds, or the average purchase price of a product.
Binomial (default link: Logit) – Your dependent variable takes on two values, traditionally coded as 0 and 1, and follows a binomial distribution. Choose this if you have a categorical Y with two possible outcomes. EX: a customer decides to purchase or not; a song is played or not played.
Poisson (default link: Log) – Your dependent variable is a count: a quantitative, discrete value that expresses the number of times some event occurred. EX: the number of customers visiting a website over time, or the number of customers visiting a store over distance.
Gamma (default link: Inverse) – Your dependent variable is a survival measure; that is, you have some measure of the duration of a process whose outcome is variable. EX: the length of time an individual remains a customer, or the length of time before a particular product feature fails.
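To make the family and link choice concrete, here is a rough sketch of the same decision expressed in the h2o Python client; in the browser UI these are simply the Family and Link drop downs. No data is attached yet, and the explicit link arguments just mirror the defaults listed above.

from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# Each family pairs with the default link from the list above; passing the link
# explicitly just makes the choice visible.
gaussian_glm = H2OGeneralizedLinearEstimator(family="gaussian", link="identity")
binomial_glm = H2OGeneralizedLinearEstimator(family="binomial", link="logit")
poisson_glm  = H2OGeneralizedLinearEstimator(family="poisson",  link="log")
gamma_glm    = H2OGeneralizedLinearEstimator(family="gamma",    link="inverse")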

Lambda: H2O provides a default value, but it can also be user defined. Lambda is a regularization parameter designed to prevent overfitting: larger values shrink the coefficients more strongly, while smaller values leave the fit closer to the unregularized model. A good value of lambda is one for which the coefficients estimated across the cross validation models stay reasonably consistent with one another.
Alpha: A user-defined regularization tuning parameter that H2O sets to 0.5 by default, but which can take any value between 0 and 1, inclusive. It adds a penalty to the model’s estimated fit so that extra, uninformative parameters are discouraged: an alpha of 1 is the lasso penalty, an alpha of 0 is the ridge penalty, and values in between blend the two.
Lambda and alpha are distinct in purpose: lambda controls how strongly the penalty is applied, and thus how much the model is protected against overfitting, whereas alpha controls the form that penalty takes, that is, the balance between the lasso and ridge components.
N-Folds: The number of cross validations you would like H2O to generate. Choosing 10 means that ten random samples of observations from your original data will be selected and models will be fit to those subsets as well. It’s important to note that the smaller your original data are, the larger the variation you can expect to see in the parameter estimates produced by the cross validation models; for sufficiently small data sets you may want to choose a different evaluation criterion.
Expert Settings: For the moment I would like to leave the expert settings alone, except to note that this is where you choose to standardize your data. When there is a substantial difference in the scale of your input variables, standardizing can greatly improve the interpretability of your results.
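Putting the pieces together, the sketch below maps each form field discussed above onto a parameter of the h2o Python client; the file path, column names, and model name are hypothetical, and the lambda value is just an illustrative user-supplied choice.

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()
songs = h2o.import_file("millionsongs_subset.csv")  # hypothetical local file

glm = H2OGeneralizedLinearEstimator(
    model_id="songs_glm",   # Destination Key: a name you can recall later
    family="gaussian",      # Family (its default Identity link is used)
    alpha=0.5,              # Alpha: mix of lasso (1) and ridge (0) penalties
    lambda_=1e-5,           # Lambda: overall strength of regularization
    nfolds=10,              # N-Folds: number of cross validation models
    standardize=True,       # Expert setting: rescale inputs before fitting
)

# Y is the column to predict; x is the subset of predictor columns.
glm.train(x=["tempo", "loudness", "artist_hotttnesss"],  # hypothetical columns
          y="duration",
          training_frame=songs)

# The Destination Key lets you pull this exact model back up later.
same_model = h2o.get_model("songs_glm")
print(same_model.coef())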
