June 7th, 2013

BIG VS. LITTLE: P-Values and Coefficients

The Quick and Dirty:

For the moment, let’s assume that we have some a priori hypothesis that we want to test. We can talk about two things: how big the relationship is and how strong it is. P-values don’t care about big – they only care about strong.
To get a sense for this, recall from ANOVA the fairly common test statistic F. We decide whether there is reason to believe that our data reflect a true underlying relationship between the variables of interest based on which side of the rejection threshold, the critical F, the F statistic generated by our data falls.
We have some chosen level α (normally .05 by convention), which fixes the critical F. In the simplest sense, F depends on the number of conditions we’re testing, the number of observations, and how much of the variance in the dependent variable is explained by the treatment variable of interest relative to what is left unexplained. Regardless of whether you are working from a logit regression or a Gaussian one, most statistical software, including R, will give you a P-value. Each independent variable in your model gets its own P-value, which tells you about that variable. You can see it in the far right column of the R output below.

glm(formula = HrsSlp ~ +HrsTVR)


             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    1.7483      1.2957    1.349   0.19037
HrsTVR         0.4615      0.1449    3.186   0.00412
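The threshold comparison described above can be sketched in R. This is only an illustration on simulated data: the variable names `hrs_tv` and `hrs_slp`, the seed, and the coefficients used to generate the data are assumptions, not the numbers behind the output shown here. It compares the observed F against the critical F from `qf()` and then computes the matching P-value with `pf()`.

```r
# Simulated data (hypothetical; not the data set used in this post).
set.seed(1)
n <- 25
hrs_tv  <- runif(n, min = 0, max = 8)
hrs_slp <- 2 + 0.5 * hrs_tv + rnorm(n, sd = 1.5)

fit <- lm(hrs_slp ~ hrs_tv)

# summary()$fstatistic is a named vector: value, numdf, dendf.
f_info <- summary(fit)$fstatistic
f_obs  <- unname(f_info["value"])
df1    <- unname(f_info["numdf"])
df2    <- unname(f_info["dendf"])

alpha  <- 0.05
f_crit <- qf(1 - alpha, df1, df2)                  # rejection threshold (critical F)
p_val  <- pf(f_obs, df1, df2, lower.tail = FALSE)  # upper-tail probability of F

# The two routes always lead to the same decision.
reject_by_threshold <- f_obs > f_crit
reject_by_p_value   <- p_val < alpha
cat("F =", f_obs, " critical F =", f_crit, " p =", p_val, "\n")
```

Comparing `f_obs` to `f_crit` and comparing `p_val` to `alpha` are two views of the same test, which is why software can report only the P-value.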

The P-value is the probability of a test statistic at least as extreme as the one you observed, computed under the assumption that there is no real relationship. It lets us skip the cumbersome calculation of a threshold statistic entirely: it states directly the chance that, if all of your assumptions hold and you repeat the sampling (or your experiment) in the same way, you will observe a test statistic at least as extreme as the one you have now. It doesn’t tell you the probability that your hypothesis is true.
In the output above I ran a really quick little glm of Hours of Sleep as a function of Hours of TV. Ignore everything else, and note that the P-value associated with HrsTVR is 0.00412 – not surprising, considering that the two variables came from a set of numbers that I cooked up to play around with (and so are related by design). We reject our tacit assumption that the two variables we’re interested in aren’t related if P < α, and here 0.00412 is well below .05.
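As a rough sketch of where that far right column comes from, the snippet below fits a Gaussian `glm` on simulated data that is related by design and then recomputes the P-value by hand from the t value. The data, seed, and names (`hrs_tv`, `hrs_slp`) are made up for illustration and will not reproduce the exact numbers above.

```r
# Simulated data, related by design (hypothetical values).
set.seed(42)
n <- 25
hrs_tv  <- runif(n, min = 0, max = 8)
hrs_slp <- 2 + 0.5 * hrs_tv + rnorm(n, sd = 1.5)

fit   <- glm(hrs_slp ~ hrs_tv)        # family defaults to gaussian
coefs <- summary(fit)$coefficients    # columns: Estimate, Std. Error, t value, Pr(>|t|)

t_val      <- coefs["hrs_tv", "t value"]
p_reported <- coefs["hrs_tv", "Pr(>|t|)"]

# Two-sided tail probability of the t statistic on the residual degrees of freedom.
p_manual <- 2 * pt(abs(t_val), df = df.residual(fit), lower.tail = FALSE)

alpha <- 0.05
reject_null <- p_reported < alpha
cat("t =", t_val, " reported p =", p_reported, " manual p =", p_manual, "\n")
```

The hand-computed value matches the one R reports, which shows that the P-value depends only on the t value (estimate divided by standard error) and the degrees of freedom, not on the size of the estimate by itself.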

Moving along

Let’s assume that the two variables we’re interested in aren’t really that related. In the example, I’ve chosen the Number of Hours of Sleep and the Number of Hours of TV Watched (by the guy three houses away).
I operationalized this in R as `glm(formula = HrsSlp ~ +HrsTVNR)` for n = 25 observations and got the following output:

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    5.3218      0.7607    6.996  3.95e-07
HrsTVNR        0.1454      0.2674    0.544  0.592

Not surprising – a P-value of 0.592 tells us that an estimate this far from zero would turn up more often than not even if some guy’s TV habits had no effect at all on hours of sleep, so we have no grounds to reject that assumption. But what if the two are really related? What if we just by chance sampled 25 nights that are totally uncharacteristic?
Below are the P-values for increasingly large samples of observed nights of sleep and TV watching. In these runs the P-values get smaller as the data get larger. If there is any association in the data at all, however tiny, then continuing this way we would eventually reach a point where we have so many observations that significance can be taken for granted; we will have not only passed the threshold P < α, but we will have passed it long ago.
n      P-value
50     0.353
150    0.220
300    0.152   (estimated coefficient: 0.08578)
At this point you should be bothered; we have two totally unrelated things, and at n = 25 the P-value gave us no reason to think that one variable had much to do with the other, yet it keeps shrinking as the sample grows. We’re good, though; as the P-value demonstrates increasing confidence, the estimated coefficient is going toward zero (0.1454 at n = 25, 0.08578 at n = 300), which is exactly what should happen. As n increases, our conclusion that X and Y aren’t meaningfully related rests less on the relative variance of the two quantities and more on our increasing certainty that an increasingly narrow window around something close to 0 correctly captures the true relationship.
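The big-versus-strong distinction can be illustrated with a small simulation. The sketch below assumes a tiny but nonzero true slope (`true_slope = 0.05`, a made-up value) and refits the model at growing sample sizes; the sample sizes, seed, and noise level are likewise assumptions chosen for illustration. It uses `lm()` rather than `glm()` only for speed; for a Gaussian model the coefficient table is the same.

```r
set.seed(7)
true_slope   <- 0.05   # assumed: a very small but nonzero relationship
sample_sizes <- c(25, 250, 2500, 25000, 250000)

# For each n: simulate, fit, and keep the slope's estimate, standard error and P-value.
results <- t(sapply(sample_sizes, function(n) {
  x <- runif(n, min = 0, max = 8)
  y <- 6 + true_slope * x + rnorm(n, sd = 1.5)
  coefs <- summary(lm(y ~ x))$coefficients
  c(n         = n,
    estimate  = coefs["x", "Estimate"],
    std_error = coefs["x", "Std. Error"],
    p_value   = coefs["x", "Pr(>|t|)"])
}))

print(results)
```

In a run like this the standard error shrinks roughly with the square root of n, so the estimate settles near the small true slope while the P-value eventually drops far below α. The coefficient stays little; only the evidence gets strong. If the true slope were exactly zero, the P-values would not be expected to trend downward in this way.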
