## Further Thoughts on Significance and Regression

Phil Birnbaum put up a post yesterday discussing the use (misuse?) of .05 as the p-value cutoff indicating a significant result.  P-values have come up on my blog before, notably in the discussion of how important usage is, so I thought it would be good to look at Phil’s article and expand on it a bit.

Selecting an alpha value: As Phil notes, .05 has become the consensus alpha value for significance testing.  However, this is only true in the social sciences; alpha values are much smaller in medical testing, for example.  Why?  Because the cost of a false positive is much higher.  If you think that a drug has a particular effect due to a statistical difference (such as between the drug and placebo groups) and you’re wrong, that’s a big mistake.  On the other hand, if I claim that some conditions in my psychology experiment produce different levels of accuracy on a memory test and I’m wrong, no one gets hurt.  Of course there’s a cost to potentially setting back the progress of science, but we’ll assume that doesn’t cost any lives.  It is also only broadly true that social science uses .05.  People who use fMRI or other brain imaging techniques virtually never use .05 because there are so many tests involved in analyzing the data.  So there is nothing special about .05, and people are welcome to use any alpha value they want to determine significance.  All you need to do is justify your choice to your audience.

Multiple tests: There are a lot of different kinds of statistical tests: t-tests, ANOVAs, regressions, etc.  The clearest example of multiple testing comes from t-tests.  Say I have three groups and measure the IQs of a number of people in each group.  If I want to compare the groups to each other, I would compare A to B, A to C, and B to C.  If I run each of those with an alpha of .05, the chance that at least one of the three tests turns out significant by luck alone is actually higher than .05; I controlled the error rate for the individual tests but not for the group of tests (the familywise rate).  To put the familywise rate at the level I want (it could still be .05), I need to lower the individual rates.  The Wikipedia article that Phil links to has some options, like the Bonferroni correction.  The need for corrections is very clear with t-tests because there are obviously a number of tests, but it is still true for ANOVA and regression.  It is easy to forget that when you run an ANOVA or regression, you are really testing one thing: the significance of the ANOVA/regression as a whole, also called the omnibus F test.  You are testing whether *any* of your independent variables has an effect on your dependent variable.  However, you cannot tell *which* variable(s) has an effect.  To do that you need follow-up tests.  In an ANOVA the follow-ups are generally in the form of t-tests, while in a regression they are generally tests of the individual beta weights (also t-tests, but not necessarily of the same form).  Each of these is a test, and you may (should) want to correct your alpha value when deciding what is significant and what isn’t.  Unfortunately, most statistical packages give you these other tests for free and even flag whether they’re significant at .05, so thinking about what the tests mean and how to interpret them is generally thrown out the window.
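To make the familywise rate concrete, here is a quick sketch of the three-group IQ comparison above.  It treats the three pairwise tests as independent, which is only an approximation (the pairs share groups), but the arithmetic shows why .05 per test is not .05 overall:

```python
# Familywise error rate for three pairwise tests (A-B, A-C, B-C) each
# run at alpha = .05, assuming for simplicity that the tests are
# independent.  Then the Bonferroni correction: divide alpha by the
# number of tests.
alpha = 0.05
n_tests = 3

# Chance of at least one false positive across the family of tests
familywise = 1 - (1 - alpha) ** n_tests
print(round(familywise, 4))  # ~0.1426, well above .05

bonferroni_alpha = alpha / n_tests
print(round(bonferroni_alpha, 4))  # ~0.0167 per test

familywise_corrected = 1 - (1 - bonferroni_alpha) ** n_tests
print(round(familywise_corrected, 4))  # back under .05
```

So running each test at .0167 instead of .05 keeps the chance of *any* false positive in the family at roughly the level you wanted.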

Your data: As I talked about in the R squared discussion, sample size has an important effect on p-values.  The bigger your sample size, the smaller the effect that will register as significant.  This is why you should always consider the effect size as well as the significance of any result you’re interested in.
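A small sketch of that point, using made-up numbers and a simple z-test: the effect is held fixed at half an IQ-type point (tiny, given an SD of 15), and only the sample size changes.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic (normal approximation)."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical numbers: a fixed mean difference of 0.5 points with a
# population SD of 15 -- a very small effect.
effect, sd = 0.5, 15.0

p_values = {}
for n in (100, 1000, 100000):
    z = effect / (sd / math.sqrt(n))  # z = effect / standard error
    p_values[n] = two_sided_p(z)
    print(n, p_values[n])
```

The identical effect goes from nowhere near significant at n = 100 to overwhelmingly significant at n = 100,000; nothing about the effect changed, only the sample size.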

Determining importance: The last section there is critical.  Just because some variable is significant does not mean that it’s important.  For example, I argued that usage was not necessarily an important variable in predicting player efficiency; yes, across a big group of players you can predict their efficiency somewhat better if you know their usage.  However, you don’t predict much better; the R squared (which you can use as an effect size, one statistical measure of ‘importance’) is small.  You might also determine importance by comparing standardized beta coefficients in a multiple regression or looking at elasticity.  In short, you need to use your head.
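Here is an illustration of significant-but-unimportant, with invented numbers: a correlation of r = .05 explains a quarter of one percent of the variance, yet with 10,000 players it is comfortably “significant.”

```python
import math

# Made-up example: r = .05 across n = 10,000 players.
r, n = 0.05, 10000
r_squared = r ** 2  # 0.0025 -- the predictor "explains" almost nothing

# t statistic for testing r = 0; with n this large, t is effectively z,
# so a normal approximation for the two-sided p-value is fine.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
p = math.erfc(abs(t) / math.sqrt(2))

print(round(r_squared, 4))  # tiny effect size
print(round(t, 2), p < 0.05)  # t is around 5: "significant" anyway
```

Significance tells you the effect is probably not zero; the effect size tells you whether anyone should care.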

Phil’s batting example: Phil described an example where you see if hitting is better on a particular day of the week.  You could test this with ANOVA or regression, and they would be functionally identical, but I’ll assume regression since it’s so popular.  Your dependent variable would be batting average, or whatever you prefer, and the independent variable would be day of the week.  Assuming dummy coding, your stats package would implicitly turn this into six dummy variables, with the seventh day folded into the intercept; let’s say it’s Sunday.  Because one day is dropped, no dummy is a linear combination of the others, so there is no perfect multicollinearity issue (note that I don’t mean that batting ability on any day is independent of batting on another; I mean that the variable ‘Monday’ cannot be predicted exactly from the variables for the other days).
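To show what the stats package is doing implicitly, here is a minimal sketch of dummy coding for day of the week (the names and the choice of Sunday as the reference day are just for illustration):

```python
# Dummy coding for a seven-level "day of week" factor.  Sunday is the
# dropped reference level, so its effect is absorbed by the intercept.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
reference = "Sun"
dummy_cols = [d for d in days if d != reference]  # the 6 dummy variables

def dummy_code(day):
    """One row of the design matrix: intercept column + 6 day dummies."""
    return [1] + [1 if day == d else 0 for d in dummy_cols]

print(dummy_code("Mon"))  # [1, 1, 0, 0, 0, 0, 0]
print(dummy_code("Sun"))  # [1, 0, 0, 0, 0, 0, 0] -- just the intercept
```

Each day’s beta weight is then interpreted as the difference between that day and Sunday, which is exactly why one level has to be folded into the intercept.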