Phil Birnbaum put up a post yesterday discussing the use (misuse?) of .05 as the p value cutoff for a significant result. P values have come up on my blog before, notably in the discussion of how important usage is, so I thought it would be good to look at Phil’s article and expand a bit.
Selecting an alpha value: As Phil notes, .05 has become the consensus alpha value for significance testing. However, this is only true in the social sciences; alpha values are much smaller in medical testing, for example. Why? Because the cost of a false positive is much higher. If you think that a drug has a particular effect because of a statistical difference (say, between the drug and placebo groups) and you’re wrong, that’s a big mistake. On the other hand, if I claim that some conditions in my psychology experiment produce different levels of accuracy on a memory test and I’m wrong, no one gets hurt. Of course there’s a cost to potentially setting back the progress of science, but we’ll assume that doesn’t cost any lives. Even within social science, .05 is only broadly the norm: people who use fMRI or other brain imaging techniques virtually never use .05 because there are so many tests involved in analyzing the data. So there is nothing special about .05, and people are welcome to use any alpha value they want to determine significance. All you need to do is justify your choice to your audience.
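To make concrete what an alpha of .05 actually buys you, here is a minimal simulation sketch (Python, with made-up data; the variable names are mine): when the null hypothesis is true, a test run at alpha = .05 flags a “significant” result about 5% of the time, which is exactly the false positive rate you agreed to tolerate.

```python
import math
import numpy as np

# When the null hypothesis is true, a test run at alpha = .05 should
# reject ("find significance") about 5% of the time -- that is the
# false positive rate that alpha controls.
rng = np.random.default_rng(0)
alpha, n, reps = 0.05, 50, 5000
rejections = 0
for _ in range(reps):
    sample = rng.normal(size=n)           # data with a true mean of zero
    z = sample.mean() * math.sqrt(n)      # z statistic (variance known to be 1)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p value
    rejections += p < alpha

rate = rejections / reps
print(f"false positive rate at alpha={alpha}: {rate:.3f}")  # close to 0.05
```

Drop alpha to .005 and the rate drops accordingly; the trade-off is that real but small effects become harder to detect.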
Multiple tests: There are a lot of different kinds of statistical tests: t-tests, ANOVAs, regressions, etc. The clearest example of multiple testing comes from t-tests. Say I have three groups and measure the IQs of a number of people in each group. If I want to compare the groups to each other, I would compare A to B, A to C, and B to C. If I run each of them with an alpha of .05, the chance that at least one of those three tests turns out significant by luck alone is actually higher than .05; I controlled the error rate for the individual tests but not for the group of tests (the familywise rate). To put the familywise rate at the level I want (it could still be .05), I need to lower the individual rates. The Wikipedia article that Phil links to has some options, like the Bonferroni correction. The need for corrections is very clear with t-tests because there are obviously a number of tests, but it is still true for ANOVA and regression. It is easy to forget that when you run an ANOVA or regression, you are really testing one thing: the significance of the ANOVA/regression as a whole, also called the omnibus F test. You are testing if *any* of your independent variables has an effect on your dependent variable. However, you cannot tell *which* variable(s) has an effect. To do that you need follow-up tests. In an ANOVA the follow-ups are generally in the form of t-tests, but in a regression they are generally tests of individual beta weights (also t-tests, but not necessarily of the same form). Each of these is a test, and you may (should) want to correct your alpha value to decide what is significant and what isn’t. Unfortunately, most statistical packages give you these other tests for free and even show you whether they’re significant at .05, so thinking about what the tests mean and how to interpret them is generally thrown out the window.
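The familywise arithmetic is easy to sketch. If the three tests were independent (an assumption, but a useful one for intuition), the chance of at least one fluke is 1 − (1 − alpha)^k; the helper functions below are hypothetical names of my own:

```python
def familywise_rate(alpha, k):
    # chance that at least one of k independent tests run at `alpha`
    # comes out significant by luck alone
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(alpha, k):
    # Bonferroni correction: shrink the per-test alpha so the whole
    # family of tests stays at (or just under) the desired rate
    return alpha / k

k = 3  # the three pairwise comparisons: A-B, A-C, B-C
print(familywise_rate(0.05, k))                       # ~0.143, not .05
print(familywise_rate(bonferroni_alpha(0.05, k), k))  # just under .05
```

With three tests at .05 each, the familywise rate is already about .14, nearly three times what you thought you were allowing.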
Multicollinearity: We’ve talked about this before. In short, if your predictors correlate with each other as well as with the dependent variable, you have a multicollinearity issue. When you have multicollinearity you can run into all sorts of problems, like beta weights jumping around depending on what other variables are in your regression and inflated standard errors for those weights. Multicollinearity has nothing to do with the alpha value you choose to indicate significance, but it does change the p value of your test and thus could change your decision about whether some result is significant. Phil’s article has an example of adding or removing variables to change your p values. This only applies when multicollinearity is in play; if all of your predictors were independent of each other, you could add and remove variables as much as you want and the p values would not change.
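Here is a small simulated sketch of that point (the data and variable names are made up): adding a predictor built to be exactly orthogonal to x1 leaves x1’s weight untouched, while adding a near-copy of x1 lets the weight jump around and inflates its variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
noise = rng.normal(size=n)
y = 2.0 * x1 + rng.normal(size=n)  # true relationship runs through x1 only

# A predictor that is nearly a copy of x1 (strong multicollinearity)...
x2_corr = x1 + 0.05 * noise
# ...and one made exactly orthogonal to x1 (no multicollinearity at all).
x2_orth = noise - (noise @ x1) / (x1 @ x1) * x1

def weight_on_x1(X):
    # ordinary least squares; first coefficient is x1's beta weight
    return np.linalg.lstsq(X, y, rcond=None)[0][0]

b_alone = weight_on_x1(x1[:, None])
b_orth = weight_on_x1(np.column_stack([x1, x2_orth]))
b_corr = weight_on_x1(np.column_stack([x1, x2_corr]))

# variance inflation factor for the collinear pair -- huge here
vif = 1 / (1 - np.corrcoef(x1, x2_corr)[0, 1] ** 2)
print(b_alone, b_orth, b_corr, vif)
```

With the orthogonal predictor, b_orth matches b_alone to machine precision; with the near-duplicate, the weight moves and its standard error blows up by a factor of roughly the square root of the VIF.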
Your data: As I talked about in the R squared discussion, sample size has an important effect on p values. The bigger your sample size, the smaller the effect that will come out as significant. This is why you should always consider the effect size as well as the significance of any result you’re interested in.
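A quick sketch with a simple z test (variance assumed known, and a hypothetical effect of 0.1 standard deviations) shows the same tiny effect crossing the .05 line purely because n grew:

```python
import math

def two_sided_p(effect, n):
    # z test of a sample mean equal to `effect` (in sd units), variance known
    z = effect * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p value

# The same 0.1-sd effect, judged at very different sample sizes:
print(two_sided_p(0.1, 100))    # ~0.32, not significant
print(two_sided_p(0.1, 10000))  # vanishingly small p, "significant"
```

Nothing about the effect changed between those two lines; only the sample size did. That is why a p value alone cannot tell you whether an effect matters.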
Determining importance: The last section there is critical. Just because some variable is significant does not mean that it’s important. For example, I argued that usage was not necessarily an important variable in predicting player efficiency; yes, across a big group of players you can predict their efficiency better if you know their usage. However, you don’t predict much better; the R squared (one statistical measure of ‘importance’, which you could use as an effect size) is small. You might also determine importance by comparing standardized beta coefficients in a multiple regression or by looking at elasticity. In short, you need to use your head.
Phil’s batting example: Phil described an example where you see if hitting is better on a particular day of the week. You could test this with ANOVA or regression and they would be functionally identical, but I’ll assume regression since it’s so popular. Your dependent variable would be batting average, or whatever you prefer, and the independent variable would be day of the week. Assuming dummy coding, your stats package would implicitly turn this into six dummy variables, with the seventh day being folded into the intercept; let’s say it’s Sunday. Since each day is independent, there is no multicollinearity issue (note that I don’t mean that batting ability on any day is independent from batting on another; I mean that the variable ‘Monday’ is independent from the variables for every other day).
The test you are running (whether you knew it or not) is whether any of the six other days are different from Sunday. A significant omnibus F test, or test of the multiple R squared, tells you that *at least* one of the other days is different from Sunday. That is the correct interpretation of your regression. If you then look at the p values for each day, those are six tests comparing each individual day to Sunday, and you might want to correct your alpha value for multiple comparisons. To do what Phil describes, in terms of changing your reference day to Monday, you would use a different dummy coding such that Monday gets folded into the intercept. Now all of your beta weights and tests compare the other days to Monday instead of Sunday. But there is a critical piece here: your omnibus F test will not change. That is, regardless of what coding you use, you will always reach the same conclusion about whether day of the week affects batting. If you were to do as Phil describes and change Thursday from not significant to significant, it would be a flaw in your interpretation or description of the data, not in your use of regression or alpha values per se. In Phil’s description, Thursday is different from Monday but not from Sunday. If you cycle through each day as the reference, you are not changing the importance of Thursday, but what you are comparing Thursday to. Reporting your p value or changing your alpha value will not change this. As a side note, you can build comparisons into your regression if you want, such as weekdays versus weekends or (as Phil would prefer) individual days to an average day. Do a Google search for effect coding or contrast coding and I’m sure you’ll find something useful. Using a coding besides dummy coding will give you different references and comparisons, but again it will not change your omnibus F test if you’ve done things correctly.
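To illustrate, here is a simulated sketch (fake batting averages, with day index 6 standing in for Sunday and 0 for Monday; all names are mine): swapping the reference day changes the individual dummy coefficients but leaves the model’s overall fit, and hence the omnibus F, untouched.

```python
import numpy as np

rng = np.random.default_rng(2)
days = np.repeat(np.arange(7), 30)              # 30 "games" per day; 6 = Sunday
avg = rng.normal(0.260, 0.030, size=days.size)  # made-up batting averages

def dummy_design(days, reference):
    # intercept column plus one dummy per non-reference day
    others = [d for d in range(7) if d != reference]
    cols = [np.ones(days.size)] + [(days == d).astype(float) for d in others]
    return np.column_stack(cols)

def fit(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = ((y - X @ beta) ** 2).sum()  # residual sum of squares
    return beta, rss

beta_sun, rss_sun = fit(dummy_design(days, reference=6), avg)
beta_mon, rss_mon = fit(dummy_design(days, reference=0), avg)

rss_null = ((avg - avg.mean()) ** 2).sum()      # intercept-only model

def omnibus_F(rss, n, k=6):
    return ((rss_null - rss) / k) / (rss / (n - k - 1))

# Different individual comparisons, identical overall fit and F:
print(beta_sun[1], beta_mon[1])
print(omnibus_F(rss_sun, days.size), omnibus_F(rss_mon, days.size))
```

Both designs span the same column space, so the fitted values and residuals are identical no matter which day sits in the intercept; only the interpretation of each coefficient changes.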
In general, the value of p values and significance is greatly overstated. Unfortunately, they are still the standard for academic publishing; if you don’t have significant effects, you don’t have much to talk about. But, like many things, this is due more to tradition and lack of knowledge than to objective accuracy. To that end, lowering your alpha value is not going to help a whole lot. A thorough understanding of your data and the tests you ran is vastly more important.