I’ve emphasized using R squared a fair amount on the blog, and I didn’t think it would be a big deal. But apparently it’s more controversial than I thought. So this post is about why you should always check your regression’s R squared before worrying about the significance of the variables in the regression.
First, something you might remember from your intro regression class or the stats book you taught yourself from: in a simple regression (one variable of interest, one predictor), the significance test for the predictor is the same thing as the significance test for the entire regression. And this makes sense; you only have one predictor, so if it isn’t significant, the regression shouldn’t be either, because then you only have an intercept left. But it also means that talking about R squared and the test of the regression is the same as talking about R squared and the test of the predictor.
There’s a straightforward equation that connects the F value of the regression (which you would compare to some critical F value to check for significance) to R squared. It’s F = R*(N-p-1) / (p*(1-R)). R there just means R squared (for simplicity), N is the sample size, and p is the number of predictors. Since we’re talking about simple regression, p is always 1. So we can break the equation into two parts: R/(1-R) and N-2. You multiply those to get your F value and check to see if it’s significant. For example, let’s say you ran a regression that had 50 data points (N=50) and an R squared of .5. You have .5/(1-.5) * (50-2) = 48. You would check a table in your book (or get the value out of your stats software) and see that to be significant at the usual 5% level, the F would need to be 4.04 or greater. Since we’re above that, our regression is significant, which means the predictor is significant. The R squared is .5, which means X explains half of the variance in Y, or tells us about half of why people (or teams, or whatever) vary in their Y values. If Y were weight, X were height, and the R squared were .5, we would know that half of why people have different weights is that they have different heights.
So why did I mention breaking F into two parts? Because one is related to how much explanatory power you have (the R/(1-R) part), and the other is just sample size. If R squared is very tiny, close to 0, then the F will generally be close to 0 and you will not have a significant regression (with an exception coming up in a minute). If it’s large, you will tend to have a large F; say R squared is .9, then that part of F is .9/(1-.9) = 9, and with a factor of 9 already in hand you don’t need much of a sample size to get a significant result. But what happens if you have a lot of data? Say you’re looking at data for each team in the league since the Bobcats joined in 2004. You have six complete seasons of data for 30 teams, or 180 data points. The critical F value for a regression with N=180 and one predictor is 3.89, and the N-2 part is 178. After a little algebra, that means you only need an R squared of about .02 or better to get a significant result. That’s a pretty low bar; your predictor only needs to explain 2% of the variance in your variable of interest. Here’s a graph that shows the R squared necessary for a significant regression at different sample sizes:
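The "little algebra" is just solving R/(1-R) * (N-2) = F for R, which gives R = F/(F + N - 2). A small sketch of that, plugging in the N=180 example:

```python
# Smallest R squared that reaches a given critical F in simple regression.
# Derived by solving R/(1-R) * (n-2) = f_crit for R.
def min_r_squared(n, f_crit):
    """Minimum R squared needed for significance with n data points."""
    return f_crit / (f_crit + n - 2)

# The post's example: N = 180, critical F of 3.89
print(round(min_r_squared(180, 3.89), 3))  # about .021
```

Sweeping this function over a range of N values is how you would reproduce the graph described below.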
You can see that with a low sample size (N=5), you need a fairly high R squared of .77 to get a significant regression (and thus predictor). But as the sample size goes up, the R squared drops very quickly.
So why am I emphasizing R squared? Because it tells you how much of the variance you are explaining; in other words, it tells you how good the predictions you make will be. Let’s look at three different situations, one where the data is pretty clean, one where it’s noisier, and one where it’s very noisy. I created some random numbers to be the predictor variable (X) then created our variable of interest (Y) from the formula Y = 2*X +5 + noise. The bigger the standard deviation of the noise, the more noise there is. The graphs below show you the data for each level of noise along with the line that best fits that data; the R squared and regression equation are in the title.
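That setup can be sketched in a few lines of Python. The post doesn't give the exact noise standard deviations used for its graphs, so the three values here are illustrative stand-ins for the low, medium, and high noise conditions:

```python
import random

random.seed(0)

def fit_simple_ols(xs, ys):
    """Least-squares slope, intercept, and R squared for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Random predictor, then Y = 2*X + 5 + noise at three noise levels
xs = [random.gauss(0, 1) for _ in range(100)]
for noise_sd in (0.3, 1.5, 6.0):  # illustrative low, medium, high noise
    ys = [2 * x + 5 + random.gauss(0, noise_sd) for x in xs]
    slope, intercept, r2 = fit_simple_ols(xs, ys)
    print(f"noise sd {noise_sd}: Y = {slope:.2f}*X + {intercept:.2f}, R squared = {r2:.2f}")
```

The estimated slope and intercept stay near the true values (2 and 5) in every condition, but the R squared falls as the noise grows, which is the point of the graphs.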
You can see the effect of having noisy data: the Y values become more and more spread out relative to the regression line. The regression equations change a bit, but each of them has a significant predictor and contains the actual slope value (2) in its 95% confidence interval (although the high noise in the third case did drag the intercept estimate away from the true value of 5). Besides just looking at the spread of the data around the regression line, we can quantify how good the predictions made by each model are. The typical comparison is between the model you fit (e.g., the regression’s predicted values) and a model that uses only the intercept; that is, you predict the mean of Y every time. Then you find the residual, which is how wrong each prediction is. For example, in the low noise data the lowest X value is -2.795. The actual Y value that goes with that X is -.507. The regression predicts -.631, so the residual is .124. Just using the intercept, we would guess the mean of Y, which is 4.63, obviously much farther from the correct value. If we square all the residuals, add them up, and divide by the number of data points (100) to get the mean squared error, we get .0767 for the low noise regression; its predictions are usually pretty accurate. The mean squared error for the intercept-only model is 3.86, so we’ve improved our predictions (by this measure, at least) by about 4900% by using X to guess Y. In the medium noise condition, our predictions improve by about 110%. In the high noise condition, however, they only improve by about 7.5%. As the noise increases, we gain less and less by using X to make predictions compared to simply guessing the average of Y. The more those two predictions are the same, the less useful X is. But remember, in each case the coefficient for X was appropriate (not significantly different from the true value of 2) and significant (in the high noise case it had a p value of .0078 even though there were only 100 data points).
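Here is a sketch of that comparison on a fresh batch of hypothetical data (same Y = 2*X + 5 + noise recipe; the noise level is illustrative, so the numbers won't match the post's exactly):

```python
import random

random.seed(1)

# Hypothetical data from the post's recipe, with an illustrative noise level
xs = [random.gauss(0, 1) for _ in range(100)]
ys = [2 * x + 5 + random.gauss(0, 2) for x in xs]

def mse(actual, predicted):
    """Mean squared error: average of the squared residuals."""
    return sum((a - b) ** 2 for a, b in zip(actual, predicted)) / len(actual)

# Least-squares slope and intercept for the fitted regression
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

mse_model = mse(ys, [intercept + slope * x for x in xs])
mse_intercept = mse(ys, [my] * len(ys))  # intercept-only model: always guess the mean

improvement = (mse_intercept / mse_model - 1) * 100
print(f"regression MSE {mse_model:.3f}, intercept-only MSE {mse_intercept:.3f}")
print(f"improvement from using X: about {improvement:.0f}%")
```

Incidentally, with MSE defined this way the ratio mse_intercept/mse_model equals 1/(1 - R squared), which is why the low noise case (R squared around .98) works out to roughly a 50-fold, i.e. ~4900%, improvement.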
There are two things you should consider when you evaluate an analysis you ran. One is whether you got statistically significant results. As I noted above, this becomes more and more likely simply by having more data. The smallest data set I can envision in sports research would be across teams, where you would only have 30 data points. But if you look across players in a season you’ll likely have at least 400 points (depending on any minutes cut-offs you might use); if you look across line-ups that play together in a season you’ll have at least 200 (again depending on cut-offs). The sample sizes are usually fairly big and the significance thresholds thus relatively low.
The other thing you need to evaluate is how much your variable matters. Do you really know much more about Y if you know what X is? I showed above that noise in Y can lower the usefulness of X, but usefulness also depends simply on how strongly X and Y are connected in the first place. If you’re trying to predict people’s weights, it would be much better to use X=their height than X=their salary. The R squared tells you how well-connected X is to Y. If the R squared is very low, meaning that X isn’t connected to Y, it’s a sign that using X to predict Y simply isn’t buying you much; you might as well use the mean of Y for every guess. Unlike significance tests, there are no standard thresholds for R squared to tell you if your value is ‘good enough’. In physics, which has precise measurements in controlled environments, R squared values are typically over .9 and sometimes virtually 1. In psychology or sports, where even if measurements are precise the environment is noisy and there are often other influences that could affect how X and Y are connected, R squared values of .5 or .3 might be good. But R squared has a fixed range, from 0 to 1 (it’s a proportion; in percentage terms you can explain 0% of Y up to 100% of Y). And I think everyone can agree that there simply isn’t a lot going on if you can only explain, say, 5% of Y. If that’s the case and you have a significant predictor, especially if you have a good amount of data, you have to temper your enthusiasm about any conclusions you want to draw.