After some of the discussion about R squared and sample size in my other post and its comments, I thought I should take a better look at it. I’m glad I did, because I was reminded of something I should have known but didn’t get 100% right (I should say I haven’t checked my comments since this afternoon, so apologies if someone already caught this). The question at hand is how R squared reacts to increasing sample size.
First, a quick reminder about R squared. In a simple regression (e.g., one predictor variable and one variable to be predicted), R squared is literally the correlation coefficient squared. In multiple regression, R squared is the squared correlation of the actual dependent variable values (the thing being predicted) and the predicted values from the regression equation. In either case, R squared tells you the proportion of the variance in the dependent variable that your independent variable(s) explain. Thus it can range from 0 to 1; you can explain nothing about the variable, or everything, or any proportion in between.
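The post doesn’t include any code, but the simple-regression fact is easy to verify numerically. Here’s a minimal sketch in Python with numpy (the data and seed are made up for illustration): it computes R squared directly from the residuals of a fitted line and confirms it equals the squared correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated variables (illustrative data, not from the post)
x = rng.normal(size=200)
y = 0.75 * x + rng.normal(size=200)

# Correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

# Fit a simple regression y ~ x and compute R squared the long way:
# R squared = 1 - SS_residual / SS_total
slope, intercept = np.polyfit(x, y, 1)
pred = slope * x + intercept
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# In simple regression, R squared equals the squared correlation
print(abs(r_squared - r**2) < 1e-10)  # True
```

This identity is exactly why, in the simulations below, collecting the R squared from each regression is the same as collecting the squared sample correlation.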
So, how does sample size influence R squared? What I should have remembered, and the graphs below will illustrate, is that a larger sample size gives you a better estimate of the ‘real world’ correlation. Correlations are, after all, subject to random error, and a larger sample reduces that error and moves your estimate closer to the true value. My simulation worked like this. I looked at sample sizes ranging from 2 to 900, 30 of them in total. For each sample size, I created a data sample that contained two variables with a correlation of .75. I ran a regression on that sample predicting one variable from the other and collected the R squared. I created 10 samples for each sample size, so in the end I have 300 samples along with their sample size and R squared. If I make a scatterplot of those two things, I get the graph below.
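For readers who want to reproduce something like this, here’s a sketch of the simulation in Python with numpy. I’ve assumed the 30 sample sizes are roughly evenly spaced between 2 and 900 (the post doesn’t say exactly how they were chosen), and I draw each sample from a bivariate normal with population correlation .75.

```python
import numpy as np

rng = np.random.default_rng(1)

# Population correlation and covariance matrix for the two variables
rho = 0.75
cov = [[1, rho], [rho, 1]]

# 30 sample sizes from 2 to 900 (assumed roughly evenly spaced)
sizes = np.linspace(2, 900, 30).round().astype(int)

results = []  # (sample size, R squared) pairs
for n in sizes:
    for _ in range(10):  # 10 independent samples per size
        x, y = rng.multivariate_normal([0, 0], cov, size=n).T
        r = np.corrcoef(x, y)[0, 1]
        results.append((n, r**2))  # R squared = r^2 in simple regression
```

Plotting `results` as a scatterplot (sample size on the x-axis, R squared on the y-axis) with a horizontal line at .75² reproduces the sideways-funnel pattern described below.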
The line on the graph is set at the ‘true’ R squared, which is .75^2. What you can see is that with small sample sizes the R squared is very noisy. The reason I started with sample size=2 is because you have to have at least two points to calculate a correlation, but with only two you will always have a perfect relationship (two points define a line). Depending on the noise, the correlation could come out as either -1 or +1, but either way the R squared will always be 1. As the sample size increases from 2 to about 100, you notice that the band of dots narrows, sort of like a funnel on its side. This is because as sample size increases, you’re reducing random noise and getting a better indication of the true relationship between the two variables. From about 200 out to 900, the band is roughly the same size. There’s still variability at each sample size value, but it’s due to noise in the sample (e.g. the random samples are drawn independently and may have values a little different from the .75 correlation I specified).
That exercise isn’t exactly like what happens if you look at wins and salary in the NBA, which is what prompted the discussion. In that case you’re more likely to look at overlapping samples than completely independent samples. The simulation above, for example, might be appropriate if you looked at the relationship for the years 2008-2010 versus only 1995; you’ll get a better estimate for 2008-2010 because there’s a larger sample, but those samples are also independent. I think it’s more likely that people would look at the previous season (say 2010) or the past few seasons (say 2008-2010). In that case the samples aren’t independent because they both include some of the same data. To simulate this kind of analysis, I created one data set with 900 points (again, two variables with a correlation of .75). It turns out that the particular sample I got had a correlation more like .76. Then I ran a regression predicting one variable from the other using different sample sizes from those 900 observations. I started with the first two observations, then the first three, first four, and so on up to all 900. I again collected the R squared for each model and plotted it against the amount of data (the sample size) used in the regression.
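This growing-window version is also easy to sketch in Python. Again I’m assuming a bivariate normal draw with population correlation .75; the realized sample correlation will differ a bit (in the post it came out around .76), which is why the reference line uses the sample value rather than .75.

```python
import numpy as np

rng = np.random.default_rng(2)

# One fixed data set of 900 points with population correlation .75
rho = 0.75
x, y = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=900).T

# The realized sample correlation won't be exactly .75
true_r2 = np.corrcoef(x, y)[0, 1] ** 2

# Growing-window R squared: first 2 points, first 3, ..., all 900.
# Consecutive windows overlap almost entirely, so the estimates
# are not independent, unlike the earlier simulation.
r2_path = []
for n in range(2, 901):
    r = np.corrcoef(x[:n], y[:n])[0, 1]
    r2_path.append(r**2)

# The path starts at exactly 1 (two points) and wanders toward true_r2
```

Plotting `r2_path` against the window size, with a horizontal line at `true_r2`, gives the second graph: a spike at 1, an early noisy stretch, and eventual convergence.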
The line on the graph is again the true value, this time the sample correlation instead of the one specified in creating the sample. You can see that again with two observations the R squared is 1, as it must be, but then it drops dramatically and quickly heads up towards .58 (which is the .76 sample correlation squared). So the individual estimates over the first 50 observations or so were fairly noisy, but with all 50 of those observations we get a decent estimate of the true relationship. But then the R squared drops again, before coming back up to .58 and more or less staying there from sample size 250-ish on. So after the first 50 observations, it turned out that the next 200 or so muddied the water (were noisy) enough to give us bad estimates. But with great sample size comes great estimates, and the regression is spot-on (in this particular sample) with a sample size of around 700.
I should note that I’m not saying that you need 200 or 700 observations to get an accurate R squared; that’s just what happened in these simulations. But in either case (overlapping or independent samples), more is obviously better, and increasing your sample size will lead you asymptotically to the true value. This must be the case; if you had the ability to measure the entire population you were interested in without error, you would have to find its true R squared (and correlation, mean, variance, etc.).
Returning to the NBA salary issue, I made a mistake in my response to Phil. I agreed with him that a larger sample should lead to a larger R squared. That will only be true if the sample this year happened to give you a value below the true one. So in Phil’s old post, if the value was .256, we have to assume that’s the true value. Fortunately there’s more than one season of data available, so we can improve that estimate, but there’s no particular reason to assume it will go up.
Alternatively, there could be a reason to think the R squared value will increase. I (and others, including Phil) have argued that salary is really standing in for other variables, like team quality. To the extent that players are paid appropriately for their ability, salary will serve as an indicator of team quality, and thus wins. The low R squared we actually see indicates that players are not being paid completely appropriately. But, if player evaluation is getting better with time, the connection between salary and wins should strengthen, and thus the R squared will increase. So if, for example, it turned out that the wins-salary R squared from 2005-2010 (or going forward into the future) is higher than from 2000-2005, that might be an indication that player evaluation is getting better across the league.