Hey everyone – sorry for abandoning the blog for a few days there. I caught whatever’s been going around, and was spending some quality time with my couch and less time with everything else. But, right after this I’ll get to responding to the comments I know I have piled up.

I’m also sorry to say that I’m going to be putting the blog on hiatus for a month or so. Things have been very busy at work and I’ll be traveling a fair amount in April anyway, and I need to refocus my energies elsewhere. But don’t go deleting your RSS feed or anything quite yet, because even if I don’t break down and post anything in the meantime I’ll definitely be back with my usual NHL and NBA playoff predictions.

(EDIT: I immediately regretted my decision after reading Bill Simmons’ article where he used Ricky Rubio’s 33rd-best-point-guard PER rating as a reason to dislike advanced stats instead of as a reason to dislike PER, and then he referred to Chris Paul as a guy “who will never, ever be traded” THE SAME YEAR HE JUST GOT TRADED!)

On the plus side, this gives me an opportunity to say thanks to everyone for reading, which is certainly part of the drive to keep getting posts up. My colleagues may not appreciate it, but I do! And under the jump here I have one last bit of statistical insight before I disappear.

Following on the discussion over at Phil Birnbaum’s blog, I’m going to look at the role that sample size plays in correlations. Correlations are run extremely often, and if regression is more your thing, then hopefully you know that the R squared value is just the square of the correlation between the actual outcome (Y) observations and the predicted outcomes according to the regression model. So if you ran a regression and came out with a low R squared, you can take it to mean that the combination of predictors you used has little correlation with the variable you care about. In other words, the correspondence is noisy. Dave Berri looked at the correlation between a quarterback’s performance (using a few different measures) in one year and the next and found it to be relatively low. The ‘relatively’ is in comparison to basketball players, who show fairly high correlations in their statistics from one season to the next (even when adjusted to rates so that minutes played isn’t a factor). Phil had a variety of issues with that particular study, but in his last post he argues that the QB correlation is lower because of a smaller sample size compared to basketball players.
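As a quick R sanity check (using made-up data, nothing from the studies above), the R squared reported by a regression is exactly the squared correlation between the outcomes and the fitted values:

```r
# Toy illustration: R squared from lm() equals cor(observed, fitted)^2.
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 2 * x1 - x2 + rnorm(100)        # made-up outcome
fit <- lm(y ~ x1 + x2)
summary(fit)$r.squared               # R squared from the regression
cor(y, fitted(fit))^2                # the same number, computed as a correlation
```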

One thing you could mean by sample size is the number of observations that go into your correlation. For example, you could have 400 basketball players, and thus 400 dots in your scatterplot, versus 100 quarterbacks. I’m going to skip over this one quickly because a) it isn’t what Phil meant and b) I covered it already. The quick answer is that more observations give you a more precise estimate of the ‘true’ correlation, but the correlation itself doesn’t get any bigger or smaller per se with increasing sample size. The value of the correlation is just a bit noisy and can jump around the true value. However, if you look at my second graph in that older post, it does seem like the correlation is mostly hanging out below the true value. To demonstrate that that was a fluke, here is the same graph from a new sample. Below that is the same graph but with 100 runs all superimposed on each other. You see that at low sample size the correlation can just as easily be too high as too low, but as you get a larger sample, you home in on the actual value. You can only really see it at the far left and right sides of the graph, but I added a line at correlation = .75 (the true value) as a guide. And if you count the number of dots above .75, it’s essentially 50%, as it should be.

So what did Phil actually mean? By sample size, he was talking about how many observations go into a particular dot in our scatterplot. For example, with quarterbacks maybe you would correlate their yards per pass in year 1 with their yards per pass in year 2. If you have 100 QBs, you have 100 dots. But each dot, each QB’s yards per pass, is based on some number of passes. In the NBA, each player’s dot for field goal accuracy is based on some number of shots. Phil argued that since NBA players take many more shots than NFL QBs throw passes, the correlation nearly always (often?) should be lower for QBs.

And I owe Phil an apology for disagreeing in his comments; he’s absolutely correct. Take an extreme example and say that NBA players, or QBs, got to take a billion shots (or passes) every year. Their field goal percentage, or passer rating, or whatever measure you like, would at that point be an essentially perfect indicator of their true ability. And if true ability were constant from year to year, the correlation would be 1. When players don’t get that many opportunities, the number we see in a given year is a less precise indicator of their true ability, and across two seasons players can move past each other based on random noise. Now, in any one sample (by which I mean a chance to see N players throw passes in two seasons), the correlation could be roughly anywhere. But on average, the correlation will be near 0 with very few passes per year and will increase towards 1 with more passes per year.
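A small R sketch of that intuition (the per-pass noise level of 5 and the 200 players are arbitrary choices of mine, not fitted to any league): talent is held constant across the two years, and each season's number is talent plus noise whose SD shrinks as attempts grow. The correlation climbs toward 1 as attempts increase.

```r
# More attempts -> a season average washes out more noise -> higher correlation.
set.seed(2)
talent <- rnorm(200)                           # fixed talent for 200 players
cors <- sapply(c(10, 1000, 100000), function(n) {
  year1 <- talent + rnorm(200, 0, 5 / sqrt(n)) # season mean of n noisy attempts
  year2 <- talent + rnorm(200, 0, 5 / sqrt(n))
  cor(year1, year2)
})
round(cors, 3)                                 # climbs toward 1 with attempts
```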

However, that isn’t the whole story. There are at least two sources of noise that will make the correlation less than one. One is change in talent across seasons; players age (for better or worse), get injured, change systems, change teammates, etc. So a 45% shooter this year may not be a 45% shooter next year. The second is how loosely any given act is connected to the player’s actual talent. Suppose we were using adjusted net yards per attempt (ANYA) to evaluate QBs. The actual NFL range of quality runs from something like 3.5 to 7 ANYA, according to Brian Burke. But any one pass can contribute anywhere from -45 (an interception) to 119 (a 99-yard touchdown pass). Thus the range of outcomes for a single pass is gigantically wider than the actual spread in talent across QBs. Both of these sources of noise lower the correlation we see from year to year, although only the first one (changes in actual talent) can keep the correlation below 1 even with infinite opportunities to pass.
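If you assume each pass is true talent plus independent noise and talent doesn't change at all, the trade-off can be written down directly: the expected year-to-year correlation is var(talent) / (var(talent) + var(noise)/attempts). Here's that formula in R, with SD values that are illustrative guesses rather than fitted NFL numbers:

```r
# Expected year-to-year correlation when each season is the average of
# 'attempts' noisy expressions of a fixed talent level.
yr_cor <- function(sd_talent, sd_noise, attempts) {
  sd_talent^2 / (sd_talent^2 + sd_noise^2 / attempts)
}
yr_cor(sd_talent = 0.7, sd_noise = 15, attempts = 500)    # huge per-pass noise, modest attempts
yr_cor(sd_talent = 0.7, sd_noise = 15, attempts = 50000)  # noise washes out with enough attempts
```

With no per-attempt noise at all the function returns 1 regardless of attempts, matching the point above that only talent change can hold the correlation below 1 in the limit.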

To see how these things can trade off against each other, I wrote some code that simulates two seasons for N players where they each get to take X passes. Those two seasons can happen a bunch of different times so that the possible range of outcomes can be seen. Feel free to play with it to see how much the correlation can change depending on the spread in talent, how talent changes from year to year, and how any given pass corresponds to true talent. The simulation is, obviously, agnostic as to what exactly is happening, so it works equally well for NBA shooters or pretty much anything as long as you think talent is normally distributed and the expression of talent could be described as normally distributed (shooting percentages can’t be, since they’re bounded between 0 and 1, but you could pretend you standardized them or something similar). But, what you’ll find is that you can change a few different options and change the average year-to-year correlation.

So Phil is right that more opportunities will lead to a larger year-to-year correlation. But I can’t let him completely off the hook because it isn’t clear-cut to me that Phil’s claim that the NFL/NBA difference is ‘mostly’ sample size is true; it’s certainly part of the difference, but it’s hard to say how much because we don’t know how the leagues differ on any of the other moving parts of the correlation. For example, free throw percentage is more consistent than field goal percentage. But there are fewer free throws taken per year than field goal attempts; the pattern runs completely opposite to Phil’s claim. So there are other things going on besides sample size, but how much is talent distribution, how much is year-to-year change, how much is the noise in a single shot? I’m not sure that’s definitively knowable, and the same issue applies to QB performance versus field goal attempts.

This code makes the second graph from above, in case anyone was curious:

library(MASS)

samplesize = 900
correl = .75 # the 'true' correlation
means = c(0, 0)
covar = matrix(c(1, correl, correl, 1), 2, 2)
allcorrel = NULL # holds the correlation estimates
allsample = NULL # holds the sample size behind each estimate
counter = 1
for (b in 1:200) {
  data = mvrnorm(samplesize, means, covar)
  for (a in 3:samplesize) { # correlate the first a observations, for growing a
    x = cor.test(data[1:a, 1], data[1:a, 2])
    allcorrel[counter] = x$est
    allsample[counter] = a
    counter = counter + 1
  }
}
plot(allsample, allcorrel)
abline(h = correl) # guide line at the true correlation

This code will do the year-to-year correlation with different opportunities per player. I’ll try to comment things as well as I can, but hopefully it’s understandable anyway. I wouldn’t copy this straight into R if the formatting looks odd, and you can delete the stuff after the # if you want (those are just the comments). Note that as opposed to the code above, there’s no inherent ‘correct’ amount of correlation unless you remove the noise from the ‘talent2’ line, in which case the correct answer is 1; with enough passes, the correlation should always be 1 if talent doesn’t change from year to year.

samplesize = 100 # how many QBs/players
reps = 25 # how many passes per year/opportunities per dot
correlations = NULL # will hold the year-to-year correlations
for (c in 1:500) { # look at 500 year-to-year samples
  talent = rnorm(samplesize, 0, 1) # talent is normally distributed. make the 1 larger for a wider distribution
  talent2 = talent + rnorm(samplesize, 0, .5) # how much talent changes year to year. make it bigger by increasing the .5 or take it away by multiplying rnorm by 0
  year1 = NULL # holds year 1 outcomes
  year2 = NULL # holds year 2 outcomes
  for (b in 1:samplesize) {
    temp1 = 0 # will track year 1 passes
    temp2 = 0 # will track year 2 passes
    for (a in 1:reps) {
      pass1 = talent[b] + rnorm(1, 0, 30) # each pass is an expression of talent level. note I have the noise set very high here
      pass2 = talent2[b] + rnorm(1, 0, 30) # same but for year two
      temp1 = temp1 + pass1 # sum up over passes to get a total for a single year
      temp2 = temp2 + pass2
    }
    year1[b] = temp1/reps # outcome for one player in one year is the average over all his passes
    year2[b] = temp2/reps
  }
  x = cor.test(year1, year2) # runs the year-to-year correlation across all players
  correlations[c] = x$est # stores the correlation for this run
}
hist(correlations) # makes a histogram of the correlation outcomes
max(correlations) # these print the maximum, minimum, and mean just to be clear
min(correlations)
mean(correlations)

Just some example results: with the values above, I get a mean correlation of .03 and a range from -.26 to .32. If you just double the number of players (samplesize=200), the mean correlation is, as promised, unchanged, but the range shrinks a bit (-.16 to .24). If instead you double the reps to 50, the correlation moves up to .05 and the range doesn’t change much. Double the reps again to 100 and the correlation moves up to .1, with the range shifting up to -.19 to .37. If I keep the reps there and increase the spread of talent by changing the 1 to a 3, the correlation jumps to .5 with a range of .25 to .7 (if players are more spread out, there’s less of a chance they’ll randomly move past each other). Keeping reps at 100 but cutting the year-to-year noise from .5 to .25 doesn’t do much (the correlation stays at .1). And finally, if you change the single-pass outcome noise from 30 to 15, the correlation jumps to .3 and the range runs from .01 to .57.

Let’s hear it for common ground! Glad we’re (mostly) all on the same page.

However, I disagree with this statement: “it isn’t clear-cut to me that Phil’s claim that the NFL/NBA difference is ‘mostly’ sample size is true; it’s certainly part of the difference, but it’s hard to say how much because we don’t know how the leagues differ on any of the other moving parts of the correlation.” To a significant degree this is knowable. If ability doesn’t change, the correlation is a function of the spread in talent and binomial variance. I forget the exact formula, but maybe Phil can tell us; perhaps it’s r = 1 – SD(error)/SD(true)? We know the sample sizes for these stats, so that gives us var(error). And by subtracting var(error) from the observed variance, we can determine var(true). So if someone wants to, they can calculate the r^2 each stat will have if talent is unchanged. Then you can see whether these differences are nearly as large as those Berri observes; if so, we know that sample size and talent spread alone are producing the illusion of more/less “consistent” play. And the differences between actual and projected R^2 are a measure of the changes in true talent.

For example, I’m sure we would find the high R^2 for FT% is largely explained by the very large spread in skill, which must be much larger than the spread in FG%. There is probably some real difference too, since distance/difficulty of shots will vary, but I’d guess talent spread explains most of it. (You also have to be careful with FG%, since that can be impacted by the number of 3P attempts. It’s probably better to look at 2- and 3-point attempts separately.)

R-squared = 1 – var(error)/var(observed), which is the same as var(true)/var(observed).

R is the square root of that.
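In R, that formula is a one-liner; plugging in the FT% numbers that appear further down gives the same .948:

```r
# Reliability of a stat: var(true)/var(observed), where
# var(true) = var(observed) - var(error).
reliability <- function(var_obs, var_err) (var_obs - var_err) / var_obs
reliability(var_obs = 0.0875^2, var_err = 0.0199^2)   # about .948
```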

I might have time tonight to download some stats and estimate the r^2 for skill in FT% …

OK, here’s what I got. I took the top 50 (only) in FTA for 2004-05. The mean number of attempts was 471.

The variance of overall FT% among those 50 players was .0875^2.

Using binomial, the variance of luck over 471 attempts is .0199^2.

So, variance(talent) = .0875^2 – .0199^2 = .0852^2.

Because .0852^2/.0875^2 = .948, the r-squared for talent vs. outcomes is .948. If you’re comparing season to season, you have to square that (I’m pretty sure), which gives you .899.

An r-squared of .899 equals an r of .948.
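For anyone following along in R, here is the same arithmetic. The league-average FT% of .75 is my assumption; I chose it because it reproduces the .0199 binomial SD over 471 attempts:

```r
# FT% reliability for the top 50 in FTA, 2004-05, using the numbers above.
sd_obs <- 0.0875                    # SD of observed FT% among the 50 players
p <- 0.75                           # assumed league-ish FT% (my assumption)
n <- 471                            # mean number of attempts
sd_luck <- sqrt(p * (1 - p) / n)    # binomial SD: about .0199
var_tal <- sd_obs^2 - sd_luck^2     # talent variance: about .0852^2
var_tal / sd_obs^2                  # reliability: about .948
(var_tal / sd_obs^2)^2              # season-to-season r-squared: about .899
```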

That’s what you should see, from season to season, if you look at the top 50 only. Caveats:

1. Using the mean of 471 for all players is a shortcut. I’d have to think about the “real” way to do it. If you use more players, with a wider variance of attempts, I suspect the mean becomes less accurate.

2. This assumes these same 50 players will also have 471 attempts next year. If you try it, I’d actually limit the players to those who had 400 both seasons. It’s selective sampling, but since there isn’t much selection against bad foul shooters, you’ll probably be OK.

3. The more players you include (past the top 50 in attempts), the fewer attempts each has, so the higher the binomial variance compared to the talent variance. And so your correlation should drop. If every player had only 1/4 the number of attempts, the binomial variance would be multiplied by 4: instead of .0199^2, you’d have .0398^2. That would reduce the r-squared to .82. Not a huge reduction, but that’s because the talent spread is wide relative to the binomial noise.
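The quarter-attempts scenario in R, holding the talent variance fixed while the luck variance quadruples:

```r
# Fewer attempts per player: same talent spread, four times the luck variance.
var_tal <- 0.0852^2
var_luck <- 0.0398^2                 # i.e., 4 * 0.0199^2
var_tal / (var_tal + var_luck)       # reliability drops to about .82
```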

The effect would be higher in baseball, because the spread of talent is tighter. In the NBA, some guys shoot 80% and some guys shoot 50%. In MLB, some guys hit 21% and some guys hit 33%. Not as big a spread in MLB, and, I would guess, tighter around the average.

OK, I ran real-life numbers for the top 50 FTA players in 2004-05, and those same players in 2005-06 (except Amare Stoudemire, who didn’t play much in 05-06).

The correlation for FT% was .819. I had predicted it should be .948.

Big difference. Why? Talent changes.

Suppose talent stayed the same from year to year. You’d expect only 1 in 20 players to change by more than 2 binomial SDs from year to year. (Actually, that might be a bit higher, because the second-year FTAs for most players are smaller than in the first year. But never mind that for now.) And even fewer players more than 3 SDs away: maybe 1 in 100.

But there were more than that. Yao Ming, Paul Pierce, Mike Bibby, Dwight Howard, Jalen Walker, Pau Gasol, Antoine Walker. And, Drew Gooden dropped from .810 to .682, in about the same number of attempts (310 one year, 258 the next). That’s almost 4 SDs. Tyson Chandler was above 4 SDs.

Any ideas what’s going on with these guys?

Gooden’s change, for one, is impressively large. But the confidence intervals for the two seasons are not as far apart as you might guess: someone who makes 251 of 310 attempts (81%) has a 95% confidence interval of 76% to 85%, and 176 of 258 (68.2%) has an interval of 62% to 74%. Binomial noise alone could get those numbers fairly close.
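Those intervals can be reproduced with a quick normal-approximation binomial CI in R:

```r
# 95% normal-approximation confidence interval for a makes/attempts proportion.
ft_ci <- function(makes, attempts) {
  p <- makes / attempts
  p + c(-1.96, 1.96) * sqrt(p * (1 - p) / attempts)
}
round(ft_ci(251, 310), 3)   # 2004-05: about .766 to .853
round(ft_ci(176, 258), 3)   # 2005-06: about .625 to .739
```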

Right. But there were way too many extreme values to be just binomial noise. There should have been 2 or 3 out of 50 bigger than 2 SD: there were 9. There should have been close to zero at 4 SD, but there were 2. So something else is going on.
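For reference, the expected counts among 50 players under pure binomial noise (using a normal approximation):

```r
# Expected number of players beyond 2 and 4 SDs if only binomial noise operates.
beyond2 <- 50 * 2 * (1 - pnorm(2))   # about 2.3 players
beyond4 <- 50 * 2 * (1 - pnorm(4))   # essentially zero
c(beyond2, beyond4)
```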

That list of guys (from Yao Ming to Antoine Walker) was everyone at 2 SDs or more, by the way. I didn’t make that clear in the post.

Desmond Mason was also more than 4 SDs.

“then he referred to Chris Paul as a guy “who will never, ever be traded” THE SAME YEAR HE JUST GOT TRADED!”

I think he meant “right now from the Clippers” which makes sense.

BTW, I had the same reaction to Rubio’s comment 😉