It’s going to be quiet around here for a little while. I’m moving (twice) and defending my thesis in the upcoming month, and I don’t know how often I’ll be able to get a post up. But I saw a couple of things that inspired me to do a little more stats education. Today: correlation.
I’ve talked about correlation before (for example). It’s a one-number indication of the linear association between two variables; if both tend to go up or down together (like height and weight) you get a number closer to 1; if one tends to go down while the other goes up (like weight and miles run), you get a number closer to -1; if they move fairly independently of each other (like weight and social security number), you get a number closer to 0. The symmetry of correlation is obvious; since you only get one number from two variables, it can’t matter which is which (in the usual stats parlance of X and Y, either height or weight could be X or Y and everything would turn out the same). If you’ve taken a stats class you hopefully heard ‘correlation does not mean causation’, and the symmetry should make that clear. In an example from a class I took, let’s say you found a positive correlation between number of storks seen and babies born in that area. Does that mean storks bring babies? No; you have just as much evidence that babies bring storks. You may have other evidence or theories that suggest the causality runs one way versus the other, but the correlation won’t help you out.
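If you'd like to see the symmetry for yourself, here is a minimal sketch in plain Python. The numbers are made up for illustration (they are not real heights, weights, or mileage), and `pearson` is just a helper name I'm using here for the usual Pearson correlation formula.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Invented values, purely for illustration
height = [60, 62, 65, 68, 70, 72, 75]
weight = [115, 120, 140, 155, 165, 180, 200]
miles_run = [30, 28, 22, 18, 15, 10, 5]

r_height_weight = pearson(height, weight)   # both go up together
r_weight_height = pearson(weight, height)   # same pair, X and Y swapped
r_weight_miles = pearson(weight, miles_run) # one goes up, the other down

print(r_height_weight, r_weight_height, r_weight_miles)
```

Swapping the arguments gives the same number, which is the whole point: nothing in the calculation knows which variable is supposed to be the cause.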
Correlation is essentially the same thing as a simple regression: if you standardize both variables, the regression slope is the correlation. So everything you know about correlation applies to regression, like interpretation of causation, and everything you know about regression applies to correlation, like missing variables and linear association (if two variables have more of a U-shaped association, neither a regression nor a correlation will describe it well). But perhaps the biggest favor you can do a reader is to clearly describe what you’re correlating; after that, they should be able to follow what you’ve done and interpret your conclusions correctly.
When I first read this post I was excited because I was curious if fourth quarter scores really were more important to winning a game than first quarter scores. But I was confused; the correlation between overtime differential and final differential is 0? The correlation should be near 1; if you outscore your opponent by 4 in overtime, your final differential must also be 4 because the game would be over. The only reason the correlation wouldn’t be 1 is if the game were still tied at the end of the first overtime. However, that isn’t exactly what he’s looking at; instead he’s looking at first quarter (and second, third, etc.) differentials over the course of the whole season and how they correlate with a team’s total differential over the whole season. In that case the correlations are more of an indication of how much each quarter’s performance tells you about a team’s overall quality. The small correlation for overtime makes sense since overtime is essentially a coin flip; good and bad teams have roughly equal chances of winning if they get to overtime.
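To make the link between correlation and simple regression concrete, here is a small sketch in plain Python. The data are invented, and the helper names (`pearson`, `ols_slope`, `standardize`) are my own. It checks three things: the slope of a regression on standardized variables equals the correlation, the two possible slopes (Y on X and X on Y) multiply to the squared correlation, and a perfectly U-shaped relationship gets a correlation of zero.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def ols_slope(x, y):
    """Slope of the least-squares line predicting y from x."""
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def standardize(values):
    """Rescale to mean 0 and standard deviation 1."""
    m = mean(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return [(v - m) / sd for v in values]

# Invented data with a roughly linear trend
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 4.2, 4.8, 6.5, 6.9, 8.3, 9.0]

r = pearson(x, y)
slope_standardized = ols_slope(standardize(x), standardize(y))
slope_y_on_x = ols_slope(x, y)
slope_x_on_y = ols_slope(y, x)

# A perfect U shape: y depends entirely on x, but not linearly
u_x = [-3, -2, -1, 0, 1, 2, 3]
u_y = [v ** 2 for v in u_x]
r_u_shape = pearson(u_x, u_y)

print(r, slope_standardized, slope_y_on_x * slope_x_on_y, r_u_shape)
```

The U-shaped case is the one to remember: the second variable is completely determined by the first, yet both the correlation and the regression slope see nothing.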
A different post, at the APBR board, also led to confusion because of a poor description of what was being correlated. The Weak Side Awareness post has an unclear title but presents the original data so that you can figure out what was being correlated. Mike G says that he correlated free throw attempts for players (total over the season) with other box score stats. However, minutes has nearly a 0 correlation and some other stats have a negative correlation. How could that be? Playing more minutes means the player is on the court and can accumulate more of all stats; free throw attempts should have a positive correlation with virtually all box score stats. In player data I have for 2009-2010 (two seasons), the correlation between free throw attempts and all other box score stats is positive; the only negative correlations in the whole data set are between three-pointers made or attempted and either offensive rebounds or blocks, and even they are only -.09 (a fairly weak correlation). For example, the correlation between free throw attempts and minutes played is .804; the correlation between free throw attempts and turnovers is .861; and the correlation between minutes and turnovers is .894.
Instead, in a much later post after multiple attempts at interpreting the correlations, Mike gave a still unclear description that makes it sound like he performed more of a multiple regression than a correlation. Multiple regression is still related to correlation, but in this case it’s a semi-partial correlation. A semi-partial correlation is the correlation between a variable (say, turnovers) and another variable (say, free throw attempts) after a third variable (minutes) has been partialled out of one of the variables. ‘Partialled out’ means that you run a regression predicting one variable from another, and use the residuals. In this case, you would predict free throw attempts from minutes played. Minutes does a good job of explaining free throw attempts, but doesn’t explain it perfectly. The residuals contain the information in free throw attempts that is not explained by minutes played. The semi-partial correlation would be the correlation between turnovers and those residuals; we’re looking at what turnovers tells us about free throw attempts that is not already explained by minutes played. The semi-partial correlation between turnovers and free throw attempts, partialling minutes out of free throw attempts, is only .239, a far cry from the .861 we saw earlier. If you were to partial minutes out of both free throw attempts and turnovers, you would be calculating the partial correlation. The partial correlation between free throw attempts and turnovers is .535.
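The residual recipe above can be sketched in a few lines of plain Python. To be clear, the season totals below are invented for ten hypothetical players; they are not the data behind the .239 and .535 figures, and the helper names are my own. The sketch computes the naive correlation, the semi-partial correlation (minutes partialled out of free throw attempts only), and the partial correlation (minutes partialled out of both).

```python
import math

def mean(values):
    return sum(values) / len(values)

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def residuals(outcome, predictor):
    """What is left of `outcome` after a simple regression on `predictor`."""
    mp, mo = mean(predictor), mean(outcome)
    slope = (sum((p - mp) * (o - mo) for p, o in zip(predictor, outcome))
             / sum((p - mp) ** 2 for p in predictor))
    intercept = mo - slope * mp
    return [o - (intercept + slope * p) for p, o in zip(predictor, outcome)]

# Invented season totals for ten hypothetical players (not real data)
minutes = [300, 650, 900, 1200, 1500, 1800, 2100, 2400, 2700, 3000]
fta = [20, 70, 60, 150, 130, 260, 200, 390, 310, 520]
turnovers = [15, 40, 45, 80, 90, 150, 130, 210, 190, 270]

naive_r = pearson(turnovers, fta)

# Partial minutes out of free throw attempts only: semi-partial correlation
fta_resid = residuals(fta, minutes)
semi_partial_r = pearson(turnovers, fta_resid)

# Partial minutes out of both variables: partial correlation
tov_resid = residuals(turnovers, minutes)
partial_r = pearson(tov_resid, fta_resid)

print(naive_r, semi_partial_r, partial_r)
```

One thing the residual approach makes easy to see: the semi-partial correlation can never be larger in magnitude than the partial correlation, because the variable you left alone still carries variance from minutes that the residuals cannot match.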
What do these correlations tell you? The semi-partial correlation tells you that if you adjust minutes out of free throw attempts, there is still a positive correlation with turnovers. However, it is much smaller than the ‘naive’ correlation. The effect of minutes is still in turnovers; if we partial minutes out of both, the correlation remains positive and increases to .535. This makes sense to me; players who have the ball a lot are the ones able to accumulate both turnovers and free throw attempts. In essence, the partial and semi-partial correlations are ways of looking at the association between two variables after other ones have been controlled for (you can partial out as many variables as you want; of course, you can still only look at the correlation between two variables in the end). As a comparison, the correlation between turnovers per minute and free throw attempts per minute (another way to account for minutes played) is .335.
Each of these numbers is based on a different aspect of the connection between free throw attempts, turnovers, and minutes played, but all indicate that the association between free throw attempts and turnovers is positive after accounting for minutes played, although not nearly so strong as is suggested by the ‘normal’ correlation. You could also run a multiple regression predicting free throw attempts from turnovers and minutes played (either scaling the variables or not) and you would see a positive coefficient for turnovers, again indicating that, for a given number of minutes, more turnovers means more free throw attempts.
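For the regression version, here is a rough sketch of a two-predictor least-squares fit in plain Python, solving the normal equations by hand. Again, the player totals are invented for illustration and `fit_two_predictors` is a hypothetical helper of mine, not anyone's published method. In practice you would more likely reach for a stats package, but writing it out shows there is nothing mysterious going on.

```python
def mean(values):
    return sum(values) / len(values)

def fit_two_predictors(y, x1, x2):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    my, m1, m2 = mean(y), mean(x1), mean(x2)
    # Sums of squares and cross-products around the means
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

# Invented season totals for ten hypothetical players (not real data)
minutes = [300, 650, 900, 1200, 1500, 1800, 2100, 2400, 2700, 3000]
fta = [20, 70, 60, 150, 130, 260, 200, 390, 310, 520]
turnovers = [15, 40, 45, 80, 90, 150, 130, 210, 190, 270]

intercept, b_turnovers, b_minutes = fit_two_predictors(fta, turnovers, minutes)

# Residuals: the part of free throw attempts the fit does not explain
fitted = [intercept + b_turnovers * t + b_minutes * m
          for t, m in zip(turnovers, minutes)]
resid = [actual - pred for actual, pred in zip(fta, fitted)]

print(intercept, b_turnovers, b_minutes)
```

The sign of the turnovers coefficient always matches the sign of the partial correlation between turnovers and free throw attempts controlling for minutes, which is why the regression and the partial correlation tell the same story.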
With all that being said, these correlations are not perfect. There is still the issue of missing variables and multicollinearity, just as there would be if you had only run a multiple regression. But as long as you’re clear in describing what you’ve done and everyone understands (or you describe) the limitations of your technique, everyone should come out ahead.