It’s going to be quiet around here for a little while. I’m moving (twice) and defending my thesis in the upcoming month, and I don’t know how often I’ll be able to get a post up. But I saw a couple of things that inspired me to do a little more stats education. Today: correlation.

I’ve talked about correlation before (for example). It’s a one-number indication of the linear association between two variables; if both tend to go up or down together (like height and weight) you get a number closer to 1; if one tends to go down while the other goes up (like weight and miles run), you get a number closer to -1; if they move fairly independently of each other (like weight and social security number), you get a number closer to 0. The symmetry of correlation is obvious; since you only get one number from two variables, it can’t matter which is which (in the usual stats parlance of X and Y, either height or weight could be X or Y and everything would turn out the same). If you’ve taken a stats class you hopefully heard ‘correlation does not mean causation’, and the symmetry should make that clear. In an example from a class I took, let’s say you found a positive correlation between number of storks seen and babies born in that area. Does that mean storks bring babies? No; you have just as much evidence that babies bring storks. You may have other evidence or theories that suggest the causality runs one way versus the other, but the correlation won’t help you out.
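As a minimal sketch of the description above, here is a hand-rolled Pearson correlation in Python. The `pearson` helper and the height, weight, and mileage figures are made up for illustration; they are not real measurements and do not come from any data set used in this post.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative numbers (assumptions, not real data).
height = [60, 62, 65, 68, 70, 72, 75]
weight = [115, 120, 140, 155, 160, 180, 200]
miles_run = [30, 28, 22, 18, 15, 10, 5]

r_hw = pearson(height, weight)     # both go up together: close to 1
r_wh = pearson(weight, height)     # same pair, roles of X and Y swapped
r_wm = pearson(weight, miles_run)  # one goes up as the other goes down: close to -1

print(r_hw, r_wh, r_wm)
```

Swapping the arguments gives the same number, which is the symmetry described above: the calculation has no notion of which variable is the cause.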

Correlation is essentially the same thing as a simple regression: if you standardize both variables, the regression slope is the correlation. So everything you know about correlation applies to regression, like interpretation of causation, and everything you know about regression applies to correlation, like missing variables and linear association (if two variables have more of a U-shaped association, neither a regression nor correlation will describe it well). But perhaps the biggest favor you can do a reader is to clearly describe what you’re correlating; after that, they should be able to follow what you’ve done and interpret your conclusions correctly. When I first read this post I was excited because I was curious if 4th quarter scores really were more important to winning a game than first quarter scores. But I was confused; the correlation between overtime differential and final differential is 0? The correlation should be near 1; if you outscore your opponent by 4 in overtime, your final differential must also be 4 because the game would be over. The only reason the correlation wouldn’t be 1 is if the game were still tied at the end of the first overtime. However, that isn’t exactly what he’s looking at; instead he’s looking at first quarter (and second, third, etc.) differentials over the course of the whole season and how they correlate to a team’s total differential over the whole season. In that case the correlations are more of an indication of how much each quarter’s performance tells you about a team’s overall quality. The small correlation for overtime makes sense since overtime is essentially a coin flip; good and bad teams have roughly equal chances of winning if they get to overtime.
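To make the link between correlation and simple regression concrete, here is a small Python sketch. The helper names and the data are hypothetical, chosen only for illustration. It checks two things: the least-squares slope on standardized variables equals the correlation, and a perfectly U-shaped relationship gets a correlation of zero.

```python
import math

def mean(v):
    return sum(v) / len(v)

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y)) /
            sum((a - mx) ** 2 for a in x))

def standardize(v):
    """Rescale to mean 0 and standard deviation 1."""
    m = mean(v)
    sd = math.sqrt(sum((a - m) ** 2 for a in v) / len(v))
    return [(a - m) / sd for a in v]

# Made-up data (an assumption for illustration only).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 4.2, 4.8, 6.5, 6.1, 8.3, 8.0]

r = pearson(x, y)
slope_std = ols_slope(standardize(x), standardize(y))  # equals r
slope_yx = ols_slope(x, y)  # raw-unit slope predicting y from x
slope_xy = ols_slope(y, x)  # raw-unit slope predicting x from y

# A perfect U shape: y depends entirely on x, but not linearly.
u_x = [-3, -2, -1, 0, 1, 2, 3]
u_y = [a ** 2 for a in u_x]
r_u = pearson(u_x, u_y)  # zero, despite the perfect relationship

print(r, slope_std, slope_yx * slope_xy, r_u)
```

In raw units the two regression slopes differ depending on which variable you predict, but their product is the squared correlation, so the correlation is still carrying the same information.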

A different post, at the APBR board, also led to confusion because of a poor description of what was being correlated. The Weak Side Awareness post has an unclear title but presents the original data so that you can figure out what was being correlated. Mike G says that he correlated free throw attempts for players (total over the season) with other boxscore stats. However, minutes has nearly a 0 correlation and some other stats have a negative correlation. How could that be? Playing more minutes means the player is on the court and can accumulate more of all stats; free throw attempts should have a positive correlation with virtually all box score stats. In player data I have for 2009-2010 (two seasons), the correlation between free throw attempts and all other box score stats is positive; the only negative correlations in the whole data set are between three pointers made or attempted and either offensive rebounds or blocks, and even they are only -.09 (a fairly weak correlation). For example, the correlation between free throw attempts and minutes played is .804; the correlation between free throw attempts and turnovers is .861; and the correlation between minutes and turnovers is .894.

Instead, in a much later post after multiple attempts at interpreting the correlations, Mike gave a still unclear description that makes it sound like he performed more of a multiple regression than a correlation. Multiple regression is still related to correlation, but in this case it’s a semi-partial correlation. A semi-partial correlation is the correlation between a variable (say, turnovers) and another variable (say, free throw attempts) after a third variable (minutes) has been partialled out of one of the variables. ‘Partialled out’ means that you run a regression predicting one variable from another, and use the residuals. In this case, you would predict free throw attempts from minutes played. Minutes does a good job of explaining free throw attempts, but doesn’t explain it perfectly. The residuals contain the information in free throw attempts that is not explained by minutes played. The semi-partial correlation would be the correlation between turnovers and those residuals; we’re looking at what turnovers tells us about free throw attempts that is not already explained by minutes played. The semi-partial correlation between turnovers and free throw attempts, partialling minutes out of free throw attempts, is only .239, a far cry from the .861 we saw earlier. If you were to partial minutes out of both free throw attempts and turnovers, you would be calculating the partial correlation. The partial correlation between free throw attempts and turnovers is .535.
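The "regress, then correlate the residuals" recipe can be sketched in Python as follows. This is an illustration under stated assumptions, not the calculation actually run for this post: the six players are hypothetical, and all function names are invented here. For three variables the residual approach is equivalent to standard formulas that only need the three pairwise correlations, so the sketch also plugs in the rounded correlations quoted above (.861, .804, .894).

```python
import math

def mean(v):
    return sum(v) / len(v)

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

def residuals(y, x):
    """What is left of y after a simple regression of y on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y)) /
             sum((a - mx) ** 2 for a in x))
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]

def semipartial(r_xy, r_xz, r_yz):
    """Correlation of y with x after z is partialled out of x only."""
    return (r_xy - r_xz * r_yz) / math.sqrt(1 - r_xz ** 2)

def partial(r_xy, r_xz, r_yz):
    """Correlation of x and y after z is partialled out of both."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Rounded correlations quoted in the post: x = free throw attempts,
# y = turnovers, z = minutes.
semi_post = semipartial(0.861, 0.804, 0.894)   # about 0.239
partial_post = partial(0.861, 0.804, 0.894)    # about 0.534

# Hypothetical season totals for six players (assumption, not real data).
minutes = [500, 1200, 1800, 2400, 2900, 3100]
fta = [40, 150, 160, 330, 300, 520]
tov = [30, 70, 120, 130, 210, 230]

fta_resid = residuals(fta, minutes)  # free throw attempts not explained by minutes
tov_resid = residuals(tov, minutes)  # turnovers not explained by minutes
semi_resid = pearson(tov, fta_resid)
partial_resid = pearson(fta_resid, tov_resid)

r_ft, r_fm, r_tm = pearson(fta, tov), pearson(fta, minutes), pearson(tov, minutes)
semi_formula = semipartial(r_ft, r_fm, r_tm)
partial_formula = partial(r_ft, r_fm, r_tm)
```

On the toy data the residual route and the formula route agree. With the post's rounded inputs the formulas give about .239 and .534; the small gap from the reported .535 is what you would expect from using three-decimal inputs.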

What do these correlations tell you? The semi-partial correlation tells you that if you adjust minutes out of free throw attempts, there is still a positive correlation with turnovers. However, it is much smaller than the ‘naive’ correlation. The effect of minutes is still in turnovers; if we partial minutes out of both, the correlation remains positive and increases to .535. This makes sense to me; players who have the ball a lot are the ones able to accumulate both turnovers and free throw attempts. In essence, the partial and semi-partial correlations are ways of looking at the association between two variables after other ones have been controlled for (you can partial out as many variables as you want; of course, you can still only look at the correlation between two variables in the end). As a comparison, the correlation between turnovers per minute and free throw attempts per minute (another way to account for minutes played) is .335.

Each of these numbers is based on a different aspect of the connection between free throw attempts, turnovers, and minutes played, but all indicate that the association between free throw attempts and turnovers is positive after accounting for minutes played, although not nearly so strong as is suggested by the ‘normal’ correlation. You could also run a multiple regression predicting free throw attempts from turnovers and minutes played (either scaling the variables or not) and you would see a positive coefficient for turnovers, again indicating that more turnovers means more free throw attempts.
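For the scaled (standardized) version of that regression, the two coefficients can be worked out from the pairwise correlations alone. The sketch below uses the standard two-predictor formula with the rounded correlations quoted earlier, so the results are approximate; the function name is invented for this illustration.

```python
def standardized_betas(r_y1, r_y2, r_12):
    """Standardized coefficients for predicting y from predictors 1 and 2.

    r_y1, r_y2: correlation of the outcome with each predictor.
    r_12: correlation between the two predictors.
    """
    denom = 1 - r_12 ** 2
    beta_1 = (r_y1 - r_y2 * r_12) / denom
    beta_2 = (r_y2 - r_y1 * r_12) / denom
    return beta_1, beta_2

# Outcome: free throw attempts. Predictor 1: turnovers. Predictor 2: minutes.
# Inputs are the rounded correlations quoted in the post.
beta_tov, beta_min = standardized_betas(0.861, 0.804, 0.894)
print(beta_tov, beta_min)  # roughly 0.71 and 0.17
```

The turnovers coefficient comes out positive, in line with the partial and semi-partial correlations. Because turnovers and minutes are correlated at .894, the denominator is small, which is where the multicollinearity caveat comes from.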

With all that being said, these correlations are not perfect. There is still the issue of missing variables and multicollinearity, just as there would be if you had only run a multiple regression. But as long as you’re clear in describing what you’ve done and everyone understands (or you describe) the limitations of your technique, everyone should come out ahead.

“if one tends to go down while the other goes up (like weight and calorie intake)”

You sure about that? I’m going about shedding those pounds the difficult way!

Yeah, my bad. More calories = more weight. How about weight and number of celery sticks consumed? Assuming that everything I’ve heard is true about it having fewer calories in it than it takes to chew it …

> But I was confused; the correlation between overtime differential and final differential is 0?

> The correlation should be near 1

So in your opinion what’s missing in this explanation:

“Correlation Coefficient between Each Quarter Differential and Final Total Differential”?

To be clear…

> When I first read {this post} I was excited because I was curious if 4th quarter scores

> really were more important to winning a game than first quarter scores

… I should have mentioned that I didn’t sum scores after each quarter?

I thought the individual data points would be games, not teams. So team X wins the first quarter by 4 points and they go on to win by 6 would be one point; team X wins the first quarter by 6 in a different game that they go on to lose by 1, and so on. Instead of 30 final differential data points per season you would have 2460. I didn’t mean to imply that your post had done anything wrong, just that sometimes it’s hard to tell what exactly people are doing from descriptions, especially when you have your own thoughts on how you would do it given the data.

“Instead of 30 final differential data points per season you would have 2460”

Um, but that’s exactly what I’ve done – each of those 3 seasons have 2460 data points…

Again, if it isn’t obvious, how can I fix that?

And what did you expect here: “I was excited because I was curious if 4th quarter scores really were more important to winning a game than first quarter scores”?

Sum of point margin after every quarter?

BTW, I’ve edited aforementioned post, in your opinion is it better now?

I think that’s closer to what I was thinking of, thanks!

I’m still a little unclear on how you generated your original table though. Maybe there’s a game we could use as a reference; what numbers would you use from game 6 of this year’s Mavs-Heat Finals? My original thought was that each data point for a quarter would be the margin of victory in that quarter, regardless of what happened in the other quarters, and the final margin of victory in that game. The Finals game would generate the numbers (for Dallas) 5, -3, 7, 1, and final margin 10. If there were a game that went to overtime the margins might be 4, -5, 6, -5, 2; the final margin would be 2. The margins would not be summed over previous quarters or over teams. That’s why overtime must be near 1; games must be tied to go to overtime and so the margin in overtime will nearly always be the same as the final margin. So I don’t think that’s what you did in your original table. After seeing your later tables I assumed that you had just added up to the team level and then run the correlation, as in Chicago generated the 1st overtime point -6, 600.

The points in the new table you posted, using the Mavs-Heat, would be 5, 2, 9, 10; does that sound right? Each quarter is a running sum, and that would explain why the correlations get closer to 1 across quarters (because the running sum is getting closer to its final value, and every non-overtime game will have a fourth quarter margin equal to the final margin).


“I’m still a little unclear on how you generated your original table though. [...] The Finals game would generate the numbers (for Dallas) 5, -3, 7, 1, and final margin 10. If there were a game that went to overtime the margins might be 4, -5, 6, -5, 2; the final margin would be 2. The margins would not be summed over previous quarters or over teams. ”

That’s exactly how I did it in my original table… but I’ve noticed one little detail… I have a zero as a margin for every overtime [even in games without overtimes]… so that’s probably a reason why you’ve expected something else there.

“After seeing your later tables I assumed that you had just added up to the team level and then run the correlation, as in Chicago generated the 1st overtime point -6, 600.”

No, those are simply additional tables for teams because I’m pretty sure such discussion would bring topic “but team X was awesome in 4th quarters and they were great…”

“The points in the new table you posted, using the Mavs-Heat, would be 5, 2, 9, 10; does that sound right?”

Yes… but again with overtime issue [so those 10s are also across 3 overtimes].

I see. So all those 0s are diluting the actual overtime correlations a lot. That’s interesting data; thanks for posting it!

“So all those 0s are diluting the actual overtime correlations a lot”

I thought it was simply a common denominator… and for league-wide results overtimes felt insignificant because there were very few of them but you are right. It’s fixed. Thanks!
