Phil Birnbaum at Sabermetric Research has a new post up about the FAQ over at Wages of Wins. I don’t have the time to talk about it in detail because a guy’s got to sleep sometime, but I wanted to get something up. As it happens, I disagreed with one of Phil’s posts in the past and wrote a post on an old blog. I’ve copied it below with only minor editing; keep in mind I wrote it in early fall ’09. I don’t read Birnbaum’s blog very often because I’m not interested in baseball, but on the occasions I have read it I always come away with one impression: the man is not to be trusted. At the very least, his grasp on the meaning of R squared is tenuous.
I read something on the internet and it was wrong. Advanced NFL Stats, which I enjoy and usually agree with, posted a round-up of recent articles its author liked. One of them was this, about how R squared is not a useful measure. And it’s mostly wrong. In essence, it’s a response to a post by one of the authors of Wages of Wins (which I’ve read, liked, and agreed with) in which that author claims that there is little relationship between wins and salary in any of the major sports. The evidence is that while there is a positive relationship between the two, the R squared value isn’t very high. Phil Birnbaum says that the real question is, does spending more money lead to more wins? And Phil believes that when you run the regression, in the NBA for example, and find the result wins = .61*salary (in millions) - .76, and the slope (.61) is significant, you have your answer. Now Phil says a number of things I disagree with, so I’m going to step in at different points. Let’s start here.
The conclusion (for Phil) at this point is that there clearly is a relationship, a positive one, between salary and wins. Also, it would be hard to argue otherwise since the equation says that the highest-spending team would be expected to win 60 games while the lowest would win 27. This is, of course, an early and unwarranted conclusion. The regression can only tell you about what is in the regression equation. It’s possible, and in fact likely, that salary is confounded with other factors that might explain wins. For example, maybe better players are paid more, and so (generally speaking) teams that spend more will win more. In fact, when variables are confounded, you can have some crazy stuff happen. An example (taken from a class) would be predicting body fat from tricep skin thickness, thigh size, and midarm size. These variables will obviously be related to each other. If you run a regression containing all three variables, you’ll find in fact that none of the slopes for those variables is significant, but the regression itself is highly significant (p = 7×10^-7, R squared = .80). Why are none of the slopes significant? Because the predictors are so strongly correlated with one another, the regression cannot tell their individual contributions apart, and the standard errors of the coefficients balloon. So if we were to simply trust the regression equation we would be in trouble, even though obviously we have explained something here.
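To put a rough number on how much correlated predictors can hurt, here is a small sketch. It uses the variance inflation factor, which is a standard diagnostic but is my addition and not something from the class example. The two lists are invented measurements (named `tricep` and `thigh` only for flavor), and the helper `pearson_r` is hypothetical. With just two predictors, the factor is 1 / (1 - r²), where r is their correlation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented data (NOT the class data set): two predictors that move together.
tricep = [float(i) for i in range(10)]
thigh = [t + (0.5 if i % 2 == 0 else -0.5) for i, t in enumerate(tricep)]

r = pearson_r(tricep, thigh)

# Variance inflation factor for the two-predictor case: how many times larger
# each slope's sampling variance is than it would be with uncorrelated predictors.
vif = 1.0 / (1.0 - r ** 2)

print(f"r = {r:.3f}, VIF = {vif:.1f}")
```

A large factor means the slope estimates are far noisier than they would be with unrelated predictors, which is how you can end up with no significant slopes inside a highly significant regression.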
Next Phil argues that R squared doesn’t tell you much about the relationship between salary and wins; the fact that in the NBA data from last year the R squared is only 25.6% is inconsequential because the variance could be really big. He makes an analogy to buying a car and points out that the same number can be expressed as different percentages of different numbers (like it could be 700% of your monthly salary, or .01% of Bill Gates’ pocket change). This is simply wrong. Once you have a data set, the variance is set. In terms of wins in the NBA last year, that number is about 199. Now it’s true that the number doesn’t intuitively mean anything, and thus the fact that salary explains 51 of it is somewhat meaningless. But what the 25.6% (51/199) tells us is that a lot (i.e., the other 74%) of what causes different teams to win different numbers of games is *not* explained by salary (in fact, R squared is also called the ‘coefficient of determination’, and is defined as telling you how much of the variance in Y you have explained). More importantly, there are not different numbers that could come in here. We have our data, our number is 199, and we can only explain 51 with the current model.
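If it helps to see the bookkeeping, here is a minimal sketch of R squared as "explained over total". The 51 and 199 are the figures quoted above; the four-point data set and the helper `explained_and_total` are made up purely for illustration.

```python
def explained_and_total(xs, ys):
    """Fit y = slope * x + intercept by least squares and return
    (variation explained by the line, total variation in y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    explained = sum((slope * x + intercept - my) ** 2 for x in xs)
    total = sum((y - my) ** 2 for y in ys)
    return explained, total

# The post's NBA figures: 51 explained out of 199.
nba_r2 = 51 / 199

# Invented toy data, only to show that the ratio falls out of a fit.
explained, total = explained_and_total([1, 2, 3, 4], [2, 1, 4, 3])
toy_r2 = explained / total

print(f"NBA: {nba_r2:.3f}, toy: {toy_r2:.2f}")
```

The denominator depends only on the y values, so for a given data set there is exactly one total to take a percentage of.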
Or maybe I’m wrong? Phil says that you can actually play around with R squared. For example, if you group NBA teams into triplets and work with their combined salary and wins, you change the R squared. For example, you now have the team “Knicks-Cavs-Mavericks”, who spent 276 million dollars to win 148 games. If you run that regression, you get the equation wins = .68*salary - 17.5. So the relationship between salary and wins is pretty much the same: about .68 wins per million dollars, against .61 before. But now the R squared is .49! (These numbers are different from the article, not sure why. But we’ll see soon that it’s irrelevant.) Phil notes that the regression equation hasn’t changed “because we arranged the data differently”, but we have “arbitrarily” increased the R squared. This is also wrong, and in a couple of places. Let’s start with combining the data. Unless there’s a really good reason to do this, and in this case there isn’t, you should never combine your data. Why? Because now you’ve sucked variance out of your data. For example, we’ve removed any differences between the Knicks, Cavs, and Mavericks. You’ve taken information out of the system. This is what we would call a “no-no”, or possibly “data massaging at a level that would get you kicked out of your profession”. It gets worse for Phil: the regression equation only stayed about the same because he grouped teams in order of salary. This means that he has maintained the ordering between salary and wins, and so it maintains the positive relationship. There are still consequences, however: while the slope stays about the same and the R squared goes up, the significance of the slope drops. Let’s say instead that I “arbitrarily” grouped into fives instead of threes. The equation is now wins = .73*salary - 45, the significance on the slope is only a trend (.055), and the R squared is .64. The slope is getting to be kind of different from what we started with, and the significance is dropping quickly.
And it only still looks somewhat ok because we kept the salary ordering (it should also work if you ordered by wins and grouped teams that way). Let’s say instead that I randomize the teams and then group them. The regression will fly all over the place across randomizations, becoming super-significant, non-significant, and everything in between. Both the slope and the R squared will change. This is because you have started messing with how the variance among teams is being ‘hidden’ inside the groups. If you treat your data properly, you *cannot* massage R squared. If you don’t, you can change whatever you want, not just R squared, and the regression equation is not immune.
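Here is a toy version of the grouping game. Everything in it is invented: a 30-team league where wins are 0.6 per million plus a team-level wobble, and the wobble is deliberately built to cancel inside each salary-ordered triplet so the effect is as stark as possible. It is not the NBA data and the numbers will not match the ones above; the helpers `fit` and `group` are hypothetical names.

```python
import random

def fit(xs, ys):
    """Least-squares (slope, intercept, R squared) for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    return slope, my - slope * mx, (sxy * sxy) / (sxx * syy)

def group(xs, ys, size):
    """Collapse consecutive runs of `size` teams into one combined 'team'."""
    gx = [sum(xs[i:i + size]) for i in range(0, len(xs), size)]
    gy = [sum(ys[i:i + size]) for i in range(0, len(ys), size)]
    return gx, gy

# Invented league, already sorted by salary (in millions).
salary = [40 + 2 * i for i in range(30)]
wobble = [9, 0, -9] * 10  # team-level differences that cancel within each triplet
wins = [0.6 * s + w for s, w in zip(salary, wobble)]

raw = fit(salary, wins)                   # one row per team
by_salary = fit(*group(salary, wins, 3))  # triplets in salary order

# Triplets formed after shuffling the teams, for a few different shuffles.
shuffled = []
for seed in range(5):
    order = list(range(30))
    random.Random(seed).shuffle(order)
    xs = [salary[i] for i in order]
    ys = [wins[i] for i in order]
    shuffled.append(fit(*group(xs, ys, 3)))

print("per team    slope %.2f  R2 %.2f" % (raw[0], raw[2]))
print("by salary   slope %.2f  R2 %.2f" % (by_salary[0], by_salary[2]))
for slope, _, r2 in shuffled:
    print("shuffled    slope %.2f  R2 %.2f" % (slope, r2))
```

In this rigged example the salary-ordered triplets fit perfectly because the grouping threw away exactly the team-to-team differences the line could not explain. The shuffled groupings come from the same 30 teams, so any difference between them is produced by the grouping alone.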
Let me give another example as to why grouping teams is nonsensical. Let’s say you’re Mark Cuban, owner of the Mavs. You’ve hired Phil, who runs his regression with team groups. Phil walks in one day with a big smile and says “Hey Mark! Dan Gilbert, owner of the Cavs, just ok’ed a signing which increases the Cavs’ salary by 10 million!”. If you’re Mark, what do you think? The extra 10 million that one team spent means that your group should win an extra 6 games. Will you win any of those games? Will the Cavs win them all? Could the Knicks win more games because the Cavs spent more money? The answer, which should be evident, is that only the Cavs will be affected. However, the regression equation is agnostic. All it says is that the group will win, on average, 6 more games. Stacey Brook, the Wages of Wins guy, would instead walk right in and say “Well, if you’re just going by my equation, you’d better spend some more money to catch up”. Although he’d probably actually tell you to sign people who play well.
However, it should be immediately evident that it is a mistake to follow either equation. Let’s say now that I’m Joe Dumars, in charge of the Pistons. Last year the team spent $71 million (10th in the league) but only won 39 games (good for 8th in the East, but something like 16th or 17th overall in the league). Let’s say you figure that you need to win about 65 games to get first in the East and return to prominence. Following the regression equation, you could figure out that you need to spend about $108 million to expect to win 65 games. So you could decide to take your exact same team from last year and give each of the 15 players a raise of about $2.5 million. This would be the extra $37 million or so you need to get your salary to $108 million and win 65 games. Does this make sense at all? It shouldn’t. And that’s because salary does not in fact explain much about winning. Instead, salary is an intermediate variable that covaries with player quality, and player quality determines who wins. Now you wouldn’t know this if all you knew in the world was wins and salary, which is the case with the regression equation. But the fact that the R squared is relatively low gives you a hint that you might want to look into some other variables and see if they explain more about wins. For example, if you were the Wages of Wins authors, you might explain wins with team offensive and defensive efficiency. It turns out that this explains something like 98% of the variance in wins. That is, if you know a team’s efficiency values, you know almost everything there is to know about how many games they’ll win. There is some other factor which influences outcomes a little bit, but not very much.
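For anyone who wants to check the Pistons arithmetic, here is a quick sketch that simply turns the regression around. The slope and intercept are the ones quoted earlier (wins = .61*salary - .76, salary in millions); the function names are mine.

```python
SLOPE = 0.61       # wins per $1 million of payroll, from the regression above
INTERCEPT = -0.76

def predicted_wins(salary_millions):
    """Wins the equation expects for a given payroll."""
    return SLOPE * salary_millions + INTERCEPT

def salary_for_wins(target_wins):
    """Payroll the equation says you need for a target win total."""
    return (target_wins - INTERCEPT) / SLOPE

current_payroll = 71.0  # Pistons payroll last year, in millions
roster_size = 15

needed = salary_for_wins(65)
extra = needed - current_payroll
raise_per_player = extra / roster_size

print(f"needed payroll: ${needed:.1f}M")
print(f"extra spending: ${extra:.1f}M")
print(f"raise per player: ${raise_per_player:.2f}M")
```

The gap works out to roughly $37 million, or about $2.5 million per player. The equation is perfectly happy with that plan, which is the problem: nothing in it knows whether the money buys better players.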
So to summarize, salary does have something to do with wins, but only if you don’t consider other factors. Even if you do leave it just at salary, you don’t explain very much; a lot of the differences in team wins are due to something other than how much money they spend. And if you start massaging your data to try and make a point, you should probably know what you’re doing.