Illustrating Rosenbaum’s Folly

A number of years ago, when I was becoming more familiar with the NBA end of sports statistics, I heard about an unpublished manuscript by Dan Rosenbaum and David Lewin that claimed to show that pretty much any statistic (including simple things like points per game) is equally good at predicting future team wins.  It did so by granting each statistic a ‘team adjustment’.  This post is going to go through my understanding of that process to show why the analysis was almost guaranteed to find the result it did.

First we need to have some player data.  I created simulated data the same way I did for a previous post, so you should go over and take a look if you’re interested.  In short, I made four fake statistics each for players arbitrarily labeled ‘center’, ‘PF’, ‘SF’, ‘SG’, and ‘PG’.  These variables don’t correlate with each other, but they correlate with themselves across seasons.  I made two seasons for every player.  Players are then randomly combined into teams.  All of my data are created as standardized scores, so the player names are indeed arbitrary, although I could scale them to actual NBA values if I wanted to.  I also created a ‘junk’ statistic for each player in both seasons; it’s just a random variable completely uncorrelated with anything else, including itself from year to year.  You’ll see why later.

Having done that, I need an efficiency metric.  I used the fake WP and fake PER metrics from the previous post.  I also need my teams to have wins.  I gave each team wins according to my fake NBA Efficiency weights, multiplied onto the sum of the players’ statistics on a team (i.e., the team-level values for the four statistics) plus a small random noise variable (mean 0, standard deviation .03).  So now I have five players on a team, each with four statistics (plus the junk one), each player’s efficiency by two metrics, and how many games that team won, all for two seasons.  I made 500 teams.  As expected, I again found that fake WP correlates better from season to season than fake PER.  It turns out that with the noise in this system, team wins correlate at only .63 between the two seasons, despite no players being traded or hurt or what have you.  That modest correlation comes from the variability across seasons in the players’ four metrics; I can remove the noise variable from the wins equation and the correlation barely budges.
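For anyone who wants to play along, here is a minimal sketch of how a data set like this could be built.  It is not my original code: the season-to-season correlation (`SEASON_R`) and the win weights (`WIN_WEIGHTS`) are placeholder values chosen for illustration, so the correlations it prints will not match the ones quoted in this post.  Only the 500 teams, five players, four statistics, and the noise standard deviation of .03 come from the description above.

```python
import numpy as np

N_TEAMS, N_PLAYERS, N_STATS = 500, 5, 4
SEASON_R = 0.7    # assumed season-to-season correlation of each statistic
NOISE_SD = 0.03   # noise standard deviation quoted in the post
WIN_WEIGHTS = np.array([0.30, 0.25, 0.20, 0.10])  # hypothetical win weights


def simulate_stats(rng):
    """Return standardized player stats for two seasons, shape (teams, players, stats)."""
    shape = (N_TEAMS, N_PLAYERS, N_STATS)
    year1 = rng.standard_normal(shape)
    fresh = rng.standard_normal(shape)
    # Mixing year 1 with fresh noise keeps unit variance and gives each
    # statistic a correlation of SEASON_R with itself across seasons.
    year2 = SEASON_R * year1 + np.sqrt(1 - SEASON_R**2) * fresh
    return year1, year2


def team_wins(stats, rng):
    """Weighted sum of team-level statistics plus a small noise term."""
    team_totals = stats.sum(axis=1)  # shape (teams, stats)
    noise = rng.normal(0.0, NOISE_SD, size=stats.shape[0])
    return team_totals @ WIN_WEIGHTS + noise


rng = np.random.default_rng(0)
stats_y1, stats_y2 = simulate_stats(rng)
wins_y1 = team_wins(stats_y1, rng)
wins_y2 = team_wins(stats_y2, rng)

season_r_wins = np.corrcoef(wins_y1, wins_y2)[0, 1]
print(f"team wins, year 1 vs year 2: r = {season_r_wins:.2f}")
```

Because the rosters never change, the only things pulling the wins correlation below 1 in this sketch are the season-to-season drift in the player statistics and the small noise term.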

Here comes the model evaluation part.  Let’s say I want to know which model better explains wins, fake WP or fake PER.  I check by adding up the WP/PER values for the players on a team and correlating that with the team’s wins.  Fake WP does a good job, with r = .94, while fake PER lags behind a bit at .87.  This is to be expected, since the weights in fake NBA Efficiency used to calculate wins are closer to the fake WP weights than the fake PER weights.  The effect becomes more pronounced if I use the year 1 team WP or PER values to predict next year’s team wins.  The combination of actual production noise and fake PER’s weighting of that production drops the correlation to .48 while fake WP maintains a correlation of .64.  So not only does fake WP explain wins better this year, but it also better predicts wins next year.
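The comparison just described can be sketched as follows.  Every weight vector here (`true_w`, `wp_w`, `per_w`) is made up for illustration, as is the season-to-season correlation of 0.7; the only design choice that matters is that the stand-in for fake WP sits closer to the win weights than the stand-in for fake PER does.  The printed correlations will therefore differ from the .94/.87 and .64/.48 reported above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_teams, n_players, n_stats, season_r = 500, 5, 4, 0.7

# Hypothetical weights, not the ones from the original simulation.
true_w = np.array([0.30, 0.25, 0.20, 0.10])  # stands in for fake NBA Efficiency
wp_w = np.array([0.28, 0.24, 0.22, 0.12])    # close to the win weights
per_w = np.array([0.10, 0.20, 0.25, 0.30])   # further from the win weights

y1 = rng.standard_normal((n_teams, n_players, n_stats))
y2 = season_r * y1 + np.sqrt(1 - season_r**2) * rng.standard_normal(y1.shape)


def wins(stats):
    """Team wins: weighted team totals plus small noise (sd .03)."""
    return stats.sum(axis=1) @ true_w + rng.normal(0.0, 0.03, n_teams)


def team_metric(stats, weights):
    """Score each player with the metric, then sum the players on a team."""
    return (stats @ weights).sum(axis=1)


def corr(a, b):
    return np.corrcoef(a, b)[0, 1]


wins_y1, wins_y2 = wins(y1), wins(y2)

results = {}
for name, w in {"fake WP": wp_w, "fake PER": per_w}.items():
    metric_y1 = team_metric(y1, w)
    # (same-season correlation, year 1 metric vs year 2 wins)
    results[name] = (corr(metric_y1, wins_y1), corr(metric_y1, wins_y2))
    print(f"{name}: same season r = {results[name][0]:.2f}, "
          f"next season r = {results[name][1]:.2f}")
```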

Here’s where Rosenbaum and Lewin’s mistake comes in.  They argue that the actual WP uses a team adjustment that other metrics do not.  So they grant a team adjustment to the other metrics.  This adjustment comes from the residual of the regression predicting team wins from the summed team metric scores (that’s my version; their paper uses point differential).  The residual of a regression is simply the difference between the predicted value from a regression equation and the actual value.  For example, if WP thought that the Pistons would win 28 games but they actually won 26, the residual is 2 (or -2, the order of subtraction is arbitrary for most purposes).  The residual reflects all the stuff that the variables in your regression don’t explain, such as missing variables (not an issue here), random noise (an issue), or incorrect weightings (if created by hand, which is true here since I made up the weights).  So as you would expect, the regression using fake PER has larger residuals than the regression using fake WP. It has a lower correlation with wins, which in this case means it makes worse predictions.
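To make the residual concrete, here is a small sketch using a plain least-squares line.  The five team values are toy numbers invented for the example, and I use the convention residual = actual minus predicted, so a team that wins more than the metric expects gets a positive residual.

```python
import numpy as np


def regression_residuals(team_metric, wins):
    """Residuals (actual minus predicted) from regressing wins on a summed team metric."""
    # np.polyfit returns coefficients from the highest degree down: slope, then intercept.
    slope, intercept = np.polyfit(team_metric, wins, deg=1)
    predicted = intercept + slope * team_metric
    return wins - predicted


# Toy data: five teams' summed metric scores and their (scaled) wins.
team_metric = np.array([-1.2, -0.4, 0.1, 0.6, 1.5])
wins = np.array([-1.0, -0.6, 0.3, 0.4, 1.3])

residuals = regression_residuals(team_metric, wins)
print(np.round(residuals, 3))

# The Pistons example from the text, under the same sign convention.
pistons_residual = 26 - 28
```

By construction, least-squares residuals sum to zero and are uncorrelated with the predictor in the sample, which is exactly why they contain only what the metric failed to explain.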

The adjustment involves taking each team’s residual and giving it back to the players.  Since I have five players that all ‘played’ the same number of minutes, I took each team residual, divided it by 5, and added it to each player’s metric.  For example: the first team in my data set, according to fake WP, is atrocious.  Three of the five players are below average, with one being particularly bad.  According to the regression of wins from summed player fake WP, they should have won -.62 games (remember it’s a scaled score; think of it as a 35 win season).  They actually won -.4 games, so they performed better than predicted.  The residual is thus .22, which is divided by 5 and added to each player.  Looking at this ‘adjusted’ fake WP, the players (and team) are still bad, but not as bad as before.  Fake PER had a rosier view of the team; it thought one player was pretty bad, one pretty good, and the others roughly average.  So the team PER was essentially 0.  The PER regression thought the team would be just below average, -.056.  But as we know, they only won -.4 games.  So the fake PER residual is -.34, and each player is downgraded.
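The arithmetic for that first team can be checked in a few lines.  The function name is just something I made up for this sketch; the inputs are the predicted and actual values quoted above, with the residual taken as actual minus predicted and split evenly across the five players.

```python
def per_player_adjustment(actual_wins, predicted_wins, n_players=5):
    """Return the team residual and the equal share handed to each player."""
    residual = actual_wins - predicted_wins
    return residual, residual / n_players


# Fake WP predicted -.62 and the team actually won -.4.
wp_residual, wp_share = per_player_adjustment(-0.4, -0.62)

# Fake PER predicted -.056 for the same team.
per_residual, per_share = per_player_adjustment(-0.4, -0.056)

print(f"fake WP:  residual {wp_residual:.3f}, per player {wp_share:.4f}")
print(f"fake PER: residual {per_residual:.3f}, per player {per_share:.4f}")
```

So each player gains .044 under fake WP and loses about .069 under fake PER; the PER residual is -.344 before rounding to the -.34 quoted above.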

Now let’s take it to the extreme.  Remember that junk variable I made?  It correlates with none of the other variables.  If you treat it as its own metric and sum it to the team level, it has a correlation of 0 with team wins.  What happens if I give the junk metric the same team adjustment treatment as above?  The correlation is now .45.  The team-level adjusted junk metric in year 1 predicts team wins in year 2 with a correlation of .31.  This obviously isn’t as good as fake WP or fake PER, which are at least considering the variables that lead to wins, but it’s relatively high and certainly significant.  Using Rosenbaum and Lewin’s misguided residual-based team adjustment, even a completely worthless variable gains predictive power.
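Here is a sketch of the junk-variable experiment.  As with the other examples, the win weights and the season-to-season correlation of 0.7 are assumptions, so the output will be in the neighborhood of the .45 and .31 above rather than identical to them.  The function `team_adjust` is my reading of the residual-based adjustment, not code from the Rosenbaum and Lewin paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n_teams, n_players, season_r = 500, 5, 0.7
true_w = np.array([0.30, 0.25, 0.20, 0.10])  # hypothetical win weights

# Real (simulated) production for two seasons, and the wins it generates.
y1 = rng.standard_normal((n_teams, n_players, 4))
y2 = season_r * y1 + np.sqrt(1 - season_r**2) * rng.standard_normal(y1.shape)
wins_y1 = y1.sum(axis=1) @ true_w + rng.normal(0.0, 0.03, n_teams)
wins_y2 = y2.sum(axis=1) @ true_w + rng.normal(0.0, 0.03, n_teams)

# The junk metric: one random number per player, unrelated to everything.
junk = rng.standard_normal((n_teams, n_players))


def team_adjust(player_metric, wins):
    """Regress wins on the summed team metric, then split each team's residual evenly."""
    team_total = player_metric.sum(axis=1)
    slope, intercept = np.polyfit(team_total, wins, deg=1)
    residual = wins - (intercept + slope * team_total)
    return player_metric + residual[:, None] / player_metric.shape[1]


def corr(a, b):
    return np.corrcoef(a, b)[0, 1]


raw_team = junk.sum(axis=1)
adj_team = team_adjust(junk, wins_y1).sum(axis=1)

r_raw = corr(raw_team, wins_y1)       # junk before adjustment
r_adj = corr(adj_team, wins_y1)       # junk after adjustment, same season
r_adj_next = corr(adj_team, wins_y2)  # adjusted year 1 junk vs year 2 wins

print(f"raw junk r = {r_raw:.2f}, adjusted r = {r_adj:.2f}, "
      f"adjusted vs next season r = {r_adj_next:.2f}")
```

The adjusted team total is the junk sum plus the residual, and since junk explains essentially nothing, the residual is very nearly the team's wins themselves.  That is where the correlation comes from.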

So, to conclude: if you take a model’s residuals and feed them back into the model, it will appear extremely powerful.  If you do so in order to compare two models, and both models are at least on the right page with what they’re predicting, the two models will become similarly powerful.  This adjustment is so potent that it will take a completely unrelated variable and make it look decent.  Again, this happens because you basically take everything the model couldn’t explain and allow the adjusted model to explain it.  And it obviously isn’t a good thing; it’s a fairly blatant misuse of statistics.


14 Responses to Illustrating Rosenbaum’s Folly

1. +1
🙂

3. Guy says:

But they do not then compare the metrics in that initial year (there would be no point to doing that). The clearest indication you didn’t understand the paper is when you write about your simulation: “despite no players being traded or hurt or what have you.” The whole point of (R/L) is to see how well metrics do at predicting FUTURE wins, after team composition changes. That is why they look 2 and 3 years into the future, to maximize the amount of change in team rosters. While the team adjustments make all metrics appear equally good in year N, it doesn’t ensure parity once team composition starts to change in year N+1, N+2, and N+3. At that point, you start to find out which metrics have the strongest “signal” of individual players’ true productivity. If a metric can explain only year N well, but not later years when team composition changes, then it isn’t dividing up credit for team wins accurately. And of course, what they find is that Wins Produced does a terrible job of predicting future wins. This is the whole point of the paper, and you don’t even address it or seem to understand it.

It’s of course true that when R-L add a team adjustment to PER or Efficiency, they may make those metrics better than they currently are. But the team adjustment is very crude: it just allocates everything the underlying metric failed to explain to players based on their MP, as WP sort of does with opponent shooting efficiency. (And of course it has predictive value in your model when added to a junk stat: it’s a partial measure of last year’s efficiency, and you don’t change personnel at all. Why wouldn’t it be predictive?) If WP is doing a better job of estimating the true value of boxscore stats than PER or EFF, then WP should still do a better job of predicting future wins even when all metrics have the benefit of a team adjustment. But it doesn’t. Which tells us that WP does a very poor job of identifying players’ actual contributions to winning.

• Alex says:

I had thought about adding another line to the post, which obviously I should have, noting that the WP team adjustment is not the same as what R/L did. What R/L did will lead any reasonable metric to become extremely strong. However, as Berri has published in other places (including his FAQ), if you apply the team adjustment actually used by WP to other metrics, like PER or NBA Efficiency, they still never reach the descriptive power of WP. So before team adjustment, these measures all include the same box score statistics. If you give them all the WP-style team adjustment, they continue to include all the same statistics. Yet WP predicts 95% of team wins while PER never gets higher than 82%. Why is that? I haven’t seen this adjustment applied to future seasons, but I would be surprised to find that metrics predict future performance relatively better than they predict current performance; I don’t think PER/NBA Efficiency/whatever would catch up to WP.

My post also predicts team performance in the future. Perhaps you didn’t read that far? The R/L adjustment makes my two simulated metrics equivalent at predicting the future as well as the current season. In the R/L paper, their summary of future predictions where all metrics get a team adjustment shows a range of .025 (.747 for PER to .772 for Alternate Win Score) in how they correlate with team wins. Perhaps with all the data these are significant differences, although they are obviously quite small either way. However, I can tell you that how they actually tested it is incorrect. Look at their equations 10 and 11. They claim they are equivalent. The first one tests a null hypothesis of B1=B2, which means that you would accept a model Wins = B0 + B1*(aPM1 + aPM2). The second equation tests lambda two (L2) = 0. That means the null hypothesis would lead to accepting a model Wins = L0 + L1*aPM1. These models are clearly not the same; the tests are not equivalent. Regardless, testing differences in the measures is meaningless. Their team adjustment makes all reasonable models the same, and so they should predict future wins the same besides some random noise.

The difference can be clearly seen in their adjusted +/- analysis. Adjusted +/- should have a ‘WP style’ team adjustment built in; it includes everything that happens on the court, which would include opponent field goal percentage, team rebounds, team turnovers, etc., right? Yet it predicts future wins very poorly. Perhaps adjusted +/- needs a team adjustment?

• EvanZ says:

Alex, right now Hollinger is one of the only folks who is beating Vegas pre-season predictions. WP (WoW bloggers, including myself) is doing very poorly. So, using your logic, PER is “better”. Is that right?

• Alex says:

I can’t find his predictions in the ESPN archive right now, but my understanding is that he doesn’t use PER for the season predictions. More generally, I wouldn’t use the outcome from one season to decide who was doing the best in any event.

For what it’s worth, I know Hollinger uses something more like point differential and home court for picking playoff series in TrueHoop’s stat smackdown, and he’s behind my model, Kevin Pelton, and Jeff Ma over the past four years. In the three years they both picked, Berri beat Hollinger twice.

• Guy says:

Alex: Yes, I’m familiar with Berri’s odd exercise of adding his team adjustment to other metrics. It seems totally beside the point to me; of course each metric needs a unique adjustment to add up to point differential. And can we please drop the pretense that WP “predicts” current point differential? At the team level, WP IS point differential (read R-L’s appendix again if that isn’t clear to you). When you include points scored and points allowed in your metric, that isn’t hard to achieve!

In any case, the less accurate an original metric, the cruder the team adjustment will be, and the metric + adjustment will still provide a poor estimate of an individual player’s true productivity. That will be revealed when trying to predict future wins, as team personnel and MP change. It’s revealing that Dr. Berri never acknowledges the key finding, which is that something as simple as player minutes played plus a crude team adjustment can predict future wins slightly better than WP. Whatever your qualms about the methodology, that tells us these modified metrics must have some important information that WP doesn’t have. If it isn’t a better evaluation of players’ contributions, what do you think it is?

(And yes, you looked at “future” wins. But you kept the players the same! So of course all metrics that sum to differential will do equally well at predicting the next performance of these same players. The whole point is to see what happens when personnel changes, so we know if credit for past wins was apportioned correctly.)

Look, the plain fact is that your entire original post is a confirmation of Rosenbaum and Lewin’s analysis. There isn’t anything substantive there they would disagree with, or that they didn’t already say in their paper. It’s a challenging paper and an untraditional methodology, so I can see how you got confused. But you should really re-read it until you get what they’re doing.

• Alex says:

The R/L team adjustment isn’t ‘crude’, it’s incorrect. It’s also not what WP does. Defensive efficiency is part of the team-level model; the defensive adjustment brings it to the player level the same way that the initial productivity measures bring offensive efficiency to the player level. The only difference is that the box score doesn’t list defensive stats at the player level, so they have to be attributed as team averages to the players on a given team.

And I don’t know why you disagree with adding the actual WP-type adjustment to other metrics. PER, NBA Efficiency, and WP all start by giving players credit for points, FGA, FTA and made, rebounds, steals, assists, etc. So it’s an apples-to-apples comparison to see which is best. The WP team adjustment gives players credit for opponent scoring, team rebounds, and team turnovers. If you give that adjustment to the other metrics, then they all continue to contain the same statistics and thus you have an apples-to-apples comparison with models differing on how they value the various stats. It appears that WP does the best in describing team wins on this equal footing. What’s wrong with this approach?

• Guy says:

“The R/L team adjustment isn’t ‘crude’, it’s incorrect. It’s also not what WP does.”

WP was specifically designed to sum to point differential at the team level. Its team adjustment completes that process. Other metrics, which were designed to measure individual player contributions without particular attention to predicting team wins, will not match point differential using the same adjustment. That doesn’t necessarily tell us which metrics are weighting boxscore stats the best in terms of evaluating individual players. R-L’s method ensures that every metric initially captures all the current team-level information. Then, any differences in predictive power going forward presumably are the result of how well they initially assessed player productivity. (And remember that R-L are not primarily trying to pick a ‘winner’ among the boxscore metrics, but comparing all of them to current NBA decision-making.) The key point is that “predicting” current wins isn’t a very important way to evaluate metrics. After all, WP includes both points scored and points allowed, so how hard is it to “predict” point differential with that information? AFAIK, only Berri thinks this is the important test of a metric, presumably because it’s a game he knows he will always win. The real test is predicting future performance (which Berri has been very careful for 6 years never to do).

But Alex, you seem to be avoiding the main point here. You read a very sophisticated and challenging paper, and it seemed to you that the authors had made a college freshman error (“I can predict results better if I include the residual of my regression”). So you write a snide post with language like “blatant misuse of statistics.” But which is more likely: A) that a PhD economist employed by the President’s Council of Economic Advisors, OMB, and the Cleveland Cavaliers has failed statistics 101, or B) that his complex analysis went entirely over your head? By jumping to conclusion A, all you have done is advertise to the world that B is true.

The same story applies to your critique of Brian Burke’s use of probability in calculating WPA. Or your criticism of Phil Birnbaum, who wrote a very thoughtful essay on what R^2 can and can’t tell us. You actually wrote this, which literally made me laugh out loud: “It’s possible, and in fact likely, that salary is confounded with other factors that might explain wins. For example, maybe better players are paid more…” What in the world did you think Phil meant? Of course he knows there is an intervening step between handing players a check and adding to the win column.

So my unsolicited advice to you is this: when a smart guy like Brian, Phil, or Dan says something that strikes you as a silly, elementary error, consider that an alternative explanation is much, much more likely. That’s not to say you can’t disagree with them — I have disagreed with all three at times. But if you accuse them of doing something extremely foolish, it’s not they who will usually end up looking the fool.

• Alex says:

The analysis didn’t go over my head; I think my conceptual replication shows that I understand what they did pretty well. As I said in my comment, I disagree with what they think it means and does. The paper itself says that the adjustment they use is in the “theoretical spirit” of WP, so I assume that they also know that it isn’t actually what WP does. It’s their version of what they think it does. And I could be wrong, but I disagree that they’re the same.

My post from last night, I think, makes it clear that Phil’s idea about R squared and sample size at least is wrong. I also believe him to be wrong about how to interpret R squared and coefficient significance tests. And, as I said in that post, we all (including Phil) think that salary is a stand-in. I’ve only ever seen him say it in comments, not in his posts themselves.

I’m curious though, did I get something wrong about the models in R/L equations 10 and 11? I’m open to learning if my statistical training has been wrong.

• Guy says:

Alex, I’m at a loss for words here. You wrote an entire post confirming elements of Rosenbaum-Lewin’s analysis, yet thought you were writing a rebuttal. Yet you appear unfazed by — or perhaps still unaware of — this enormous error. So further discussion on this point doesn’t seem productive. I’ll just leave you with this thought (from personal experience, unfortunately): you will find that giving other people some respect — rather than assuming they are idiots — will serve you very well in life. And who knows, you might even learn something.

• Alex says:

I find this a curious comment coming from someone who has come to my website and disrespected me repeatedly. I never called them idiots or fools; I think their analysis is mistaken. Everyone makes mistakes, including myself. I try not to make personal attacks, and I don’t believe I’ve insulted anyone so far (although I might have been too strong with Phil). I’m not sure that’s a claim you can make.

4. nerdnumbers says:

Guy,
A polite suggestion. At this point your comments are starting to get to the length of the posts you’re discussing. As you may have noticed, the Wages of Wins Network started in large part because many of the commenters at DJ’s blog had much more to say than felt right in a comment. WordPress.com and WordPress.org make it super easy to start your own. You could also consolidate thoughts, as I noticed you left a giant comment to Arturo’s tiny update that just had a length to this post.