A number of years ago, when I was becoming more familiar with the NBA end of sports statistics, I heard about an unpublished manuscript by Dan Rosenbaum and David Lewin that claimed to show that pretty much any statistic (including simple things like points per game) is equally good at predicting future team wins. It did so by granting each statistic a ‘team adjustment’. This post is going to go through my understanding of that process to show why the analysis was almost guaranteed to find the result it did.
First we need to have some player data. I created simulated data the same way I did for a previous post, so you should go over and take a look if you’re interested. In short, I made four fake statistics each for players arbitrarily labeled ‘center’, ‘PF’, ‘SF’, ‘SG’, and ‘PG’. These variables don’t correlate with each other, but each one correlates with itself across seasons. I made two seasons for every player. Players are then randomly combined into teams. All of my data are created as standardized scores, so the position labels are indeed arbitrary, although I could scale them to actual NBA values if I wanted to. I also created a ‘junk’ statistic for each player in both seasons; it’s just a random variable completely uncorrelated with anything else, including itself from year to year. You’ll see why later.
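If you’d rather see that setup as code, here’s a minimal sketch of how data like these can be generated. I’m assuming numpy; the .7 year-to-year correlation and the variable names are illustrative stand-ins rather than the exact values from the previous post, and 500 teams of five players matches what I describe below.

```python
# A minimal sketch of the simulated player data (assumed values, not the
# exact ones from the earlier post).
import numpy as np

rng = np.random.default_rng(0)
n_teams, n_players = 500, 5
n = n_teams * n_players          # 2500 players, 5 per team
n_stats = 4
r = 0.7                          # assumed within-player year-to-year correlation

# Season 1: four standardized, mutually uncorrelated stats per player
season1 = rng.standard_normal((n, n_stats))
# Season 2: each stat correlates r with its own season-1 value and nothing else
season2 = r * season1 + np.sqrt(1 - r**2) * rng.standard_normal((n, n_stats))
# Junk stat: pure noise, uncorrelated with everything, including itself across years
junk1, junk2 = rng.standard_normal(n), rng.standard_normal(n)
# Random assignment to teams: player i belongs to team i // 5
team_id = np.repeat(np.arange(n_teams), n_players)
```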
Having done that, I need an efficiency metric. I used the fake WP and fake PER metrics from the previous post. I also need my teams to have wins. I gave each team wins by applying my fake NBA Efficiency weights to the team totals of the four statistics (i.e., each statistic summed across the five players on the team), plus a small random noise term (mean 0, standard deviation .03). So now I have five players on a team, each with four statistics (plus the junk one), each player’s efficiency by two metrics, and how many games that team won, all for two seasons. I made 500 teams. I can note that, as it should be, I again found that fake WP correlates better from season to season than fake PER. It turns out that with the noise in this system, team wins only correlates at .63 between the two seasons, despite no players being traded or hurt or what have you. The instability comes from the variability across seasons in the players’ four statistics; I can remove the noise variable from the wins equation and the correlation barely moves.
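Continuing that sketch, the metrics and the wins can be built like this. The weight vectors below are hypothetical stand-ins, not the ones from the previous post; the only property I’m preserving is that the fake NBA Efficiency weights sit closer to the fake WP weights than to the fake PER weights.

```python
# Hypothetical weight vectors; w_eff generates wins, and it is deliberately
# closer to the fake WP weights than to the fake PER weights.
w_eff = np.array([0.50, 0.30, 0.15, 0.05])   # fake NBA Efficiency
w_wp  = np.array([0.45, 0.32, 0.16, 0.07])   # fake WP: close to w_eff
w_per = np.array([0.25, 0.25, 0.25, 0.25])   # fake PER: farther from w_eff

wp1, per1 = season1 @ w_wp, season1 @ w_per   # player metrics, year 1
wp2, per2 = season2 @ w_wp, season2 @ w_per   # player metrics, year 2

def team_sum(x):
    """Sum a player-level column to the team level (5 players per team)."""
    return np.bincount(team_id, weights=x)

# Wins are the efficiency-weighted team totals plus a little noise (sd = .03)
wins1 = team_sum(season1 @ w_eff) + rng.normal(0, 0.03, n_teams)
wins2 = team_sum(season2 @ w_eff) + rng.normal(0, 0.03, n_teams)
```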
Here comes the model evaluation part. Let’s say I want to know which model better explains wins, fake WP or fake PER. I check by adding up the WP/PER values for the players on a team and correlating that sum with the team’s wins. Fake WP does a good job, with r = .94, while fake PER lags behind a bit at .87. This is to be expected, since the weights in fake NBA Efficiency used to calculate wins are closer to the fake WP weights than to the fake PER weights. The effect becomes more pronounced if I use the year 1 team WP or PER values to predict next year’s team wins. The combination of actual production noise and fake PER’s weighting of that production drops the correlation to .48, while fake WP maintains a correlation of .64. So not only does fake WP explain wins better this year, but it also better predicts wins next year.
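The evaluation itself is just a pair of correlations, continuing the sketch above; the values in the comments are the ones from my simulation, not necessarily what this toy version will produce with its made-up weights.

```python
# Sum each metric to the team level, then correlate with wins in the same
# season and with next season's wins.
def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

team_wp1, team_per1 = team_sum(wp1), team_sum(per1)
print(corr(team_wp1, wins1), corr(team_per1, wins1))  # .94 vs .87 above
print(corr(team_wp1, wins2), corr(team_per1, wins2))  # .64 vs .48 above
```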
Here’s where Rosenbaum and Lewin’s mistake comes in. They argue that the actual WP uses a team adjustment that other metrics do not. So they grant a team adjustment to the other metrics. This adjustment comes from the residuals of the regression predicting team wins from the summed team metric scores (that’s my version; their paper uses point differential). The residual of a regression is simply the difference between the actual value and the value the regression equation predicted. For example, if WP thought that the Pistons would win 28 games but they actually won 26, the residual is -2 (the sign convention is arbitrary for most purposes). The residual reflects all the stuff that the variables in your regression don’t explain, such as missing variables (not an issue here), random noise (an issue), or incorrect weightings (an issue if the weights were created by hand, which is true here since I made them up). So as you would expect, the regression using fake PER has larger residuals than the regression using fake WP; it has a lower correlation with wins, which in this case means it makes worse predictions.
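In code, the residuals fall straight out of a simple regression of team wins on the summed team metric; this is one way to do it with numpy’s polyfit, continuing the sketch above.

```python
# Residual = actual wins minus the wins predicted from the summed team metric.
def residuals(team_metric, wins):
    slope, intercept = np.polyfit(team_metric, wins, 1)  # one-predictor regression
    return wins - (slope * team_metric + intercept)

res_wp  = residuals(team_wp1, wins1)   # small residuals: fake WP tracks wins closely
res_per = residuals(team_per1, wins1)  # larger residuals: fake PER tracks wins less well
```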
The adjustment involves taking each team’s residual and giving it back to the players. Since I have five players that all ‘played’ the same number of minutes, I took each team residual, divided it by 5, and added it to each player’s metric. For example: the first team in my data set, according to fake WP, is atrocious. Three of the five players are below average, with one being particularly bad. According to the regression of wins from summed player fake WP, they should have won -.62 games (remember these are scaled scores; think of it as something like a 35-win season). They actually won -.4 games, so they performed better than predicted. The residual is thus .22 (-.4 minus -.62), which is divided by 5 and added to each player. Looking at this ‘adjusted’ fake WP, the players (and team) are still bad, but not as bad as before. Fake PER had a rosier view of the team; it thought one player was pretty bad, one pretty good, and the others roughly average. So the team PER was essentially 0. The PER regression thought the team would be just below average, at -.056. But as we know, they only won -.4 games. So the fake PER residual is -.34, and each player is downgraded.
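Here’s what that adjustment looks like as code, as I understand it, continuing the sketch above: each team’s residual is split evenly among its five players and added to their metric values (the equal split leans on everyone having ‘played’ the same minutes).

```python
# Hand each team's residual back to its players, one fifth apiece.
def team_adjust(player_metric, wins):
    res = residuals(team_sum(player_metric), wins)   # one residual per team
    return player_metric + res[team_id] / n_players  # spread it over the 5 players

adj_wp1  = team_adjust(wp1, wins1)
adj_per1 = team_adjust(per1, wins1)
```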
What effect does this team adjustment have? First, it pushes players closer together in the eyes of the metrics. Whereas fake WP and fake PER only have a correlation of .63 at the player level, adjusted fake WP and adjusted fake PER have a correlation of .72. Second, it pushes teams even closer together than the players. Team-level fake WP and fake PER only had a correlation of .64, but their adjusted counterparts have a correlation of .987. The players are independent of each other and essentially the same, differing only by random noise. When you add five of them together, some of that noise cancels out. When you add five of them together after having already made them more similar, there’s virtually no noise left. Third, as you may have guessed from the second point, it makes the metrics virtually identical in terms of predicting wins. They’re virtually identical because they’ve both hit the ceiling. The team-level adjusted fake WP has a correlation of .999 with team wins, while adjusted fake PER has a correlation of .992. Why? Because you took everything the models couldn’t explain (the residuals) and put it back in the models. Thus they can now explain virtually everything. If we use year 1’s metrics to predict wins in year 2, both adjusted fake WP and adjusted fake PER have correlations of about .62. Remember that wins itself only correlated at .63 across seasons and fake WP had about the same correlation; that’s the functional ceiling on predicting future wins. Adding the team adjustment rockets fake PER up to the same level.
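Re-running the evaluation with the adjusted metrics makes the point, continuing the sketch above; the comments give the values from my simulation.

```python
# Adjusted team totals: nearly interchangeable and nearly identical to wins,
# because the unexplained part (the residual) has been added back in.
team_adj_wp1, team_adj_per1 = team_sum(adj_wp1), team_sum(adj_per1)
print(corr(team_adj_wp1, team_adj_per1))                      # .987 above
print(corr(team_adj_wp1, wins1), corr(team_adj_per1, wins1))  # .999 and .992 above
print(corr(team_adj_wp1, wins2), corr(team_adj_per1, wins2))  # both about .62
```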
Now let’s take it to the extreme. Remember that junk variable I made? It correlates with none of the other variables. If you treat it as its own metric and sum it to the team level, it has a correlation of 0 with team wins. What happens if I give the junk metric the same team adjustment treatment as above? Its correlation with team wins is now .45. The team-level adjusted junk metric in year 1 predicts team wins in year 2 with a correlation of .31. This obviously isn’t as good as fake WP or fake PER, which at least consider the variables that lead to wins, but it’s relatively high and certainly significant. Using Rosenbaum and Lewin’s misguided residual-based team adjustment, even a completely worthless variable gains predictive power.
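The junk metric gets the same treatment in a couple of lines, continuing the sketch above: a variable that knows nothing about wins picks up predictive power purely from the residuals handed back to it.

```python
# Before adjustment the junk metric is unrelated to wins; after adjustment it isn't.
print(corr(team_sum(junk1), wins1))           # essentially 0
adj_junk1 = team_adjust(junk1, wins1)
print(corr(team_sum(adj_junk1), wins1))       # .45 above
print(corr(team_sum(adj_junk1), wins2))       # .31 above
```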
So, to conclude: if you take a model’s residuals and feed them back into the model, it will appear extremely powerful. If you do so in order to compare two models, and both models are at least on the right page with what they’re predicting, the two models will become similarly powerful. This adjustment is so potent that it will take a completely unrelated variable and make it look decent. Again, this happens because you basically take everything the model couldn’t explain and allow the adjusted model to explain it. And it obviously isn’t a good thing; it’s a fairly blatant misuse of statistics.