If you haven’t already, make sure you check out part 1 for the methods and part 2 for the results. In this post I’ll talk about what the results for the retrodiction contest might mean and some caveats.
One clear result is that APM is not especially accurate in evaluating players. It could be argued, as I suggested in part 2, that part of the poor performance stems from my assumption of replacement-level productivity. But given that the players affected by that assumption are not on the court much, it doesn’t seem like it should lead to such a big drop in predictive power.
Following that idea, it is clear that the regularization procedure improves on APM. That isn’t surprising: Joe Sill demonstrated in a separate analysis that RAPM is the better measure. But regularization appears to help enough to vault the non-boxscore measure past WP. With only one season of ezPM to compare against, it’s unclear whether RAPM also beats that boxscore metric.
Why might RAPM do so well? The main feature of RAPM is that it moves all players closer to average. In 2010, RAPM placed about 55% of players between -1 and 1 points per 100 possessions. ezPM100, on the other hand, has a much wider range of points-per-100-possessions values and placed only about 26% of players in that -1 to 1 range in 2010, and it still did well. Assuming I converted correctly, WP places 20% of players in an equivalent productivity range (again using 2010 so as to compare the metrics on the same data). So simply assuming that many players are average is not sufficient, since ezPM did pretty well and WP isn’t too far off. And given that RAPM is a black box, it’s hard to say what exactly it’s picking up on.
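To make the shrinkage idea concrete, here’s a minimal sketch, using made-up stint data rather than anything from the contest, of how ridge regression (the regularization behind RAPM) pulls noisy plus/minus estimates toward zero, i.e., toward league average, while ordinary least squares (plain APM) does not:

```python
# A toy comparison of unregularized APM vs. ridge-regularized RAPM.
# All data here is invented: X encodes which hypothetical players are on
# the court for each stint (+1 home, -1 away, 0 off), and y is the stint
# point margin per 100 possessions.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n_stints, n_players = 500, 30
X = rng.choice([-1, 0, 1], size=(n_stints, n_players), p=[0.15, 0.7, 0.15])
true_skill = rng.normal(0, 2, n_players)          # true impact per 100 poss.
y = X @ true_skill + rng.normal(0, 12, n_stints)  # stints are very noisy

apm = LinearRegression().fit(X, y).coef_   # plain least squares ("APM")
rapm = Ridge(alpha=100.0).fit(X, y).coef_  # penalized least squares ("RAPM")

# The ridge penalty pulls every estimate toward 0 (league average), so
# RAPM typically places many more players in the [-1, 1] band than APM.
for name, est in [("APM", apm), ("RAPM", rapm)]:
    print(f"{name}: {np.mean(np.abs(est) <= 1):.0%} of players within ±1")
```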
In contrast, ezPM has a boxscore formula, so in theory we could look at why it did well in its one season. I’m sure that many APBRers will want to suggest that rebounds are the main issue. However, ezPM differs from WP virtually across the board in how it weights boxscore stats. It also uses more stats than WP because it’s based on play-by-play data: it knows whether a shot was assisted, whether a free throw was an and-one, etc. Additionally, it incorporates counterpart defense from play-by-play data, so it arguably has a richer picture of player defense than WP uses. It would be interesting to see how Arturo or Ty’s version of WP with counterpart defense would fare in this test. In short, ezPM uses much more information than WP does, and it is hard to say which part of that led to its better performance in retrodicting 2011.
A different kind of explanation was suggested by Guy in the comments to one of my previous retrodiction posts. Wins Produced, as the name suggests, attempts to predict team wins. But as you probably know, point differential is more predictive of team quality and future performance than wins are; converting margins into wins and losses throws away information and adds random noise. RAPM, APM, and ezPM, on the other hand, inherently predict point differential, so they may gain an advantage by skipping the conversion to wins.
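A toy simulation can illustrate the point. This is a sketch with invented numbers (true team qualities and a roughly 12-point game-to-game spread), not the contest data; it just shows that thresholding game margins into wins and losses discards information:

```python
# A toy simulation of Guy's point: converting margins to wins adds noise.
# Team qualities and the 12-point game-to-game spread are invented numbers.
import numpy as np

rng = np.random.default_rng(1)
n_teams, n_games, n_seasons = 30, 82, 200
quality = rng.normal(0, 3, n_teams)  # true average margin for each team

corr_wins, corr_diff = [], []
for _ in range(n_seasons):
    # Each game margin = true quality + game-level noise
    margins = quality[:, None] + rng.normal(0, 12, (n_teams, n_games))
    wins = (margins > 0).sum(axis=1)  # thresholding discards information
    diff = margins.mean(axis=1)       # point differential keeps it
    corr_wins.append(np.corrcoef(quality, wins)[0, 1])
    corr_diff.append(np.corrcoef(quality, diff)[0, 1])

print(f"true quality vs. season wins:        r ≈ {np.mean(corr_wins):.3f}")
print(f"true quality vs. point differential: r ≈ {np.mean(corr_diff):.3f}")
```

Over many simulated seasons, point differential tracks true quality more tightly than win totals do, which is the edge the margin-based metrics inherit.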
A final, potentially uninteresting (yet explanatory) difference is simply the scope these metrics work over. APM and RAPM, by definition, work within the current season and try to explain each player individually. You could use multiyear APM/RAPM, but they would still work at the individual player level, and it’s unclear how much adding seasons really helps (the Joe Sill paper I linked to suggests not much, if at all, for RAPM). Boxscore metrics like ezPM and WP instead treat all players (or at least all players at the same position) as equivalent, and are thus somewhat normative. While ezPM and WP figure out LeBron’s value from what his stats are worth on average for any player, APM/RAPM figure out his value from what he was worth specifically in the situations he was in. I think of it in roughly a fixed-effect versus random-effect way: APM/RAPM care specifically about LeBron, whereas ezPM and WP care only about the numbers LeBron happened to put up. So APM/RAPM may have an advantage in looking at a specific timeframe and at individual players. ezPM doesn’t look at individual players in the same way, but its use of counterpart defense adds this to a certain extent.

Additionally, ezPM is timeframe-based to the extent that some of its weights come from recent seasons. For example, if you look at the defensive weights you see 1.06 show up a lot. Why 1.06? Because that’s the league-average points scored per possession. But that’s only true in the recent past; scoring efficiency in the league used to be much lower.

And this may be where WP loses some ground. Because it is based on a regression covering a large number of seasons, the weights used in WP are normative not only across players but also over time. It’s important to note that WP does adjust for league-average and position-average production, but those adjustments use the same weights; if the relative values of, say, a missed field goal and a steal fluctuate somewhat over time, WP will not pick that up especially well. Thus ezPM may be doing well currently because it is a new measure and the league hasn’t changed much yet; it might do worse explaining earlier seasons, or need adapting in the future, if it used the same weights in every year.
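Here’s a small, hypothetical back-of-the-envelope version of that fixed-weight problem. The 1.06 figure is the ezPM-era league average discussed above; the 1998-99 figure is an approximate league average I’m using for contrast, and the stop total is invented:

```python
# A hypothetical back-of-the-envelope on fixed weights across eras.
# 1.06 is the ezPM-era league average discussed above; 1.02 is an
# approximate figure for the low-scoring 1998-99 season; the stop
# total is invented.
league_ppp = {"2010-11": 1.06, "1998-99": 1.02}
stops = 200  # defensive stops credited to a player over a season

for season, ppp in league_ppp.items():
    print(f"{season}: {stops} stops ≈ {stops * ppp:.0f} points saved "
          f"at {ppp} points per possession")
# Crediting every era at 1.06 would overstate the 1998-99 value by ~8 points.
```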
One last thought has to do with the change in accuracy when metrics were allowed to use actual rookie performance rather than assigning each rookie an average rookie productivity. RAPM tended to do worse with the actual values (as did ezPM, to a small degree, in the one season available), while APM and WP got better. This is an intriguing result. One might conclude that APM and WP tend to overfit the data: they do better at predicting outcomes when known data (e.g., actual performance from that season) is included, an idea that has been proposed before to explain APM’s failures in other tests. Another possible conclusion is the inverse for RAPM: it gains predictive power at the cost of explaining current results. Within a season, rookies are valued alongside veterans who are assumed to be close to average. This may lead to players being slightly misvalued if most players are not in fact close to average, and thus RAPM does worse when using its own rookie values. But across seasons, the assumption of being average acts as a regression-to-the-mean component, which benefits prediction. Given the results of the contest, it appears that the regression to the mean outweighs the potentially mistaken values, but it is something to think about.
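To see why the average-rookie assumption can help prediction even while hurting within-season accuracy, here’s a minimal sketch of an explicit shrinkage estimator. The prior and the constant k are hypothetical; RAPM’s ridge penalty accomplishes something analogous implicitly rather than through a formula like this:

```python
# A sketch of explicit regression to the mean: blend a noisy observed
# rating with a prior of league-average production. The constant k and
# the numbers below are hypothetical; RAPM's ridge penalty does
# something analogous implicitly.
def shrink(observed, possessions, prior=0.0, k=3000):
    """Trust the observation more as the sample grows."""
    w = possessions / (possessions + k)
    return w * observed + (1 - w) * prior

# A rookie who looked like -4 per 100 possessions in limited minutes
# gets pulled most of the way back toward average:
print(shrink(-4.0, possessions=1000))  # -> -1.0
```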
In terms of my own feelings, I’m surprised that RAPM did so well. I’ll certainly give the method more weight than I did in the past, although I’ll continue to wish it weren’t a black box. I’m not surprised that APM did so poorly, although it was nice to see it spelled out in the numbers. Hopefully we can never speak of it again, although since it keeps coming up on ESPN I doubt that will be the case. More broadly, I hope this kind of analysis catches on as a fair method of comparing metrics. When ten people use ten different sets of assumptions as inputs to their predictions, it’s hard to tell why one person ‘wins’ and another ‘loses’. But when the predictions are made as similarly as possible, it becomes easier to tell which metrics are homing in on true player value. I wouldn’t consider this the final word, but it’s a good, reasonably thorough first attempt.