## Retrodiction Contest Part 3: The Conclusions

If you haven’t already, make sure you check out part 1 for the methods and part 2 for the results.  In this post I’ll talk about what the results for the retrodiction contest might mean and some caveats.

One clear result is that APM is not especially accurate in evaluating players.  It could be argued, as I suggested in part 2, that part of the poor performance is my assumption of replacement level productivity.  But, as I said in part 2, given that these players are not on the court much, it doesn’t seem like that should lead to the big drop in predictive power.

Following that idea, it is clear that the regularization procedure improves on APM.  That isn’t surprising, given that Joe Sill demonstrated with another analysis that RAPM is clearly better.  But it appears to help enough to vault the non-boxscore measure past WP.  With only one season of ezPM to compare to, it’s unclear if RAPM is clearly better than that boxscore metric.

Why might RAPM do so well?  The main feature of RAPM is to move all players closer to average.  In 2010, RAPM thought that about 55% of players were between -1 and 1 points per 100 possessions.  On the other hand, ezPM100 has a much larger range of points per 100 possession values and only placed about 26% of players in the -1 to 1 range in 2010, and it did well.  Assuming I converted correctly, WP places 20% of players in an equivalent productivity range (again using 2010 so as to compare the metrics on the same data).  So simply assuming that many players are average is not sufficient since ezPM did pretty well and WP isn’t too far off.  Given that RAPM is a black box, it’s hard to say what exactly it’s picking up on.
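To make the "move everyone closer to average" idea concrete, here is a minimal sketch on simulated data. It is not the actual RAPM code or Joe Sill's procedure. The stint matrix, the noise level, and the penalty `lam` are all made-up assumptions, and `fit_ratings` and `share_near_average` are hypothetical helpers. It only illustrates that adding a ridge penalty to a plus-minus style regression pulls the ratings toward zero relative to an unpenalized fit.

```python
import numpy as np

def fit_ratings(X, y, lam):
    """Ridge fit via an augmented least-squares system.

    lam = 0 gives the unpenalized (APM-style) fit; lam > 0 adds the
    penalty lam * sum(rating**2), which shrinks ratings toward zero.
    """
    n_players = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(n_players)])
    y_aug = np.concatenate([y, np.zeros(n_players)])
    coef, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return coef

def share_near_average(ratings, bound=1.0):
    """Fraction of ratings strictly between -bound and +bound."""
    return float(np.mean(np.abs(ratings) < bound))

# Simulated data: every number below is an illustrative assumption.
rng = np.random.default_rng(0)
n_players, n_stints = 30, 200
true_ratings = rng.normal(0.0, 2.0, n_players)

# Each row is a stint: +1 for five players on one side, -1 for five on the other.
X = np.zeros((n_stints, n_players))
for row in X:
    on_court = rng.choice(n_players, size=10, replace=False)
    row[on_court[:5]] = 1.0
    row[on_court[5:]] = -1.0

# Observed margin per 100 possessions = true effect + heavy noise.
y = X @ true_ratings + rng.normal(0.0, 12.0, n_stints)

apm_like = fit_ratings(X, y, lam=0.0)    # no penalty
rapm_like = fit_ratings(X, y, lam=50.0)  # ridge penalty (arbitrary value)

print("share within +/-1, unpenalized:", share_near_average(apm_like))
print("share within +/-1, ridge:      ", share_near_average(rapm_like))
```

In a real application the penalty would be chosen by cross-validation rather than picked by hand, and the sketch says nothing about which players the real metric moves or why.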

In contrast, ezPM has a boxscore formula, so we could try in theory to look at why it did well in its one season. I’m sure that many APBRers will want to suggest that rebounds are the main issue. However, ezPM differs from WP virtually across the board in how it weighs different boxscore stats. It also uses more stats than WP since it’s based on play-by-play; it knows if a shot has been assisted or not, if a free throw was an and-one, etc. Additionally, it incorporates counterpart defense from play-by-play data, so it arguably has a richer picture of player defense than WP uses. It would be interesting to see how Arturo or Ty’s version of WP with counterpart defense would fare in this test. In short, ezPM uses much more information than WP does, and it is hard to say which part of that led to its better performance in retrodicting 2011.

A different kind of explanation was suggested by Guy in the comments to one of my previous retrodiction posts.  Wins Produced, as the name suggests, attempts to predict team wins.  But as you probably know, point differential is more predictive of team quality and future performance.  Predicting wins adds a certain amount of random noise.  RAPM, APM, and ezPM, on the other hand, are inherently predicting point differential.  Thus they may get a certain advantage by predicting point differential instead of going through the conversion to wins.

One last thought has to do with the change in accuracy when metrics were allowed to use actual rookie performance as opposed to giving each rookie an average rookie productivity.  RAPM tended to do worse with the actual values (as did ezPM to a small degree in the one season available) while APM and WP got better.  This is an intriguing result.  One might conclude that APM and WP tend to overfit the data; they do better at predicting outcomes when known data (e.g. actual performance from that season) is included; that idea has at least been proposed for APM’s failures in other tests.  Another possible conclusion is the inverse for RAPM; it has good predictive power at the cost of explaining current results.  Within a season, rookies are valued alongside veterans that are assumed to be close to average.  This may lead to players being slightly misvalued if most players are not in fact close to average, and thus RAPM does worse when using its own rookie values.  But across seasons, the assumption of being average serves as adding a regression to the mean component which benefits prediction.  Given the results of the contest it would appear that the regression to the mean outweighs the potentially mistaken values, but it is something to think about.

In terms of my own feelings, I’m surprised that RAPM did so well. I’ll certainly give more weight to the method than I did in the past, although I’ll continue to wish that it weren’t a black box. I’m not surprised that APM did so poorly, although it was nice to see it spelled out in the numbers. Hopefully we can never speak of it again, although since it seems to keep coming up on ESPN I doubt that will be the case. And more broadly, I hope that this kind of analysis catches on as a fair method of comparing metrics. When ten people use ten different sets of assumptions as inputs into their predictions, it’s hard to tell why one person ‘wins’ and another person ‘loses’. But when the predictions are made as similarly as possible, then it becomes easier to tell which metrics are homing in on true player value. I wouldn’t consider this the final word, but it is a reasonably thorough first attempt.


### 17 Responses to Retrodiction Contest Part 3: The Conclusions

1. Guy says:

Excellent work, Alex. And a fair-minded writeup. A few miscellaneous thoughts:

It’s not surprising that ezPM outperforms WP. The two are similar in many ways, but ezPM fixes some of the most obvious errors in WP. I don’t think the valuations of the boxscore stats in the two metrics differ as much as you suggest — in most cases they seem pretty similar.

What’s most interesting IMO is the strong performance of RAPM. One advantage of RAPM (and ezPM to a lesser extent) is capturing defensive contributions more than WP can, as you suggest. Another potentially important difference is that RAPM works hard to isolate a player’s impact from the performance of teammates. Ironically, WP fans have long criticized PM metrics for being biased by other players’ performance (which can indeed be a problem) but have downplayed the same danger in WP because of the unspoken assumption that players completely “own” their boxscore stats. But Berri’s own estimate of diminishing returns means that each extra win a player delivers is partially offset by a loss of about .3 wins from other players. We can argue about whether that’s really a function of reducing teammates’ productivity vs. a flaw in the metric that misallocates value, but the result is the same: WP will overestimate a good player’s actual win contribution and exaggerate the damage done by weak players (another reason to think WP’s implied spread of talent is too wide). It appears that RAPM is doing a better job of isolating an individual player’s true impact.

That would also explain why using actual rookie performance improves WP so much. Because a significant part of WP is other players’ work, using the rookie’s actual WP is giving the metric an artificial boost. “Actual WP” and “actual rookie performance” are not necessarily the same thing; if they were, you wouldn’t need to do this test! The fact that RAPM actually loses accuracy when using actual rookie ratings is an interesting puzzle. I’ll have to give that some more thought.

I’ll repeat two old suggestions, in case they interest you:
1) run this analysis on the top 50% of teams in terms of personnel change from prior year, to see if the differences in predictions are more pronounced or any different.
2) give us a “stupid model” benchmark to compare these all to. How well can you predict using only players’ points per minute? Or just prior season MP? How much do these metrics improve on a very simple model like that?

2. Guy says:

I also think your findings on the implied spread of talent are interesting. My own guess is that the reality is somewhere between the very narrow RAPM spread and the much wider spread in ezPM and (especially) WP. After all, we know that the spread among entire teams isn’t that large — I think the SD is around 4 or 4.5 points. That would seem to suggest a SD at the player level of maybe about 2.0 points, or about 38% of players between -1 and +1. (And this is complicated by the fact that player talent is presumably more of a pyramid than a normal distribution.)

In any case, how much variance there is probably matters less than whether a metric actually measures useful contributions to winning. But I do think that the very large spread in the other two metrics is another sign that they are including outcomes not solely due to the contribution of the player being measured. And this seems to be more true of WP than ezPM (and you can probably guess why I think that’s the case).
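The 38% figure above can be checked with a few lines of Python. This is only a sketch under the assumption that player ratings are normally distributed around zero, which the comment itself flags as questionable; `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

def share_within(sd, bound=1.0):
    """Share of a zero-mean normal distribution lying in [-bound, +bound]."""
    dist = NormalDist(mu=0.0, sigma=sd)
    return dist.cdf(bound) - dist.cdf(-bound)

# A player SD of 2.0 points puts roughly 38% of players between -1 and +1.
print(round(share_within(2.0), 3))  # 0.383
```

A skewed, pyramid-shaped talent distribution would give a different share for the same SD, so this is a rough reference point rather than a target.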

• Alex says:

My other reply has some range info. But from 2006-2011, the SD in team point differentials is 4.55. Why does that suggest a player SD of 2?

• Guy says:

The team variance will be the sum of the player variances, if you assume teams are constructed such that players are reasonably independent. So if the player SD were 2, then at the team level the SD would be sqrt (5 * 2^2) = 4.5. This is roughly what we see in baseball: team variance in runs scored is about 9X the player variance. Now, it may not be strictly true in basketball. I suppose the salary cap, for example, might artificially limit the disparity between teams. But I think it gives you a decent ballpark estimate of what a plausible spread of talent is. It seems implausible, for example, that the player SD could be as high as 3 (which would produce a team sd of about 6.7 points).

BTW, this same logic is one reason we can be sure that players with high rebound totals “take” rebounds from teammates. The player SD for rebounds is as large or larger than the team SD. If there wasn’t huge negative correlation among teammates, then the variation among teams in rebounding would be vastly greater than it is.
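The arithmetic in this back-of-the-envelope model is easy to reproduce. The sketch below assumes, as the comment does, five independent full-time slots per team with the same SD in each slot; `team_sd` and `implied_player_sd` are hypothetical helper names, and the 4.55 input is the team point-differential SD quoted earlier in the thread.

```python
import math

def team_sd(player_sd, n_slots=5):
    """Team SD when n_slots independent players each have SD player_sd."""
    return math.sqrt(n_slots * player_sd ** 2)

def implied_player_sd(observed_team_sd, n_slots=5):
    """Invert the same relation: per-slot SD implied by an observed team SD."""
    return observed_team_sd / math.sqrt(n_slots)

print(round(team_sd(2.0), 2))             # 4.47
print(round(team_sd(3.0), 2))             # 6.71
print(round(implied_player_sd(4.55), 2))  # 2.03
```

Unequal minutes, correlated teammates, or non-random team construction would all change these numbers, so they are best read as a ballpark.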

• Alex says:

That would seem to make a lot of assumptions besides independent players. Doesn’t it also assume that the players all get equal minutes? Also, doesn’t adding up the players on a team give you the expected standard deviation for that team across games, not across teams?

And not to whip a dead horse, but the same logic would also imply we should severely curb assists as well.

• Guy says:

No, it gives you the SD for teams that you would get if you randomly assigned all NBA players to teams. If you think it’s obvious that existing teams have more variance (or less) than you’d get with random assignment, then you’d need to adjust for that (but it seems like a decent rough approximation to me).

But yes, I was using a very simple model, in which teams have 5 full time players. So really what it says is the SD for productivity teams get from each position, not each player, is around 2 points. You could obviously develop a more complex model. If you thought that the talent spread for bench players was rather narrow, for example, then that could mean the spread for starters is a bit larger. So maybe the SD for starters could be 2.5? Anyway, I was just suggesting this as a rough sanity check for the metrics. Would it be easy for you to report the SD for each metric for players with non-trivial number of MP?

And yes, the same logic could apply to assists. Clearly, some players (mostly PGs) are given vastly more opportunities to record assists than other players. That probably does mean you need to evaluate PGs (especially) in comparison to what a replacement level PG would give you, rather than just assigning X value to all assists. A parallel situation is the pitcher in baseball. He runs the defense like a PG runs the offense, but with even more control. Some defensive accomplishments can be done ONLY by the pitcher (e.g. a strikeout). And even the worst pitcher will get some strikeouts (as I would guess the worst PG will get some assists). So we don’t compare a pitcher to other fielders in those categories; we compare them only to a replacement level pitcher. I think Evan would agree that valuing assists properly is one of the bigger challenges in constructing these kinds of metrics.

• Alex says:

2010 is the only year I have overlapping data for each method. If I use a 1000-minute cutoff, that leaves 243 players. The SD in points per 100 possessions for WP (converted from wins), ezPM, RAPM, and APM are 3.32, 2.59, 1.83, and 4.67. The ranges are [-8.69, 11.9], [-6.78, 11.49], [-4, 6.7], and [-13.47, 18.64]. The standard deviations don’t change much if I move the cutoff down to 800 minutes, which adds 38 players.

• Guy says:

Thanks for the data. To me, the SD for APM is clearly too large, and I think WP is too large as well. It’s hard to see how the player SD can be nearly as large as the team SD, unless you believe in very large diminishing returns. And even if you do, then it seems to me you need to incorporate that into your estimate of player values. What’s the point of saying “player X added 12 wins,” if you know that he also took 4 wins away from other players? In fact, that player is going to add 8 wins to your team. RAPM looks like it may regress players too much, but hard to say for sure. All of that said, measuring production correctly is more important than getting the right SD; figuring out the right regression for forecasting is relatively easy once you know how to properly measure productivity.

Have you looked at the distribution of ratings within any of the metrics? What I think it should show is a skewed distribution in which a majority of players have negative values. I’d expect a lot of guys to be between, say, -2 and zero. And I’d expect to see a few +4 and +5 SD players, but less extreme values on the negative side.

• Alex says:

I’ve looked at the plots for 2010, since it’s the only year where they all overlap (presumably talent distribution is fairly stable over years, but who knows). RAPM only places 54% of players below average; ezPM and WP are more like 63%. APM places 69%, but part of that is due to the replacement player level. If you take out my assumed level of performance, 60% of the rest of the group is below average. RAPM places 45% of players between -2 and 0 while the number is 25% for ezPM and 21% for WP and APM. RAPM and APM have longer positive tails while WP has a longer negative tail and ezPM is fairly symmetric and very wide. Neither RAPM nor WP places anyone over 4.5 SD; ezPM goes as high as 11.9. APM is closer to RAPM and WP, particularly if you take out replacement players. RAPM and APM put the worst player at about -2.7 SD while it’s -8 for WP and -12.5 for ezPM. So I guess RAPM sounds closest to what you describe, but I’m still pretty curious about its failure to describe the current season as well as any of the other metrics.

3. EvanZ says:

Great stuff! Sounds like you’re starting to come around on the RAPM, which I appreciate.

I can’t remember now if I suggested you take a look at LambdaPM which is essentially like a blended boxscore-APM metric that tries to optimize both. There’s a thread on APBR, but unfortunately, only 2011 data are available so far.

oh, of course, also happy that ezPM beat WP 😛

• Alex says:

I did see your comment about lambda. I can’t use any metric with only one season, especially since it would have to predict a season that may not be played 😦

4. EvanZ says:

“I’ll certainly give more weight to the method than I did in the past, although I’ll continue to wish that it weren’t a black box.”

One thing that still isn’t clear to me is why you keep insisting that RAPM is a “black box”. I know you understand the basics of the method. It’s been published and made publicly available. It’s no more a black box than WP, which also uses regression to determine the weights of various box score categories. Why is that any less of a black box?

• Alex says:

It’s a black box in that I can’t tell why someone gets the rating they do. As I’ve said in the past, even if you don’t like WP (or ezPM, or Win Shares, or PER…) you could adjust it if you wanted to. If someone thinks that Kevin Love is overvalued because of his rebounds, you can discount it. But if RAPM likes Kevin Love, I have no idea why. Does he set good picks? Does he spread the floor with his inside-outside game? Is it because he rebounds a lot? I have no idea. Someone could create a ‘statistical RAPM’, where they predict RAPM from boxscore or other measures, but I’m guessing that it wouldn’t have great predictive power if APM is any indication. So even if you assume that RAPM is “correct”, you would still be limited in using it for team construction because you don’t know what a player does to get his rating or how two players would overlap or complement each other in their abilities. I guess you could argue that WP is a black box in terms of determining why a point/possession is worth .033, but beyond that it’s transparent and player ratings are obvious as well. I know exactly why Kevin Love gets the value he does for any boxscore-based metric.

• EvanZ says:

“Someone could create a ‘statistical RAPM’, where they predict RAPM from boxscore or other measures, but I’m guessing that it wouldn’t have great predictive power if APM is any indication.”

Daniel’s ASPM does exactly this, although with older APM data, I believe. In fact, his ratings are currently leading on the APBR retrodiction contest, beating RAPM and ezPM.

• Alex says:

I know of Daniel’s model, although I don’t think I’ve ever seen him explain how he picked the values for his ratings. It’s a very complicated model, to say the least. I also haven’t seen retrodiction or current season ‘prediction’ for anything beyond this season, although I expect his current season predictions are fine since he uses a team adjustment.