NBA Retrodiction Contest Part 3: The Perfect Blend

In the first two posts in this series I looked at how well various NBA productivity metrics can explain what happened and predict what will happen.  But I think at least some people would say that they would never stick to a single measure when examining a trade, for example.  Here I’m going to take that idea seriously.

First I needed to decide what metrics to include in my pool.  I dropped APM quickly because I dislike it.  While PER fared poorly in my analysis, I left it in because I have so many years of it.  In contrast, I had to drop ezPM because there simply aren’t enough seasons of it.  I also dropped old RAPM since the new version is such an obvious improvement on it.  That means that I kept old and new WP, new RAPM, ASPM, Win Shares, and PER.

What did I keep them for?  I wanted to see what combination of the metrics did the best job at explaining what happened and predicting what will happen.  I did it the hard way.

I ran 100,000 iterations of a program that selected random weights (from a normal distribution with mean 0 and standard deviation 1) for each of the six metrics in the sample.  Those weights were then multiplied by each player’s rating on the respective metric, those values were summed, and that was divided by the sum of the weights to make sure the values moved back into a reasonable range.  This created the ‘blend’ rating for each player.  The same weights were also applied to his predicted score on each metric to create a predicted blend rating.  Those ratings were then summed up and compared to actual team performance the same way the existing metrics were checked previously.
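
The search loop described above can be sketched in code. This is a minimal illustration, not the author's actual program: the player ratings, team assignments, winning percentages, and the mapping from team rating to wins are all hypothetical stand-ins, and it runs 1,000 iterations instead of 100,000 for speed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 300 players x 6 metrics (old WP, new WP,
# new RAPM, ASPM, Win Shares, PER), the team each player belongs to,
# and each team's actual winning percentage. The real analysis used
# actual player ratings and team results.
ratings = rng.normal(size=(300, 6))
teams = rng.integers(0, 30, size=300)
actual_wpct = rng.uniform(0.2, 0.8, size=30)

def blend_error(weights):
    """Blend the six metrics with the given weights and score the blend
    by mean absolute error against actual team winning percentage."""
    # Divide by the summed weights to move values back to a reasonable range.
    blend = ratings @ weights / weights.sum()
    # Sum the player blend ratings up to the team level.
    team_rating = np.bincount(teams, weights=blend, minlength=30)
    # Toy mapping from summed team rating to winning percentage.
    predicted_wpct = 0.5 + 0.15 * team_rating / team_rating.std()
    return np.abs(predicted_wpct - actual_wpct).mean()

# 1,000 iterations here for speed; the post ran 100,000.
trials = [(blend_error(w), w) for w in rng.normal(0, 1, size=(1000, 6))]
best_err, best_w = min(trials, key=lambda t: t[0])
```

The same weight vector would be applied twice per iteration in the real version: once to current-season ratings (explanatory error) and once to the predicted ratings (predictive error).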

After that finished chugging, I sorted the results by ‘explaining’ error as well as predictive error.  The best explanatory blend had an error of .107.  Of course, other randomly chosen weights also resulted in low errors.  Relative to how accurate the blends could be, the error increased somewhat quickly so I averaged together the weights for the ten lowest-error blends (as well as eye-balled the individual weights) to see what the common elements were.  It turns out that the best way to explain what happened in a season is to use a good chunk of old Wins Produced, a smaller chunk of ASPM, and a little bit of Win Shares.  New WP and PER are in there a little bit, but RAPM is pretty much absent.
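
Averaging the weights of the lowest-error blends is a one-liner once the results are sorted. A sketch with hypothetical search output standing in for the 100,000 iterations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical (error, weights) results standing in for the random search.
errors = rng.uniform(0.1, 0.5, size=200)
weight_sets = rng.normal(0, 1, size=(200, 6))

# Sort by explanatory error and average the weights of the ten best blends
# to smooth out the noise in any single low-error draw.
order = np.argsort(errors)
top10_weights = weight_sets[order[:10]].mean(axis=0)
```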

Somewhat interestingly, this explanatory blend had a predictive error of about 2.7.  But this was not the best predictive blend possible.  When I sorted by prediction error, it turned out that you could get as low as 2.38, and you could average across the lowest 50 blends and not move too far off that error.  So I did that (the results don’t change much if you average across 10 as I did for the explanatory blend), and the best predictive blend is a good chunk of ASPM, a roughly equal but negative chunk of new WP, a smaller chunk of RAPM, slightly smaller and equal chunks of old WP and Win Shares, and nothing from PER.  Now I can already see some people’s eyes lighting up, thinking about the negative weight on WP.  Keep in mind that these metrics are all correlated with each other and I have both ‘flavors’ of WP involved.  The numbers and their signs would move around if I changed what metrics were in the mix.  But with this set, it looks like the best predictive rating relies on all of the metrics except PER.

To check the numbers more consistently with what I did previously for the existing metrics, I made rounded-off versions of the explanatory and predictive blends.  The explanatory blend was created by adding together .5*old WP48, .35*ASPM, and .15*WS48.  The predictive blend was created with .5*ASPM, -.5*new WP48, .35*RAPM, and .3 each of WS48 and old WP48, then dividing the sum by .95.  I then put those blends through the same code that I used for the previous post with the metrics.  Here are the same tables from my previous posts but with the blend metrics included.  Note that they only go back as far as I have new RAPM, since it’s necessary to calculate the blend.
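
The rounded-off blends translate directly into formulas. A sketch, taking each argument to be a player's per-48 (or equivalent) rating on the respective metric:

```python
def explanatory_blend(old_wp48, aspm, ws48):
    """Rounded explanatory blend: .5*old WP48 + .35*ASPM + .15*WS48.
    The weights sum to 1, so no rescaling is needed."""
    return 0.5 * old_wp48 + 0.35 * aspm + 0.15 * ws48

def predictive_blend(aspm, new_wp48, rapm, ws48, old_wp48):
    """Rounded predictive blend; note new WP48 enters negatively.
    The weights sum to .95, so the total is divided by .95
    to bring it back to scale."""
    total = (0.5 * aspm - 0.5 * new_wp48 + 0.35 * rapm
             + 0.3 * ws48 + 0.3 * old_wp48)
    return total / 0.95
```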

As you can see, the explanatory blend does a better job of explaining what happened than any individual metric.  Similarly, the predictive blend does a better job at predicting than any individual metric.  But they rely on different combinations; explaining what happened is mostly Wins Produced and ASPM whereas predicting what will happen is a fairly equal blend of four metrics contrasted against a fifth; only PER gets left out.

Overall I think there are a couple of key points to take away.  One is that ASPM appears to do a really good job overall.  It describes what happened well, it predicts the next season well, and it contributes a good amount to both of the best blends.  Another is that the different metrics appear to be good at different things.  Wins Produced does a good job at explaining what happened.  While many people malign how WP treats rebounding, the fact of the matter is that rebounds are useful and someone grabbed them.  Comparing new WP to old, we see that it explains what happened slightly worse but makes predictions slightly better.

To make that second point a different way, I’ll make an analogy to the NFL.  People often make the distinction between predictive and narrative stats there.  Fumble recoveries are a good example.  They are fairly random, with teams showing little to no consistent ability to recover fumbles.  But a recovery can’t occur unless a fumble was forced in the first place, and forcing fumbles is somewhat consistent.  Thus fumble recoveries are a poor predictive measure and forced fumbles are preferable.  However, if you tried to talk about why a game turned out the way it did, you would be lost without fumble recoveries.  And since in any given game the actual percentage of forced fumbles recovered by either team can vary wildly, forced fumbles aren’t as good an indicator.  Fumble recoveries are a narrative stat: they tell you what happened and who made it happen.  Forced fumbles are a predictive stat: they tell you what is likely to happen in the future.

The same ideas appear to apply to the NBA metrics, and even their ‘best’ blends.  With that being said, even the ‘bad’ metrics do an ok job.  Despite not being in either blend, PER correlates at .81 with the explanatory blend and .65 with the predictive one.  Only APM and old RAPM have a low correlation with the explanatory blend, and everything correlates fairly well with the predictive blend with the exception of ASPM, which correlates the best due to its high contribution.  The explanatory blend has a correlation of .632 with the predictive one.  So you could get away with using one metric, but…
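
Correlations like these are plain Pearson correlations over player ratings. A sketch with hypothetical vectors standing in for PER and the explanatory blend:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical player ratings: PER and the explanatory blend, generated
# to be positively related the way correlated metrics are in practice.
per = rng.normal(size=300)
explanatory = 0.8 * per + rng.normal(scale=0.5, size=300)

# Pearson correlation between a single metric and a blend.
r = np.corrcoef(per, explanatory)[0, 1]
```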

The final take-away is that using multiple metrics is the best way to go.  If all you had access to was one measure, you could do a decent job (depending on what that measure was).  But so many numbers are publicly available now that there isn’t much of an excuse not to gather as many viewpoints as possible.  For example, as I’m writing this RAPM would tell you that LeBron is only the 8th best player in the league, behind Dirk, Nick Collison, Ginobili, Dwight, Paul Millsap, Chris Paul, and Luol Deng.  Besides the fact that RAPM doesn’t appear to do the best job of explaining current results, does that seem reasonable?  Win Shares has LeBron second behind Ginobili (among players with at least 100 minutes), and of the rest of that group Millsap is next closest at 4th.  WP has Manu first, then LeBron.  ezPM has LeBron at number one.  So a reasonable conclusion seems to be that LeBron has been the best player so far this year, with the possible exception of Ginobili.  Nick Collison, on the other hand, is below average according to ezPM, good but not great by Win Shares, and a bit above average according to WP.  He probably isn’t the second-best player in the league.  However, his high RAPM is a good sign for his productivity next year (tempered by his more average ratings on the other metrics).

Just to end with something a little fun: using my explanatory blend, I can look at the best players of the past ten years by total production.  The top two are LeBron from 2010 and 2009; there’s a good reason the Cavs missed him last year.  Chris Paul takes two of the next three spots with 2009 and 2008, with Shaq’s 2000 in the middle at number 4.  The top ten rounds out with 2004 Kevin Garnett, 2009 Wade, 2011 LeBron, 2010 Wade, and 2005 Kevin Garnett.  The most surprising season, potentially, is Ben Wallace’s 2002 (good for 20th); it wasn’t just WP that liked him.  The worst season goes to Michael Olowokandi in 2000; keep in mind that this is total production.  He got 2500 minutes that year!  His 2001 comes in 4th-worst.  Andrea Bargnani can only look on in awe; his 2011 season came in 5th-worst.


24 Responses to NBA Retrodiction Contest Part 3: The Perfect Blend

  1. Guy says:

    Nice work, Alex. And congrats to Daniel, who appears to have earned bragging rights, at least for now.
    One quibble: I don’t think you can yet conclude that “using multiple metrics is the best way to go.” Couldn’t your method for determining a blended metric be over-fitting the data? That seems likely to me, and would also explain why so many of the metrics end up in the mix. I think you should build your blended metric on two seasons (say, 2008 and 2010), and then see how well it predicts the other two seasons. You might find that it actually isn’t any better than ASPM once you look at out-of-sample seasons.

    • Alex says:

      I could give that a spin. On the other hand, it’s inherently predictive, so it seems unlikely that it’s overfitting that much.

      On a more theoretical level, do you really think that using multiple metrics would be wrong? It seems unlikely to me that one metric would be so accurate, and the others so redundant, that combining them somehow wouldn’t give you additional useful information.

    • Guy says:

      No, I don’t think multiple measures are necessarily wrong — some combination of metrics might prove to be better than any single metric. I’d expect RAPM to contribute something not captured in the boxscore, for example. But I wouldn’t necessarily expect you would gain much by using 4 different boxscore metrics (including 2 versions of WP!), all of which use the same basic information.

      As for the overfitting, I’m not sure what you mean by “inherently predictive.” But you are really only predicting 120 outcomes here. So it seems entirely possible to me that some of the apparent value you are finding in these metrics is just based on accidental relationships.

      Thinking aloud here, is there any way to learn something by comparing which teams are poorly-predicted by any given metric? That is, I imagine some teams are poorly predicted by all metrics (e.g. they have a lot of rookie minutes). But are there teams that are badly over- or under-valued by one boxscore metric, but not others? If so, maybe the characteristics of those teams can tell us something about the best way to value assists, or efficiency, or rebounds. And if we know which teams RAPM excels on, could that shed some light on what plus-minus uniquely captures?

      • Alex says:

        By inherently predictive I mean that it’s always using prior data to predict future data; it’s always 2010 trying to describe 2011. So the data is already ‘out of sample’. I guess it’s possible that the weights I reported are overfit for these past four seasons, and the relationship between prior years and the following year is different for earlier in the 2000s or will be different in the future. Mainly I wanted to point out that these weights can’t come from the typical meaning of overfit, which would be like applying my explanatory blend to the future, and is apparently overfit because it isn’t also the best predictive blend. Either way, I have my program chugging along on only a couple years to see if the best weights are very different, but it takes a while to go through all the iterations.

        I think that’s possible. I don’t have my script set up to save individual team results, just the average across teams in a given year. But I could have it spit individual teams out as well. My guess is that would be a huge enterprise to sort through though. Once you figured out that RAPM underestimated the 2009 Wizards (or whatever), you’d have to look at all the RAPM ratings for the players on that squad as well as their ratings for the previous season to see who differed from expectation, then also do the same thing for the metrics that were better about the Wizards, then see what those players were like. Doable, but quite a project.

  2. EntityAbyss says:

    Hey, just wondering. Due to the standard errors in the different plus minus measures, does that suggest that the players could have a large range of possible productivity, even in the ones that are consistent year-to-year?

  3. ilikeflowers says:

    Alex, we really appreciate your work, thanks for the efforts. I may have missed this in your post, but have you tested this on random sub-samples of the data to address possible over-fitting? I’m guessing that you can’t since you don’t have the actual formulas for some of the metrics.

    • Alex says:

      I haven’t done it on samples within-season for the reason you say, and I also don’t have a database of game-by-game results. But following Guy’s suggestion from an earlier comment, I’m looking at a couple of seasons alone to see if the weights differ from what I found for the larger sample.

  4. EvanZ says:

    Funny enough, even though Millsap is very good according to many different metrics, I still have some trouble believing it. I just don’t watch him enough, I guess.

    • Alex says:

      I don’t see Utah that often either, but he’s averaging a 20 and 10 per 36 minutes, shooting over 58% (TS), kicking in a couple assists and steals. Seems like there’s a lot to like.

  5. Crow says:

    Thanks Alex. Impressive, interesting work.

  6. Crow says:

    My predictions this season were based on a blend of ASPM, RAPM, EZPM, Vegas and Hollinger (but what he used is either not PER or PER modified / enhanced objectively & subjectively) and my own quick subjective adjustments. Doing pretty well at apbrmetrics’ contest, second best to Hollinger at the moment. I probably should have adjusted the blend testing on past seasons but only took 20-30 minutes for the predictions. I’d assume Hollinger and his staff put 100+ hours into it. I would if I was getting paid for it.

    • Alex says:

      Yeah, I really am curious as to how Hollinger makes his predictions exactly, even regarding the trade machine. I find it hard to believe that PER is the basis for anything he does at the team level, even if he apparently has tried to turn it into a points/wins system.

  7. Crow says:

    Discussion of this contest / thread
    If you want to respond Alex or others just want to read other analysis / opinions.

  8. Pingback: NBA Retrodiction Contest: Blend Update | Sport Skeptic

  9. Pingback: Statophile 26 | “Oh My Jonas” and “In Defense of the Ninja” | Raptors Republic | ESPN TrueHoop's Toronto Raptors Blog

  10. Pingback: Statophile 28 | The New New Things | Raptors Republic | ESPN TrueHoop's Toronto Raptors Blog

  11. knarsu3 says:

    How come Basketball Prospectus’ WARP wasn’t included in the retrodiction? I’d be curious to see how that performs.

    • Alex says:

      Just because I don’t have the numbers. Are they publicly available?

      • knarsu3 says:

        The numbers for 08-09 and 09-10 are available on the player pages but I’m not sure if those numbers are WARP or WARP2. They haven’t been updated in a long time. Or if you happen to have the Pro Basketball Prospectus pdfs, the numbers are available there for the last 3 years.

        • Alex says:

          Hmm. I didn’t know those pages existed. It would be a project to gather all the numbers, but maybe something to get done in the future.

          • knarsu3 says:

            It looks like the player pages have the old WARP. I compared a couple players WARP on the player pages with those in the basketball prospectus’ pdfs and there seems to be a difference.

  12. Pingback: Season Previews 2.0: Rebuilding the Prediction Model for 2014-15 | Arturo's Silly Little Stats v2.0
