NBA Database Update

I’ve made a few updates to my NBA player stats database and reposted the sheet on Google Docs here.  I added a few columns, described below, and fixed an error: the new WP numbers were wrong for players with an apostrophe in their name, which should be all better now.  I also realized you don’t have to convert files to Google format, so the entire 2000-2011 dataset is now there in Excel format.

There are four new columns, all at the right edge of the sheet.  The first is PER_EWA, which stands for PER estimated wins added.  You can find this by going to ESPN’s NBA stats page and clicking on the link at the top right that says Hollinger player stats.  The definition is at the bottom; it turns PER into the number of wins added beyond what a replacement level player would produce in the same number of minutes played.  If you prefer it in points, just multiply by 30.  I don’t know where he got the formula from, but at least it’s a way to turn PER into wins or points, which will be valuable for seeing how well it predicts performance.  PER_EWA is a season total; to get a per-48 minute measure, multiply by 48 and divide by minutes played (or multiply by 100 and divide by possessions for a per-100 possession measure).
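The conversions in the last few sentences are simple arithmetic. Here is a minimal sketch in Python; the function names and the example player are made up for illustration and are not columns in the database.

```python
def ewa_to_points(ewa_wins):
    """Convert estimated wins added to points (30 points per win)."""
    return ewa_wins * 30.0


def per_48(season_total, minutes):
    """Rescale a season-total measure to a per-48-minute rate."""
    return season_total * 48.0 / minutes


def per_100_possessions(season_total, possessions):
    """Rescale a season-total measure to a per-100-possession rate."""
    return season_total * 100.0 / possessions


# Hypothetical player: 12 EWA over 2,880 minutes and 5,600 possessions.
ewa = 12.0
print(ewa_to_points(ewa))
print(per_48(ewa, 2880))
print(per_100_possessions(ewa, 5600))
```

The same rescaling applies to any season-total column, which matters when comparing PER_EWA against rate measures like the ones below.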

Next is NBA Efficiency, which I got from the Wins Produced FAQ page.  This is a generic metric created by the NBA (I think) that is about the simplest linear weights combination you can think of: +1 point for good things, -1 point for turnovers and misses.  The downside is that there is no obvious conversion to points or wins, and it is probably overly lenient on shooting efficiency.  The third column is the TENDEX rating.  It’s very similar to NBA Efficiency but also takes away points for fouls and missed free throws (rated at .5).  I should note that I calculated these first three columns myself, so there could be errors.  However, PER_EWA matched up pretty well with the ESPN listing for the players I compared, so I think they should all be correct.  Both NBA Efficiency and TENDEX are per-48 minutes.
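To make the linear weights idea concrete, here is a rough sketch. The weight tables are assumptions read off the verbal descriptions above (in particular, that NBA Efficiency charges a full point for a missed free throw and TENDEX charges half a point plus a full point per foul); they are not the official formulas and not necessarily what is in the spreadsheet. The pace adjustment that TENDEX usually includes is left out.

```python
# Assumed weights, taken from the descriptions in the text rather than
# from any official definition.
NBA_EFF_WEIGHTS = {
    "pts": 1, "reb": 1, "ast": 1, "stl": 1, "blk": 1,
    "tov": -1, "fg_miss": -1, "ft_miss": -1,
}
# Same as above, but missed free throws cost 0.5 and fouls cost 1.
TENDEX_WEIGHTS = dict(NBA_EFF_WEIGHTS, ft_miss=-0.5, pf=-1)


def linear_weights_per48(box, weights, minutes):
    """Weighted sum of season box score totals, scaled to 48 minutes."""
    total = sum(w * box.get(stat, 0) for stat, w in weights.items())
    return total * 48.0 / minutes


# Hypothetical season totals for one player.
box = {"pts": 1600, "reb": 600, "ast": 400, "stl": 100, "blk": 50,
       "tov": 200, "fg_miss": 650, "ft_miss": 100, "pf": 200}

nba_eff = linear_weights_per48(box, NBA_EFF_WEIGHTS, minutes=2400)
tendex = linear_weights_per48(box, TENDEX_WEIGHTS, minutes=2400)
print(nba_eff, tendex)  # 36.0 33.0
```

Because the two weight tables differ only in the foul and missed free throw terms, the two ratings can only diverge for players with unusual totals in those categories.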

Finally I have ASPM, which stands for Advanced Statistical Plus Minus and comes from Daniel/DSMok1.  The general idea with statistical plus/minus is to regress some manner of plus/minus rating (such as APM or RAPM) on boxscore information.  That way, perhaps, you get the best of both worlds: the ‘captures everything’ aspect of plus/minus but estimated from easy-to-obtain boxscore info.  Daniel’s version works from ‘advanced’ boxscore stats like rebound % and usage.  I don’t know if it’s up-to-date, but there’s some description here.  Daniel was generous enough to send me his numbers and they’re in the database now.  I lined them up by hand, so they should all be correct.  ASPM should be per 100 possessions.
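The general statistical plus/minus recipe can be sketched as an ordinary least squares fit. This is only an illustration of the idea, not Daniel's actual ASPM model: the two predictors, the "true" weights, and the data are all invented, and the plus/minus values are generated without noise so the fit is easy to check.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Invented box score rates per player: (rebound %, usage %).
X = [(10, 20), (15, 18), (8, 25), (12, 22), (18, 15), (5, 28)]
# Invented plus/minus ratings built from made-up weights, with no noise.
plus_minus = [-2 + 0.1 * reb + 0.2 * usg for reb, usg in X]

intercept, w_reb, w_usg = ols(X, plus_minus)

# Apply the fitted weights to a player who has box score stats only.
estimate = intercept + w_reb * 14 + w_usg * 21
print(round(estimate, 3))
```

With real data the plus/minus target is noisy, so the fitted weights are estimates, and the quality of the target (APM versus RAPM, how many seasons) matters a great deal.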

With these new measures, I’ll do a quick update of my previous post to see how they line up with the others.  Again, I dropped any player with under 500 minutes for a particular team.  First, a correlation matrix:

In the rows you have all the measures, and in the columns you have the four new guys.  PER_EWA is a season-total measure, so the correlations there might not be so great; the other measures are per-48 or per-100 possessions.  But it generally correlates pretty well with everything except for APM and RAPM.  NBA Efficiency correlates very well with TENDEX, which is unsurprising; the only place they can differ is where a player’s fouls and missed free throws affect his rating dramatically.  Again, the correlation with APM and RAPM is low.  And the story is pretty much identical for TENDEX.  ASPM correlates fairly well with everything, although APM and RAPM are still the lowest.  But it does achieve its goal pretty well; ASPM correlated with APM and new RAPM better than any other boxscore metric (ezPM beats it at old RAPM).
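For anyone who wants to rebuild a table like this from the spreadsheet, here is a small sketch of the two steps: drop player-seasons under 500 minutes, then correlate every measure (rows) with the new ones (columns). The records and values are made up, and only a few metrics are shown.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up player-seasons; names and values are illustrative only.
players = [
    {"minutes": 2900, "PER": 24.1, "APM": 5.2, "TENDEX": 0.78, "ASPM": 4.9},
    {"minutes": 2100, "PER": 16.3, "APM": -1.0, "TENDEX": 0.55, "ASPM": 0.8},
    {"minutes": 1500, "PER": 12.8, "APM": 2.4, "TENDEX": 0.47, "ASPM": -1.1},
    {"minutes": 800, "PER": 18.9, "APM": -3.1, "TENDEX": 0.64, "ASPM": 1.5},
    {"minutes": 310, "PER": 30.0, "APM": 12.0, "TENDEX": 0.95, "ASPM": 9.0},
]

MIN_MINUTES = 500  # the last record above falls under this and is dropped
qualified = [p for p in players if p["minutes"] >= MIN_MINUTES]

row_metrics = ["PER", "APM", "TENDEX", "ASPM"]
col_metrics = ["TENDEX", "ASPM"]

corr = {
    r: {c: pearson([p[r] for p in qualified], [p[c] for p in qualified])
        for c in col_metrics}
    for r in row_metrics
}

for r in row_metrics:
    print(r, " ".join(f"{corr[r][c]:+.2f}" for c in col_metrics))
```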

How predictable are these new metrics?  I looked at that previously using regression.  I’ll do that again, and once again drop old WP and old RAPM and use scaled values.
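As a rough sketch of that procedure: z-score every column, regress one metric on the others, and read off the beta weights and the R squared, then repeat with a predictor removed. The numbers below are invented and this is not the code behind the results that follow, just the general technique under those assumptions.

```python
from statistics import mean, pstdev


def zscore(values):
    """Scale a column to mean 0 and standard deviation 1."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(predictors, target):
    """Return (beta weights, R squared) for OLS on z-scored columns."""
    cols = [zscore(c) for c in predictors]
    y = zscore(target)
    rows = [[1.0] + [c[i] for c in cols] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    coefs = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    ss_res = sum((a - f) ** 2 for a, f in zip(y, fitted))
    ss_tot = sum(a * a for a in y)  # y has mean zero after scaling
    return coefs[1:], 1 - ss_res / ss_tot


# Invented values for eight player-seasons.
per = [15.2, 22.1, 10.4, 18.7, 25.3, 12.9, 16.5, 20.0]
tendex = [0.52, 0.71, 0.40, 0.66, 0.80, 0.45, 0.58, 0.63]
apm = [1.0, 3.5, -2.0, -0.5, 4.0, 0.5, -1.5, 2.0]
nba_eff = [17.0, 24.5, 12.8, 21.9, 27.0, 14.1, 19.3, 21.0]

betas_full, r2_full = fit([per, tendex, apm], nba_eff)
betas_reduced, r2_reduced = fit([per, apm], nba_eff)  # TENDEX taken out
print(betas_full, r2_full)
print(betas_reduced, r2_reduced)
```

Because everything is scaled first, the beta weights are directly comparable across predictors, which is what lets me talk about one metric being several times as influential as another.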

PER’s estimated wins added is very well predicted by other metrics, to the tune of an R squared of .845.  That’s true even if you take out PER itself; you still get .83.  The big winner in terms of influence is TENDEX once PER is out of the way, while WS48, APM, ezPM, and RAPM contribute little.  Since PER EWA is so correlated with PER and is a season-long measure (i.e., influenced by minutes played), I’m going to take it out of the regressions for the other measures.

NBA Efficiency is crazy redundant, with an R squared of .933.  As you might have guessed, TENDEX does a lot of the work here, being almost 4 times as influential as anything else.  If you take it out, the R squared drops to a still-high .841 and PER is far and away the biggest contributor.

No surprise, TENDEX is also highly predictable.  You can actually predict it virtually perfectly from the other metrics; the R squared is .974.  It’s still .938 with NBA Efficiency removed, with the biggest contributor being PER.  The WoW FAQ seems to be accurate when it says that the changes made from TENDEX/NBA Efficiency to PER are mostly superficial.

Finally we have ASPM.  The R squared here is .8651, which makes it more predictable than the other metrics I looked at in the previous post (granted it also has two predictors that the others didn’t).  However, it seems to be a fairly even blend of the others, which was not true of any of the other metrics.  The range in scaled weights is from -.33 (NBA Efficiency) to .783 (PER).  The weakest contributor in terms of the absolute value of its beta weight is APM, followed by ezPM and WP48.  Removing ezPM to allow for more seasons to be used, the R squared stays about the same but PER becomes relatively more important.

Since I now have ASPM included, and it aims to bridge the boxscore and plus/minus worlds, I also wanted to revisit RAPM and APM.  In my previous post I found that RAPM (the new one) could be predicted with an R squared of .77, but most of that was APM; the boxscore metrics alone could only get the R squared to .398.  With the new metrics involved the R squared is up to .805; removing ezPM leaves it at .771.  Somewhat surprisingly, APM and TENDEX are the two biggest predictors.  If I take out APM to leave only boxscore measures, the R squared drops to .462 and the biggest contributor is indeed ASPM.  A similar story plays out for APM, except all the R squared values are smaller.  Interestingly, even though ASPM is the biggest contributor when predicting APM, the R squared values are barely higher than those reported in my previous post.

Quick summary: with four new metrics, each one appears to be a little more predictable than before.  PER, TENDEX, and NBA Efficiency seem to roughly stand in for each other, although at least John Hollinger has provided a connection between his measure and points/wins.  ASPM is interesting in that it is based on advanced box score measures and it aims to estimate non-boxscore measures of productivity.  But the connection between all the measures still seems fairly high.  Enjoy the new data, and don’t forget to let me know if you find an error!


16 Responses to NBA Database Update

  1. For APM, the R^2 are capped by the extremely high random error within the APM sample itself–in other words, there is no way to get a very high R^2.

    I wonder what the split-sample correlations for each measure with itself would be? Do you think you could run the correlations of each with itself, splitting by (perhaps) alternating years? Certainly, there would be some random scatter and aging effects, but it would give a very good idea how stable each metric is, which could be very useful.

    • I’ll try to get up a newer description of ASPM version 2 sometime soon. Changes since that old recovered thread: Better data to regress onto (8-yr equally weighted RAPM, courtesy of Jeremias Engelmann), and some simplifications to the equation (rebounding is now split into ORB% and DRB%, and both are linear).

      Here is the ASPM spreadsheet–anyone is welcome to look at it and work with it (you can put a link in the text): https://docs.google.com/open?id=0Bx1NfCUslJwxM2Q1MzFiMjEtNmY5Mi00ZjgxLWIyOTEtODMzMmM4YmQzMmEx

    • Alex says:

      Part of my retrodiction process is a very simple projection, which will basically let me figure out how each measure correlates with itself over years. That should be roughly what you’re asking.

      • Basically, I’m just interested how much of the lack of correlation comes from sheer instability of the metric. APM, for instance, is extremely unstable, with lots of random noise. I expect Wins Produced to be quite stable, perhaps beyond stability of play for the player, because to some extent the rebounding numbers reflect assignments/role rather than quality.

  2. EntityAbyss says:

    Hey Alex, do you know the year-to-year correlation of the new wins produced model?

  3. Pingback: Consistency of Metrics | Sport Skeptic

  4. wiLQ says:

    “I don’t know where he got the formula from, but at least it’s a way to turn PER into wins or points”
    Just an FYI: http://sports.espn.go.com/nba/columns/story?columnist=hollinger_john&page=PERDiem-090325

    • Alex says:

      It’s nice to read the introduction, but that doesn’t say anything more than the formula does. Why divide by 67? “Sorry, that’s what works”. Works for what? Did he run a regression? Just fiddle with numbers until they added up to something like team wins? If he’s noticed that different positions have different replacement player levels of value, does he need to amend the fact that 15 is average for everyone? Why do you divide points by 30 to get wins? There’s no explanation of the methodology.

  5. Pingback: Investigation of recent snubs in NBA rotations « Weak Side Awareness

  6. Pingback: NBA Retrodiction Contest Part 1: What Happened? | Sport Skeptic

  7. Pingback: Players at heart of disagreement between metrics « Weak Side Awareness

  8. Pingback: Advanced statistics agreed about those NBA players « Weak Side Awareness

  9. Sam says:

    I was wondering if you have thought of or plan on adding salaries for each player in each year. Thanks a ton for the work. Love your blog.

    • Alex says:

      In a perfect world, I’d have all sorts of available info for each player! I don’t think it’s listed on any of the particular pages I snag my data from so far, so no current plans to add it. If I come up with a relatively painless way to add it in, though, I’ll do my best.
