Welcome to the APBR Board!

My retrodiction post apparently got its own thread over at the APBR board.  I would say I’m honored, except it seemed to get treated pretty harshly.  I’ll try to address the issues that were brought up over there.

Going in order of the posts: J.E. wondered where I got old RAPM scores from as far back as 2000 or 2001.  He’s right that they don’t exist; they’re accidentally in my database as 0 instead of NA from 2000 to 2005.  So the errors listed for those seasons should actually reflect what happens if you assume every player is average.

I guess I didn’t really expect people to click back through the links like I asked, but it would have been nice.  J.E., I don’t have older years of RAPM because from about 2006 on back the names are terrible.  I wrote code to line up all these different sources, but they use different naming conventions: the RAPM files sometimes use only a first initial, sometimes no first name at all, and sometimes put the first name second after a comma.  I fixed some by hand, but I ran out of patience after correcting 2006.  I will happily include earlier years if the naming is standardized.  The seasons of new RAPM I do have were downloaded around Christmastime, so unless he’s changed the algorithm since then, they are all current.
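For anyone curious, the kind of cleanup involved looks roughly like this.  This is a hypothetical sketch; the rules here are illustrative, and the real files needed hand fixes beyond anything this simple:

```python
import re
import unicodedata

def normalize_name(raw):
    """Collapse naming variants like "Last, First", "F. Last", and stray
    accents into one canonical key.  Purely illustrative rules."""
    name = unicodedata.normalize("NFKD", raw)
    name = "".join(c for c in name if not unicodedata.combining(c)).strip()
    if "," in name:  # "Last, First" -> "First Last"
        last, first = [p.strip() for p in name.split(",", 1)]
        name = f"{first} {last}"
    name = re.sub(r"\s+", " ", name).lower()
    parts = name.split(" ")
    if len(parts) >= 2:
        # Reduce the first name to an initial so "Kevin Garnett" and
        # "K. Garnett" map to the same key.
        parts[0] = parts[0].rstrip(".")[0] + "."
    return " ".join(parts)

print(normalize_name("Garnett, Kevin"))  # k. garnett
print(normalize_name("K. Garnett"))      # k. garnett
```

Even with something like this, players who share a first initial and last name still have to be resolved by hand, which is where the patience runs out.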

I did make a mistake when I presented the averages; I should have done them over equivalent seasons for the different metrics.  My bad.  But I presented a table with all the individual numbers, so you can do as mystic did and get whatever averages you would like yourself.

I’m not sure that I really buy J.E.’s claim that whole seasons are too coarse a measure. There are, after all, whole other threads dedicated to making season predictions and J.E. is involved in them.  That being said, if I had the game-by-game data to do predictions at that level, and the time to create each different metric for every player at the game-by-game level, I would certainly do that as well.  I don’t know that you would expect the results to be drastically different (could one measure do better game by game but not do better for the season overall?), but I could do it.

As mentioned several times throughout this series of posts, this is the simplest level of prediction possible.  The goal is to put the metrics on an even level, not to be as accurate as possible in making predictions.  So yes, I could use regression to the mean.  I could account for age, or minutes played last year, or any of however many factors influence season-to-season changes in production.  I chose not to.
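For instance, a bare-bones regression-to-the-mean adjustment would look something like this (a sketch only; the prior weight is an arbitrary illustrative constant, not something fit to data):

```python
def regress_to_mean(rating, minutes, league_mean=0.0, prior_minutes=1000):
    """Shrink a player's rating toward the league mean in proportion to
    how little he played.  prior_minutes is an arbitrary illustrative
    constant, not a fitted value."""
    w = minutes / (minutes + prior_minutes)
    return w * rating + (1 - w) * league_mean

print(regress_to_mean(4.0, 1000))  # 2.0 (half weight at 1000 minutes)
```

Any of the metrics could be run through an adjustment like this before making predictions, which is exactly the kind of extra step I skipped on purpose.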

The point about ASPM being based on multiple-year RAPM is a valid one.  It could give ASPM a benefit, since the weights used to calculate the ratings ‘know’ what the connection between box score stats and RAPM will be in future seasons.  Of course, RAPM also uses multiple seasons of data, whereas none of the other boxscore measures do.  So I guess some metrics get their own little advantages.

Moving down to mystic’s post: I do appreciate his cleaning up the average-across-seasons issue.  In terms of using different rookie ratings, again, this was supposed to be simple.  I do think it’s interesting that he thinks the boxscore metrics get an advantage from using the actual values.  Do non-boxscore metrics do a worse job of analyzing rookies?  Why should they not benefit from getting in-sample information?

Adjusting for strength of schedule could change the results, but I don’t know how big of an effect it would be.  Maybe some metrics would benefit more than others by being compared to SRS, I don’t know.

Moving down a bit, J.E. seemed shocked that I would use actual rookie production instead of a general assumption.  This time I indeed only used actual production.  The last time I did this I did it both ways; the results don’t change much beyond the predictions being generally better if you know how the rookies will do (not surprising).  I guess this does break a sacred rule of being a prediction.  On the other hand, again, I don’t know why you would think that knowing rookie production would benefit one metric over the others.

That covers about all the content-related posts so far.  The (current) final one is another by J.E.  He says that many of my replies to comments are unkind.  Let me know if this is true; I try to keep an even tone as much as I can, or to at least respond in kind.  Maybe I’ll start having some warm milk when I check the comments.  As to whether or not I’m making an honest attempt to find the best metric: I would like to see other work making a better attempt.  I make no claims that what I’ve done so far is perfect, but I’ve attempted to be fair and I’ve presented all the details.

Perhaps I have given Wins Produced more benefit of the doubt than others would, particularly at the APBR board, but outside of the mistakes I’ve mentioned I don’t think I’ve done anything out of line.  I was harsh to APM, PER, and the older version of RAPM, but I’m not sure what else to say.  APM did terribly.  I haven’t read anything at the APBR board to make me think that would be a surprising statement.  Even the people who support it, as best I can tell, say that you need to use multi-year APM, and this is single year.  Similarly, I don’t know why it would be surprising that PER would do poorly.  I don’t think that anyone uses it for player evaluation besides ESPN, and I’ve seen the same comment on the board.  As for the previous version of RAPM, with all those mistaken years averaged in it looked bad.  That being said, even in the past four years it wasn’t great.  Regardless, it’s clear that the new version with the previous rating as a prior does wonders for the ratings.  I would think that was clear, since you can’t even find the old ratings on the website any more.  I didn’t think I was saying anything that wasn’t common knowledge.  Maybe I can get some points for not making any ad hominem attacks?  At any rate, I appreciate the feedback.  Hopefully I can improve my future projects.




25 Responses to Welcome to the APBR Board!

  1. Jerry says:

    good post

    The bad names for 2006 and earlier come from basketballvalue (this file http://www.basketballvalue.com/publicdata/players20060928.zip to be exact). If it helps you, I can post the basketballvalue ID with each name, or the team name in addition to each player name. The only names that should be weird, though, are those belonging to players who ended their careers in 2006. Everyone else’s names should be fine.

    “I’m not sure that I really buy J.E.’s claim that whole seasons are too coarse a measure. There are, after all, whole other threads dedicated to making season predictions and J.E. is involved in them.”
    In our newest season prediction thread I very recently made a comment where I said that it’s not a valid way to compare metrics when you don’t have the actual minutes.
    I posted predictions there because it’s fun to make predictions. In the retrodiction thread(s) where I tested various metrics I always presented metric performance at the possession level. If possession level isn’t possible or is unfair, one should definitely go with single games instead.

    “That being said, if I had the game-by-game data to do predictions at that level”
    MPG for each game are available at bbr, possession data for each game is available at bbv.

    “could one measure do better game by game but not do better for the season overall?”
    That’s certainly not impossible.

    “The goal is to put the metrics on an even level, not to be as accurate as possible in making predictions”
    One metric could be very bad without regression to the mean, and great with it. Would you call it a bad metric then? Even when we know it gives the best predictions once its ratings have been regressed to the mean? It’s important not to use future information when deciding how much we want to regress to the mean, though, so year #1 that gets retrodicted should definitely not have any regression to the mean.

    “Of course, RAPM also uses multiple seasons of data, whereas none of the other boxscore measures do. So I guess some metrics get their own little advantages”
    That has been brought up before and the statement always makes me cringe. Every metric is allowed to use as much past information as it wants; it’s not RAPM’s fault that no other metric uses all the data available. You want to knock RAPM because it plays by the rules?

    “I don’t know why you would think that knowing rookie production would benefit one metric over the others”
    It doesn’t matter who benefits most from that, it’s just something not allowed when doing true retrodiction.

    • Alex says:

      Hi Jerry – Sorry your comment got caught in the spam; it lets comments through if the person has had a comment approved in the past but otherwise needs me to check it, and I wasn’t on the site yesterday.

      Having more consistent names or additional information would of course be great. As I said, it only really seems to be a problem in 2006 and moving backwards (I’ve noticed that for 2005 some names are First Last and others are Last, First). So if five years, and soon to be six, is sufficient for everyone else it’s fine for me. For this data set I worked from basketball-reference; I don’t know if they have a ‘preferred’ naming convention. I think each of my sources used something a little different.

      I agree that the minutes issue makes prediction a poor basis for comparing metrics. That’s why I did it the way that I did. I’m also aware that game-level and possession-level data are out there; I just haven’t downloaded them yet. Beyond just getting the data, I would also have to calculate each metric for each player after each game, which would not be trivial. From my understanding of Neil’s recent post, he did game-by-game but used Game Score and similar measures that are not necessarily the metric per se. If I had created an ASPM game score, with slightly different weights and no adjustment to league average and whatnot, I’m sure Daniel would have complained. So I would need to do a fair amount of work, after getting the data, to go through such a prediction. It just isn’t something I’ve gotten myself into yet.

      I suppose it’s possible that regression to the mean could help some measures more than others. I think my next big project is going to be looking at age curves, so after that (and presumably with the current season in the bag) I can go through this again and see if there are big shifts.

      The statement about RAPM is not to knock it, it’s just to point out a caveat. If I had used multi-year APM or if Evan used multiple seasons to create ezPM, I would mention that as well. I’m just pointing out that it gives RAPM a leg up, the same way you thought that ASPM has a leg up.

      • Jerry says:

        For everyone that ended their career before ’05 the names are listed “first name SPACE last name”. The entire list is here http://stats-for-the-nba.appspot.com/PBP/players05.txt
        Let me know if the file is enough for you to line up those names, or if you need me to change the names.
        Most of the players that ended their career in ’06 will have weird names. I’ll try to put the team name in front of these names soon.

        “I would also have to calculate each metric for each player after each game”
        What? Why? I assume that right now you get end-of-season win predictions by multiplying minutes by player rating for each player on one team, leading to some type of “wins created” by each player (I guess in the case of RAPM it’s more “point differential created”), then summing all player “wins created” for one team to get that team’s projected wins, yes?
        All I’m saying is that you should do the same thing, but instead of player minutes for one entire season you enter player minutes for each game, then get a projected game point differential out of this, which needs to be compared with the actual game outcome. This obviously needs to be done with all 1230 games in a season.
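        In code, the scheme is roughly the following.  This is a sketch: the points-per-100-possessions scaling is an assumption (the usual RAPM convention), not necessarily what any given metric uses:

```python
def predict_game_margin(team_players, opp_players, rating):
    """Expected point differential for one game: multiply each player's
    rating by his playing time, sum by team, take the difference.
    Ratings are assumed to be on a points-per-100-possessions scale;
    that scaling is an assumption, not a fixed convention.
    *_players: list of (name, possessions_played) tuples."""
    def team_total(players):
        return sum(rating.get(name, 0.0) * poss / 100.0
                   for name, poss in players)
    return team_total(team_players) - team_total(opp_players)

rating = {"A": 2.0, "B": -2.0}  # hypothetical players and ratings
print(predict_game_margin([("A", 100)], [("B", 100)], rating))  # 4.0
```

        Summing these projected margins over a team’s 82 games recovers the season-level comparison; the 1230 per-game errors are what give the test its resolution.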

        • Alex says:

          Right, but for the prediction to actually be predictive, I would want to know the players’ ratings on each metric going into the game, not for the entire season. If I’m going to predict who should have won the Pistons-Celtics game on January 19 last year, I should use the player ratings based on the data up until January 18th, not their ratings for all of 2011, right? Which means I would have to calculate every player’s rating for each metric after each game (or perhaps each night of games). Then I could multiply that by their minutes played and create a prediction as you say. I believe this is what Neil did for his piece on the Prospectus site, but he used the ‘shorthand’ version of some measures, like Game Score, which are not the same as the actual metrics. It would be much more complex for me to calculate WP or WS, let alone APM or RAPM, running through game by game.

          • Jerry says:

            Well, if you don’t update after each day you’d simply be testing the metrics’ ability to forecast the point differential of all games in the season, instead of whole-season point differential. Yes, it would be ideal if each metric updated after each day, but that doesn’t mean you can’t test the metrics using single games.
            Here’s what I don’t like about whole-season point differential (I’ll try to make my point with a thought experiment): Say you have the perfect metric and one bad metric. Let’s assume they rate everyone on last year’s Celtics team a zero, except the perfect metric gives Garnett a +2 and Davis a -2; the bad metric gives Davis a +2 and Garnett a -2.
            Let’s say they play an average team. Lineup is
            (1) Rondo(0)Allen(0)Pierce(0)Garnett(X)Perkins(0) vs average opponent
            for an entire game. Garnett gets injured for the next game and Davis has to play in his spot
            (2) Rondo(0)Allen(0)Pierce(0)Davis(X)Perkins(0) vs average opponent
            The perfect metric says (1) wins by 2, lineup (2) loses by 2. The bad metric says (1) loses by 2, and (2) wins by 2. Now, (1) does indeed win by 2, and (2) loses by 2, so basically the perfect metric gave a perfect prediction.
            If those games were the only games in the season, both metrics would predict the Celtics to have a season point differential of 0, and both would be right, and you couldn’t tell which one is the perfect metric and which one is the bad metric.
            That’s why it’s best to do this kind of prediction on data sets that include no substitutions, which would mean you’d have to do things per possession. If that’s not possible, it’s best to go with the next most precise measurement, which would be point differential per game. Season point differential is the least precise of them all.
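            The thought experiment can be checked directly; only the game-level errors separate the two metrics (a quick sketch using the numbers above):

```python
# The two-game thought experiment in code.  Ratings are the margin
# contributed by the big man who played; everyone else is rated 0.
perfect = {"Garnett": 2, "Davis": -2}
bad     = {"Garnett": -2, "Davis": 2}

# (big man who played, actual game margin)
games = [("Garnett", 2), ("Davis", -2)]

def game_errors(metric):
    return [abs(metric[big] - actual) for big, actual in games]

def season_error(metric):
    predicted = sum(metric[big] for big, _ in games)
    actual = sum(margin for _, margin in games)
    return abs(predicted - actual)

print(game_errors(perfect), season_error(perfect))  # [0, 0] 0
print(game_errors(bad), season_error(bad))          # [4, 4] 0
```

            Both metrics get the season differential exactly right, yet per game one is perfect and the other is off by 4 every time.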

          • Alex says:

            Good example, although I’ll have to think about how much it relies on such a symmetric situation. I still think using whole-season rating to test each game fails the ‘prediction’ test, in that it isn’t actually a prediction, but might be worthwhile as a stop-gap.

          • Jerry says:

            I don’t get why you think predicting games “isn’t actually a prediction” but predicting season differential is. Both are predictions. You’re just predicting different things, and game differential is better able to identify the better metric.
            Another benefit of doing game differentials is that you can actually test your findings for significance. You sure won’t get p-values <0.05 when comparing two metrics for four years, which leads to four error terms in your case. If one metric gets errors of (2.7, 2.9, 2.7, 2.8) and the other gets (2.8, 3.0, 2.8, 2.9) the difference will never show up as significant because sample size is so tiny. When predicting games you get an error vector with 1230 entries.
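            For example, a paired t-test on matched error vectors could be sketched like this (`paired_t` is a made-up helper, not something from the threads, and the inputs are illustrative):

```python
import math
from statistics import mean, stdev

def paired_t(errors_a, errors_b):
    """Paired t-statistic on matched prediction errors from two metrics.
    With ~1230 per-game errors the test has real power; with four
    season-level errors it has almost none.  A generic sketch."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

print(paired_t([1.0, 2.0], [0.0, 0.0]))  # 3.0
```

            The same function applies to either case; the difference is purely in how many matched errors you feed it.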

          • Alex says:

            I don’t think that predicting the outcome of a game from 2011 using ratings from 2011 is a prediction. That’s all. That’s why I’ve been emphasizing having the rating game-by-game if I were going to do it at that level. If I use 2011’s rating, it has some knowledge of how the games in 2011 play out; it isn’t using only past knowledge. If a player started out hot and then cooled off over the course of a season, I would get very different game-by-game predictions depending on whether I used his overall season rating or a game-by-game rating. I have no argument with game-level providing advantages over season-level per se.

            I could run a non-parametric test, although across four seasons it would still be pretty sparse. I could also run the error across team-seasons, in which case with four years I would have more like 120 points and that would be pretty reasonable. Even at the season level, I can get significant results for the metrics with eight seasons of data if one is better every year. If I wanted to run stats, I have options.

          • Alex, I’m pretty sure Jerry means using data *prior* to the game being predicted.

          • Guy says:

            I may be misunderstanding one of you (or both!), but I think Jerry is suggesting that you would use 2009-2010 metric data to predict game outcomes in 2010-2011 (for example). It sounds like Alex thinks the proposal is to use same-year metrics to predict games within that year, which would not be a true “prediction.”

          • Alex says:

            Yes, that was my point at least.

          • Another possibility is that Jerry means something like leave-one-out prediction, where you determine the ratings for all but one game, and then predict the differential of that game.

  2. Crow says:

    Thanks for the replies, which I have linked to.

    I’ll try to remember to come by more often to see your next works.

    I appreciate the depth, effort and neutrality or near-neutrality.

    I generally try for neutrality or near-neutrality in this debate myself but maybe not always and maybe not always in the eyes of others.

  3. mystic says:

    Alex, thanks for the reply. You shouldn’t be bothered too much by the way the criticism is written. J.E. and I aren’t native speakers, and it seems Germans have a tendency to speak their minds freely without wondering whether someone else could be offended. So sorry if it came across as too harsh; no personal attack was intended, at least by me.

    Anyway, the issue with the rookie ratings seems rather obvious to me. A metric which can say a lot in hindsight but is not a very good predictor will have a big advantage here. For example, WP has a really small error in your hindsight test. By choosing the <100 min data from the current season, it will for sure improve the average error more. Well, you can test my hypothesis pretty simply by replacing those <100 min player values with some sort of average value, like the -1.92 J.E. suggested, and seeing how the different metrics react. I suspect that metrics with a strong explanatory factor in particular will see their average error rise more.

    RAPM indeed has less explanatory power, because it is based on prior seasons. Rookies and players with low minutes will have lower values on average, because the model is not as confident about their value as it is for players with more minutes (well, we should rather say possessions than minutes, but anyway). That will skew the result further. Non-prior-informed RAPM will likely have more explanatory power, and it will probably also show a lower average error for the predictions (it is more regressed to the mean).

    The different numbers of seasons are a big issue. You really can’t compare the 10-year averages with the 4-year average. Outliers in the smaller window will have a much bigger impact than in the bigger one. And as you saw in my quick calculations, it makes a big difference for the conclusion.

    Thanks anyway for putting in the effort.

    • Alex says:

      When I went through this same exercise previously, I ran it with both actual production and a baseline rookie estimate. Some measures did indeed improve while others got worse, but I don’t think it was a large effect or changed the order of how metrics did at all. But it’s certainly true that an effect exists.

      • mystic says:

        You reported the numbers with two digits; I guess the differences were at least of that magnitude. If I see it correctly, you gave all players without an APM value a constant. Why haven’t you done that with all the other metrics too? Meaning, all players who did not have an APM value would get the same constant in the other metrics as well. That gives you a hint of how good APM is among known players.
        I would argue that the metrics with the biggest advantage under your setting for the predictions are those with the biggest difference between explanatory power and predictive power. It would be nice if you could report those numbers for all tested metrics.

        • Alex says:

          I didn’t give them a constant because they have actual, calculated values. APM (or at least the way basketballvalue calculates it) assumes that there’s a baseline level of performance and that everyone under a certain number of minutes performs at that level. Everyone else’s ratings are inherently relative to that level. All the other metrics (even RAPM, though it’s based on the same data as APM) directly estimate each player’s performance. If I gave a group some constant level on the other metrics, it would throw off the ratings for everyone else. The other way to go would be to simply give an APM estimate for every player. My understanding is that this is perfectly doable but adds more collinearity issues. Since that’s already a pretty big problem for APM I don’t know why they don’t just bite the bullet and do it, but at any rate I don’t see a reason to handicap all the other metrics.

          I grabbed the explanatory and predictive error for the last four years for each metric, except APM (only three years of predictions) and ezPM (one year). RAPM had an average difference of 1.92, old RAPM 1.47 (I don’t have an explanatory error for 2011 for old RAPM, but I gave it a favorable number), ASPM 2.26, PER 1.21, Win Shares 2.22, old WP 2.77, and new WP 2.56. So I guess you’re saying that WP has the most advantage from my choices? That doesn’t seem obvious to me though; it has a large spread because its explanatory ability is the best. PER has the smallest spread because it’s bad at explaining and bad, but relatively less so, at predicting. If I do the three years of APM it has the largest spread, but that’s because it sums up close-ish to current performance (only being off because of the replacement level players) but is atrocious at predictions. I don’t think my choices would benefit both PER and old RAPM while penalizing APM and WP, but maybe I’m not seeing something.

          • mystic says:

            Did you use 2yr APM or just APM? And you are using values which are UNKNOWN in order to base a prediction on them. No idea, but especially for a metric like APM, where you have a baseline value for all those players, a comparison seems a bit useless. I don’t want to imply that APM is good or not; I’m just saying that a bigger error does not really surprise me given the circumstances of the test.

            Yes, my assumption is that WP benefits the most from such a procedure. You can easily test my hypothesis by using fixed values for all relevant players. And the fact that you tested it a bit but did not report the numbers seems a bit odd to me.

          • Alex says:

            Just APM, again because the box score metrics only use one year of data. I’m sure that using 2 year APM would make it better at prediction. Is it possible to download 2 year APM somewhere? I didn’t see it in what I got from basketballvalue, unless the APM listed there is actually 2 year and not 1.

            I’m not sure what you mean by ‘tested it a bit’. I only ever used actual values for players when available, with the exception being how I treated rookies in a previous run-through of this analysis. The main piece of that is here: https://sportskeptic.wordpress.com/2011/06/30/retrodiction-contest-update-part-2-the-results/. That was done on a data set gathered in a different way, so I can’t promise that what I did then and what I did now are perfectly parallel.

            Could you spell out how what I did would specifically help WP more than the other measures? Why wouldn’t Win Shares or PER gain the same benefit?

  4. Just a couple of quick comments:

    As an admin of the APBR board, I apologize for unpleasant remarks over there. J.E. was being a bit harsh with his language. I appreciate your effort and transparency in this undertaking.

    I wanted to make a quick point regarding the “explaining what happened” part of your analysis. I believe you were comparing to team efficiency differential? If so, ASPM, for one, would have an error of 0 if I summed to that. I don’t, though; I sum to SoS-adjusted team efficiency differential, so the error reported is artificially high because of that discrepancy.

    • Alex says:

      Good point. I think others have made a similar point about APM or RAPM, since they inherently take opponent into account. All of the measures have a little bit of artificial inflation just due to rounding; WP should obviously sum to 0 as well.

  5. EvanZ says:

    Alex, did you have a post where you made a correlation table for all the different metrics you’re looking at? That would be interesting to see.

    • Alex says:

      Yes, I had it here https://sportskeptic.wordpress.com/2011/12/28/nba-metrics-can-we-all-get-along/ . Some of the numbers may be off because I had 0s instead of NAs for old seasons as mentioned. I also didn’t have all of the metrics at that point. Maybe an update with cleaned-up numbers would be in order.

      • Thanks! That’s also where you had some PCA. It’s interesting that old WP48 correlates better with new RAPM than new WP48 does. It looks like new WP48 had the worst correlation to new RAPM.

        • Correlations to RAPM
          (Weighted R, using RAPM over 03-11 and comparing to weighted average of each metric over that time, for all players over 3000 possessions, weighting by possessions):

          Listing correlation to RAPM, ORAPM, DRAPM

          PER: 0.676, 0.744, 0.128
          Win Shares: 0.729, 0.774, 0.671
          Wins Produced: 0.616, 0.466, 0.376
          ASPM: 0.792, 0.857, 0.702

          Note: ASPM was created to maximize this very correlation, so it is most definitely overfit.
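          For reference, a possession-weighted correlation of this sort can be computed roughly as follows (a generic sketch, not the exact calculation behind the table above):

```python
import math

def weighted_corr(x, y, w):
    """Weighted Pearson correlation: metric ratings x against RAPM y,
    weighted by possessions w.  A generic sketch, not the exact
    calculation behind the numbers above."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(vx * vy)

print(round(weighted_corr([1, 2, 3], [2, 4, 6], [1, 1, 1]), 3))  # 1.0
```

          With equal weights this reduces to the ordinary Pearson correlation; the possession weights simply keep low-minute players from dominating the comparison.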
