Guy offered a challenge to me recently which I accepted. It probably wasn’t on par with The Contest or the Battle of Wits (have you heard of Hollinger? Berri? Ilardi? Morons!), but it was fun nonetheless. The challenge was this: Guy would predict player WP48 using their position and rebounds per 48 minutes (R48) while I would predict it using their position and true shooting percentage (TS%). The idea is that if Guy won, it would show that Wins Produced overvalues rebounding. If I won, it would show that it’s perfectly ok (maybe not). The gory details are below, but here above the cut, for all to see, congrats to Guy. But for WP fans, don’t worry; there’s plenty of equivocation below.
Here are the details. I have two seasons (2008 and 2009) of player data from basketball-reference.com that I paired with WP48 and position info from the automated site. I suggested cutting out all the players who had a mixed position listed just for ease of processing, and Guy suggested cutting out anyone who played fewer than 1000 minutes to ensure we were using good estimates (and also because including the low-minute players resulted in weird models, although I didn’t ask what made them weird). This left us with 219 player-seasons for our sample out of a pool of about 880.
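For anyone who wants to try the same sample selection, here is a minimal Python sketch of the two filters. The record layout, field names, and the four player-seasons are all made up for illustration; they are not the actual basketball-reference or automated-site format.

```python
from dataclasses import dataclass


@dataclass
class PlayerSeason:
    # Hypothetical record layout, assumed for this sketch only.
    player: str
    season: int
    minutes: int
    positions: tuple  # ("C",) for one position, ("PF", "C") for a mixed listing
    wp48: float


def select_sample(rows, min_minutes=1000, drop_mixed=True):
    """Keep player-seasons with enough minutes and, optionally, a single position."""
    kept = []
    for row in rows:
        if row.minutes < min_minutes:
            continue  # fewer than the cut-off: estimate considered too noisy
        if drop_mixed and len(row.positions) != 1:
            continue  # mixed position listing
        kept.append(row)
    return kept


# Made-up rows, not real players or real numbers.
rows = [
    PlayerSeason("Player A", 2008, 2400, ("C",), 0.210),
    PlayerSeason("Player B", 2008, 650, ("PG",), 0.050),
    PlayerSeason("Player C", 2009, 1800, ("PF", "C"), 0.120),
    PlayerSeason("Player D", 2009, 1000, ("SG",), 0.090),
]

sample = select_sample(rows)
print([row.player for row in sample])
```

Both filters are parameters, so the same function can be rerun with a different minutes cut-off or with mixed-position players left in.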
Going in, I expected to win because I ran my regression on a bigger set and found that TS% beat out R48. But, in our final sample, R48 did indeed predict WP48 better. Since our dependent measure is the same, we can simply compare R squared values, and TS% has .295 while R48 has .482. I also checked points per shot (PPS) as another measure of shooting prowess (TS% includes an adjustment for free throws taken while PPS does not) and it had an R squared of .304. So Guy won the bet, and congrats to him again, but I was confused. What changed the order of importance of the variables from when I took the bet? I’ll start at the top and work my way down.
If I use the full sample and run the same regressions (position along with one of R48, TS%, or PPS), the R squared values (in that order) are .629, .541, and .421. If I leave position out, the values are .26, .471, and .352. So including position in the model makes a difference; if you left it out, you would conclude that the shooting measures were better predictors than rebounding.
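To make the with-and-without-position comparison concrete, here is a toy version in plain Python. Everything in it is an assumption made for illustration: the ten "players" are invented, there are only two positions instead of five, and WP48 is generated by a formula I picked (rebounding relative to the position average, plus shooting). It is a sketch of the kind of comparison described above, not a reproduction of the actual regressions.

```python
# Made-up toy data, not the real sample.
REB_DEV = [-2, -1, 0, 1, 2]   # rebounds relative to the position average
TS_DEV = [1, -2, 2, -2, 1]    # shooting deviations, uncorrelated with REB_DEV

players = []
for is_center, pos_avg_r48 in ((0, 6.0), (1, 12.0)):
    for d, t in zip(REB_DEV, TS_DEV):
        players.append({
            "center": is_center,          # dummy variable for position
            "r48": pos_avg_r48 + d,       # rebounding depends heavily on position
            "ts": 0.55 + 0.01 * t,        # shooting does not
            # Assumed rating formula: rewards rebounding relative to position.
            "wp48": 0.100 + 0.04 * d + 0.03 * t,
        })


def r_squared(rows, predictors, target="wp48"):
    """R squared of a least-squares fit of target on predictors plus an intercept."""
    X = [[1.0] + [float(row[p]) for p in predictors] for row in rows]
    y = [row[target] for row in rows]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for i in range(k):
            if i != col:
                factor = A[i][col] / A[col][col]
                A[i] = [a - factor * b for a, b in zip(A[i], A[col])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    fitted = [sum(b * v for b, v in zip(beta, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


results = {
    "R48 alone": r_squared(players, ["r48"]),
    "TS% alone": r_squared(players, ["ts"]),
    "position + R48": r_squared(players, ["center", "r48"]),
    "position + TS%": r_squared(players, ["center", "ts"]),
}
for name, value in results.items():
    print(f"{name:>15}: R squared = {value:.3f}")
```

On this toy data TS% beats R48 when position is left out, and R48 beats TS% once the position dummy goes in, even though position by itself explains essentially nothing. The sizes of the numbers mean nothing; only the flip in ordering is the point.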
What if I use the full sample but take out all the mixed position players? Zach Randolph, for example, is listed as part power forward and part center. I suggested taking guys like this out for computational ease, although it turns out that I didn’t need to (R is amazing!). It doesn’t make a big difference; the R48, TS%, and PPS (along with position) R squared values are .645, .51, and .354; without position they are .316, .503, and .342. So we have the same story as we did in the full sample.
What if I take out players with fewer than 1000 minutes but include mixed position players? I get an order of .679, .503, and .503. Shooting takes precedence again if I remove position; the order is .173, .233, and .248. And we know what happens if I take out mixed position players since that was the challenge; if position were removed from the challenge, the order would be .126, .280, and .277 and I would have come out on top.
Choosing 1000 minutes as a cut-off was arbitrary; does it matter if I use a different criterion? I took the full sample and cut out players who averaged fewer than ten minutes per game. The results are much closer: .565, .520, .513 (and of course shooting comes out ahead if position is removed from each model). Removing the mixed position players results in the same pattern. So had the challenge used a different minute cut-off Guy still would have won, but it would have been much closer.
Summary time: using a different minutes-played criterion does make a difference, although not in the final result; however, there’s no ‘best’ way to pick how many minutes a guy should play before we think his stats are reasonably well-estimated. I thought that taking out mixed position players would make analysis easier, although it didn’t matter in the end for either the analysis or the results (although I don’t know what program Guy uses; perhaps it was helpful for him). Instead, the key factor was including position in the models. In the full sample position alone only had an R squared of .1 when predicting WP48 and the regression was not significant, which makes sense since WP48 is standardized to position (the R squared is virtually 0 if mixed position players are removed). If you don’t include position, which is how I think I did my initial check when taking the bet, shooting is a better predictor of WP48 than rebounding. But when position is included, rebounding flies past.
This is odd given that I just said that position barely matters; however, rebounding is fairly correlated with position. Position doesn’t predict TS% at all; the R squared is .157 but not significant. In contrast, position predicts R48 with an R squared of .576. And now we’ve come full circle back to a point I made in my post on missing variables: when variables are collinear, even ones you haven’t put in the model yet, there is a high potential for changing things drastically when you move from model to model. Position and rebounding are each fairly poor predictors of WP48 on their own, but combined they appear to be great. Adding position on top of true shooting raises the R squared, but not by a lot. Another warning sign is that the value of rebounds changes drastically depending on whether position is included. Without position in the model each extra rebound per 48 minutes buys you .02 WP48; with position, it doubles to .043. This isn’t due to any kind of interaction, like rebounds counting more for centers; the model simply thinks that rebounding is more important than it did before. You get a similar effect with assists per 48 minutes because it also correlates with position; the value of an extra assist per 48 minutes doubles if position is in the model. You can also see the collinearity effect if you compare a model with both TS% and R48 (both standardized to allow for comparison) to a model with those plus position. If position is included R48 has the higher standardized coefficient and seems more important; if not, TS% has the higher coefficient and seems more important.
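Here is a small Python sketch of why the rebounding coefficient can jump when position enters the model. It leans on a standard regression fact (the Frisch-Waugh-Lovell result): the coefficient on R48 in a model that also has position dummies equals the slope you get after subtracting each position's average R48. The data are invented, there are only two positions, and the WP48 formula (rebounds relative to the position average, no noise) is my assumption for illustration, not the actual Wins Produced calculation.

```python
from statistics import mean


def slope(x, y):
    """Simple regression slope of y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def demean_within(values, groups):
    """Subtract each group's average from its members."""
    group_means = {
        g: mean(v for v, h in zip(values, groups) if h == g)
        for g in set(groups)
    }
    return [v - group_means[g] for v, g in zip(values, groups)]


# Made-up toy data: five guards and five centers.
position = ["G"] * 5 + ["C"] * 5
r48 = [4.0, 5.0, 6.0, 7.0, 8.0, 8.0, 9.0, 10.0, 11.0, 12.0]
pos_avg = {"G": 6.0, "C": 10.0}
# Assumed rating: only rebounding relative to the position average counts.
wp48 = [0.10 + 0.04 * (r - pos_avg[p]) for r, p in zip(r48, position)]

# Slope with position left out of the model.
pooled_slope = slope(r48, wp48)

# Slope with position "in the model": compare each player to his position average.
r48_within = demean_within(r48, position)
within_slope = slope(r48_within, wp48)

# Share of the variation in R48 that position accounts for.
total_ss = sum((r - mean(r48)) ** 2 for r in r48)
within_ss = sum(d ** 2 for d in r48_within)
position_share = 1.0 - within_ss / total_ss

print(f"slope without position: {pooled_slope:.4f}")
print(f"slope with position:    {within_slope:.4f}")
print(f"R48 variance explained by position: {position_share:.3f}")
```

Without position, the regression has to use the same slope for the big gap between guards and centers (which the assumed rating ignores) and for the small gaps within a position (which it rewards), so the slope gets watered down. With position in, only the within-position gaps are left, and the slope recovers the full 0.04 built into the toy formula.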
This is one of the reasons for the quote popularized by Twain: “there are three kinds of lies: lies, damned lies, and statistics”. A lie can be figured out; you can sometimes tell when someone is being sneaky with a damned lie. But with statistics, sometimes it’s hard to tell what’s going on at all. Different analyses, all equally valid, can give you different answers. Changing your minutes played cut-off changes things a little; the seemingly innocuous decision to include position or not changes things a lot. So in the question of which is more important to WP48 (or PER, or adjusted +/-, or team wins), rebounding or shooting, it’s going to depend on what else is in the model. I think the fairest answer is that they’re probably about equally important. But I’m sure the debate will continue unabated.