## The Dangers of Missing Variables

This post falls under the heading of stats learning time, but it’ll also be applied to basketball.  Hooray!  A warning: this is a long one, so you might want to grab some food.

A missing variable is one that should be included in a regression but was not.  This causes the model to be mis-specified, which means that both the estimated regression weights and their standard errors could be wrong.  This is especially problematic when the variables are collinear, meaning that the predictors are correlated with each other as well as with the variable of interest.  The problem applies even to simple correlations between two variables.  In Wages of Wins, David Berri gives the example of offensive rebounds.  It turns out that at the team level, offensive rebounds are negatively correlated with wins; the more offensive rebounds a team gets, the fewer games it wins.  If we took this at face value, Berri says, coaches might be tempted to tell their players to actively avoid getting offensive rebounds.  Of course, this would be foolish; I think anyone who’s played basketball knows that offensive rebounds are good since they mean your team has the ball and can try to score again.

The problem with this correlation is that it is missing important variables.  A simple correlation is the same thing as a simple regression; we could run a regression
model of wins = B0 + B1*off. rebounds and it would tell us the same thing as that correlation.  In fact, if we scaled wins and offensive rebounds to Z scores, the slope B1 would be the correlation coefficient we saw before.  The problem is that many other things are important to winning and they have been left out of the model; worse, many of these other variables correlate with offensive rebounds.
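The claim that the slope on Z scores equals the correlation coefficient is easy to check numerically.  The snippet below is a minimal sketch on made-up data (the -0.3 relationship and the sample size are arbitrary assumptions, not real team numbers); it fits a simple regression on standardized variables and compares the slope to the Pearson correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data for illustration only: 200 fake team-seasons.
off_reb = rng.normal(size=200)
wins = -0.3 * off_reb + rng.normal(size=200)

def zscore(x):
    """Center to mean 0 and scale to standard deviation 1."""
    return (x - x.mean()) / x.std()

# Simple regression on Z scores; polyfit returns [slope, intercept] for deg=1.
slope, intercept = np.polyfit(zscore(off_reb), zscore(wins), deg=1)

# Pearson correlation between the raw (unscaled) variables.
r = np.corrcoef(off_reb, wins)[0, 1]

print(f"slope on Z scores: {slope:.4f}")
print(f"correlation:       {r:.4f}")
```

The two printed numbers should agree up to floating-point error, and the intercept should be essentially zero because both variables have been centered.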

I’ll illustrate some of the problems.  Let’s say that I know two variables correlate with each other.  To keep it somewhat related to basketball, I’m going to call these variables rebounds (REB) and field goal percentage (FG).  I can create fake data to represent them by pulling from a multivariate normal distribution like I did in my post on consistency.  To keep things simple I’m going to create normalized variables by setting the mean value to 0 and the variances to 1; the covariance can be any number from -1 to 1 (in this case it’s the same thing as the correlation).  Having made my fake data, I can use it to create a third variable, wins.  Just to pick out numbers, I made the equation wins = .7*FG + .4*REB + E, where E is a random error term with mean 0 and standard deviation .1.  That means that for each of my fake teams I take their FG and REB values and combine them as above along with a random error term to insert a little noise, and that gives me that team’s win total.  Question 1: can regression still figure out what’s going on?
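For anyone who wants to follow along, the fake data described above could be generated roughly like this.  It is a sketch using NumPy rather than the code actually used for the graphs; the function name `make_fake_teams`, the seed, and the default arguments are assumptions chosen to match the description (means 0, variances 1, weights .7 and .4, error standard deviation .1).

```python
import numpy as np

def make_fake_teams(rho, n=1000, b_fg=0.7, b_reb=0.4, noise_sd=0.1, seed=0):
    """Draw correlated FG and REB, then build wins = b_fg*FG + b_reb*REB + E."""
    rng = np.random.default_rng(seed)
    # Unit variances, so the covariance rho is also the correlation.
    cov = [[1.0, rho], [rho, 1.0]]
    fg, reb = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    # E: random error with mean 0 and standard deviation noise_sd.
    wins = b_fg * fg + b_reb * reb + rng.normal(0.0, noise_sd, size=n)
    return fg, reb, wins

fg, reb, wins = make_fake_teams(rho=0.5)
print("sample correlation between FG and REB:", np.corrcoef(fg, reb)[0, 1])
```

The sample correlation will not be exactly the requested value, but with 1000 observations it should land close to it.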

Yes it can.  This is a graph of the regression weights from wins = B0 + B1*FG + B2*REB across all values of correlation between FG and REB (I created 5 data sets at each
level of correlation to get some info that way as well, and I took out -1 and 1 because the regression fails with complete collinearity).  The top set of dots is for FG and the bottom set is REB; you can see that there’s a little noise, but they do a pretty good job of finding the values I assigned regardless of how much collinearity there is.  Maybe that’s just because I put in too little noise?  Here’s the same graph but if I set the random error term to have standard deviation .5:

The data is indeed noisier but things are still pretty well centered on .7 and .4.  It gets a little dicey with near-total collinearity, as it should, but with as much data as I made (1000 observations, or more than 30 ‘seasons’ for 30 ‘teams’) it still does a good job.  So the multivariate regression clearly can figure out the relationship between REB, FG, and wins; both are positive.  How about the simple correlations between REB and wins or FG and wins?
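A stripped-down version of that check might look like the following.  It is a sketch, not the original analysis: it uses one data set per correlation level instead of five, only a handful of correlation values, and an ordinary least squares fit via `numpy.linalg.lstsq`; the helper name `fit_weights` is made up for this example.

```python
import numpy as np

def fit_weights(rho, noise_sd, n=1000, seed=0):
    """Simulate wins = .7*FG + .4*REB + E and return the fitted (B1, B2)."""
    rng = np.random.default_rng(seed)
    fg, reb = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
    wins = 0.7 * fg + 0.4 * reb + rng.normal(0, noise_sd, size=n)
    # Design matrix with an intercept column for B0.
    X = np.column_stack([np.ones(n), fg, reb])
    coefs, *_ = np.linalg.lstsq(X, wins, rcond=None)
    return coefs[1], coefs[2]

results = {}
for rho in (-0.9, -0.5, 0.0, 0.5, 0.9):
    results[rho] = fit_weights(rho, noise_sd=0.5)
    b_fg, b_reb = results[rho]
    print(f"rho={rho:+.1f}  B1 (FG)={b_fg:.3f}  B2 (REB)={b_reb:.3f}")
```

Even at the noisier setting, the fitted weights should stay near .7 and .4 at every correlation level, with more scatter as the correlation approaches -1 or 1.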

The two graphs above are the simple correlations between REB and wins and FG and wins at each level of correlation between FG and REB, one graph for each level of noise I described before.  The top lines are for FG in each case and the bottom set is REB; I included a straight line on both graphs just for reference.  In each case the simple correlations suggest that FG is more important, but the value of the correlation between FG and wins varies depending on how FG covaries with REB.  In the more noise case, the correlation goes from around .5 to around .8.  In the less noise case, there isn’t even a monotonic relationship; the correlation drops and then rises back up.  The situation is worse for REB; it can take on a wide range of values, including negative ones.  This is despite the fact that both FG and REB have a consistent underlying connection to wins, and both are positive.  The regression ‘knows’ this because it has both pieces of relevant information; the correlation fails because it has only one.  How much it fails depends on how much the missing variable covaries with the included variable.
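The shape of those curves can be worked out directly.  This is a rough derivation under the simulation's assumptions: FG and REB have variance 1 and correlation $\rho$, and E is independent noise with standard deviation $\sigma$ (the symbols $\rho$ and $\sigma$ are introduced here for convenience).

$$
\begin{aligned}
\mathrm{Var}(\text{wins}) &= .7^2 + .4^2 + 2(.7)(.4)\rho + \sigma^2 = .65 + .56\rho + \sigma^2 \\
\mathrm{Corr}(\text{FG}, \text{wins}) &= \frac{.7 + .4\rho}{\sqrt{.65 + .56\rho + \sigma^2}} \\
\mathrm{Corr}(\text{REB}, \text{wins}) &= \frac{.4 + .7\rho}{\sqrt{.65 + .56\rho + \sigma^2}}
\end{aligned}
$$

Both correlations depend on $\rho$ even though the weights .7 and .4 never change.  The numerator for REB, $.4 + .7\rho$, turns negative once $\rho$ drops below $-4/7$ (about -.57), which is how a variable with a positive true weight can end up with a negative simple correlation.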

If I run the same simulation but with the variables more evenly related to wins, the problem remains.  I changed the wins equation to be wins = .55*FG + .5*REB.  Here
are the graphs for the regression coefficients and the simple correlation coefficients.

The first graph shows that the regression does a decent job of picking out the true weights, although with the high noise and the close weights things are noisy.  The second graph shows that the correlation values do tighten up since both variables are similarly related to wins, but they still take on a wide range of values, from virtually unrelated to strongly positively related.

As I said before, the simple correlation is similar to a simple regression.  Here’s a graph that plots the weight for FG in the multiple regression (should be the same dots as the graph above, centered at .55) and the weight for FG from a simple regression.

That would be the line of dots that starts near 0 and rises toward 1.  We know the actual value should be in the neighborhood of .55 (it will vary due to the random noise), so it is clear that the simple regression (and simple correlation) give very misleading impressions of the relationship between FG and wins (as well as REB and wins when we look at that
variable).  In fact, the weight is only correct when the correlation between FG and REB is near 0, that is, when there is no collinearity.  Collinearity leads to what is called ‘bouncing betas’; the regression weights (usually labeled with the Greek letter beta, though I used Bs above) change, sometimes dramatically, depending on what variables are included in the regression.  Since correlations are simple regressions, they fall prey to the same issue.
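The rising line of dots has a simple explanation.  With standardized predictors, leaving REB out of the model pushes its effect onto FG in proportion to their correlation, so the simple-regression weight for FG should be about .55 + .5*r, where r is the correlation between FG and REB.  The sketch below checks this on simulated data; the noise standard deviation of .5 and the helper name `fg_weights` are assumptions made for this example.

```python
import numpy as np

def fg_weights(rho, n=1000, noise_sd=0.5, seed=0):
    """Return the FG weight from a simple and from a multiple regression."""
    rng = np.random.default_rng(seed)
    fg, reb = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
    wins = 0.55 * fg + 0.5 * reb + rng.normal(0, noise_sd, size=n)
    ones = np.ones(n)
    # Multiple regression: wins on FG and REB together.
    multiple, *_ = np.linalg.lstsq(np.column_stack([ones, fg, reb]), wins, rcond=None)
    # Simple regression: wins on FG alone, with REB left out.
    simple, *_ = np.linalg.lstsq(np.column_stack([ones, fg]), wins, rcond=None)
    return simple[1], multiple[1]

rows = []
for rho in (-0.9, -0.5, 0.0, 0.5, 0.9):
    simple_b, multiple_b = fg_weights(rho)
    expected = 0.55 + 0.5 * rho  # true FG weight plus the REB weight leaking in
    rows.append((rho, simple_b, multiple_b, expected))
    print(f"rho={rho:+.1f}  simple={simple_b:.3f}  "
          f"multiple={multiple_b:.3f}  .55+.5*rho={expected:.3f}")
```

The multiple-regression weight should hover around .55 at every correlation level, while the simple-regression weight tracks .55 + .5*r and matches the true value only near r = 0.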

So, with all this in hand, we can turn to commenter Guy’s question about the disconnect between the correlation of rebounding and wins at the team level and rebounding
and wins produced at the player level.  I got team level data from Arturo’s post here.  This data doesn’t have field goal percentage, or other percentages, because they aren’t used in the calculation of wins produced, but I created true shooting percentage (which basketball-reference.com defines as points divided by 2*(FGA + .44*FTA); it is meant to give credit for threes and free throws).  Now I can correlate total rebounds with the other team-level statistics.  Here are some of the interesting numbers: .294 with wins, .427 with field goal attempts, .372 with opponent field goal attempts, and -.124 with true shooting percentage.  I think these are all fairly intuitive, although I would have guessed a larger negative correlation with the shooting percentage (of course, these correlations are subject to the same interpretation issues I showed in the first half of the post).  So there are some collinearity issues.
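For reference, the true shooting percentage formula quoted above translates directly into code.  The team totals in this sketch are invented purely to show the arithmetic.

```python
def true_shooting_pct(points, fga, fta):
    """True shooting percentage: points / (2 * (FGA + .44 * FTA))."""
    return points / (2 * (fga + 0.44 * fta))

# Hypothetical season totals for one team (not real data).
ts = true_shooting_pct(points=8000, fga=6500, fta=2000)
print(f"TS% = {ts:.3f}")
```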

Next I ran a regression of wins on each of the own-team stats that are in wins produced with each stat scaled to Z scores to allow for comparison of the weights.  Turnovers have the largest absolute weight at -.617, followed by defensive rebounds (.522), field goal attempts (-.478), and steals (.302).  If I put in true shooting percentage on top of the other stats, it takes on the highest weight followed by field goal attempts, turnovers, and
offensive and defensive rebounds.  So it appears that at the team level, the most important performance factors are shooting accurately, not turning the ball over, and rebounding, in that order.
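The scaled regression used here can be sketched as follows.  Every variable is converted to Z scores before fitting, so the weights are in standard-deviation units and can be ranked against each other.  The data below is synthetic, and the means, spreads, and effect sizes are arbitrary assumptions rather than the team data from the post.

```python
import numpy as np

def standardized_weights(X, y):
    """Regress z-scored y on z-scored columns of X; return the slope weights."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    design = np.column_stack([np.ones(len(yz)), Xz])
    coefs, *_ = np.linalg.lstsq(design, yz, rcond=None)
    return coefs[1:]  # drop the intercept

rng = np.random.default_rng(0)
n = 300
# Synthetic team stats on very different raw scales (made-up numbers).
turnovers = rng.normal(1200, 80, n)
def_reb = rng.normal(2500, 120, n)
steals = rng.normal(600, 50, n)
wins = (41 - 0.08 * (turnovers - 1200) + 0.04 * (def_reb - 2500)
        + 0.03 * (steals - 600) + rng.normal(0, 3, n))

names = ["turnovers", "defensive rebounds", "steals"]
weights = standardized_weights(np.column_stack([turnovers, def_reb, steals]), wins)

# Rank predictors by the absolute size of their standardized weight.
for name, w in sorted(zip(names, weights), key=lambda pair: -abs(pair[1])):
    print(f"{name:20s} {w:+.3f}")
```

Without the scaling step, the raw coefficients would mostly reflect the units each stat is measured in, which is why the comparison is made on Z scores.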

How about the player level?  I got player data from basketball-reference.com and the automated wins site for every player from the past two seasons (not the ongoing season).  I combined player data from the automated site if the player was traded during the season, so the WP48 numbers may be slightly off for them due to rounding.  In this data set, total rebounds correlates .418 with WP48, .728 with wins produced, .773 with Win Shares (since I got the data at basketball-reference.com), .37 with field goal percentage, and .358 with true shooting percentage.  What’s particularly interesting here is that rebounds correlates pretty highly with most of the statistics, and that’s of course because players who play a lot of minutes get a lot of all the stats.  Similarly, if I look at what correlates with wins produced, the top number is Win Shares, then rebounding, free throws, points, field goals, and minutes played.  WP48 has a .557 correlation, but is 15th on the list.  All the
percentages bring up the rear (except for age, which, interestingly, is nearly uncorrelated with wins produced).  All of the count statistics correlate highly with each other, wins produced, and Win Shares because of the playing time issue.

If I adjust all the count stats to per 36 minutes played values, true shooting percentage is the top correlate for WP48 followed by field goal percentage, wins, rebounds, and Win Shares.  So it looks like shooting well is key for having a good WP48.  Looking at total wins produced, the order is Win Shares, WP48, games started, points, free throws, and true shooting percentage.  Total rebounds per 36 minutes only has a .176 correlation with wins produced.  So far, what we’ve seen is that rebounding has a variety of correlations with wins and wins produced depending on whether we’re at the team level or the player level, or the player level adjusted for playing time (as is true for most of the count statistics).  Given what we saw before, these varying correlations alone should be a warning that their values may be misleading or incorrect.
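The per-36 adjustment itself is just a rescaling of each count stat by minutes played.  A minimal sketch, with two invented players to show why it matters:

```python
def per_36(stat_total, minutes_played):
    """Rescale a season total to a per-36-minutes rate."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return 36 * stat_total / minutes_played

# Hypothetical players: a heavy-minutes starter and a low-minutes reserve.
starter = per_36(stat_total=800, minutes_played=3000)
reserve = per_36(stat_total=300, minutes_played=900)
print(f"starter: {starter:.1f} rebounds per 36, reserve: {reserve:.1f} rebounds per 36")
```

The starter has far more total rebounds, but the reserve rebounds at a higher rate, which is the kind of playing-time effect that inflates correlations among raw count stats.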

Of course, we spent all this time discussing how simple correlations are not to be trusted.  So I ran the scaled regression described above but at the player level with wins produced as the dependent variable.  Of the 12 independent variables, defensive rebounds are 4th and offensive rebounds 10th, behind main drivers field goal attempts, two point field goals made, and three pointers made.  So scoring appears to be the main driver of wins produced.  If I add true shooting percentage, the order is unchanged because true shooting percentage has virtually no effect on wins produced when the other variables are in the model.  Doing the same thing for WP48, the order is total field goal attempts, total two point shots made, and three pointers; defensive rebounds fall to 7th most important.  Adding in true shooting percentage puts it in third behind field goal attempts and two point shots made.  Finally I ran the regressions with per-36 minute values as the predictors.  Defensive rebounds are the number one predictor of wins produced (finally Guy is vindicated!), but if true shooting percentage is added to the model they drop to fifth behind scoring variables.  Looking at WP48, scoring variables dominate regardless of whether true shooting percentage is included or not.

That was a lot of verbal descriptions of data.  Here’s a summary.  When collinearity is present, correlations should not be trusted.  Even so, rebounding does not correlate highly with wins produced or WP48 unless you look at total number of rebounds, which is then inflated due to playing time.  If you look at Win Score, which doesn’t use the much-maligned values for rebounding that wins produced does, defensive rebounds have a pretty high .804 correlation, so if one were to make the argument that rebounds affect wins produced too much, they would also have to argue that it isn’t due to the weighting of rebounds in the wins produced calculation.  In fact, wins produced and Win Score have a .892 correlation.  As mentioned before, this is because players who play a lot accumulate more wins (or Win Score points) even if they aren’t necessarily efficient.  Returning to the main issue of why the relationships may differ at the team and player level, I think it’s because there are a variety of steps taken before determining a player’s wins produced.  It isn’t a simple weighting and summing of box score stats; production is then adjusted for playing time, teammate productivity (to account for diminishing returns), team defense (where everyone gets credit for generating missed shots by the opponent), and position played (all of this is illustrated here).

Instead of looking at correlations, one should look at weights from a scaled regression; this allows for variables to be placed on equal footing and fairly compared.  Using only the box score stats that go into wins produced, a rebounding variable is only the most important driver in a scaled regression when you predict total wins produced from per-36 minute stats.  However, even this goes away if true shooting percentage is added to the model.  Across all the regressions I checked, the results show that wins produced and WP48 are driven by scoring variables.


### 7 Responses to The Dangers of Missing Variables

1. EvanZ says:

Have you thought about looking at the correlation of the “Four Factors” and WP?

2. Guy says:

Wow, that is a lot of words. Let’s try to narrow the discussion by looking only at per-minute stats — otherwise playing time makes it very hard to interpret findings. Could you post the independent variables you used and the resulting coefficients for whatever you consider to be your best model for predicting WP48?

Then you need to take one important step that it appears you neglected, which is position-adjusting Reb48. If you don’t do that, its impact is hidden because WP48 does include a position adjustment. Once you do, I think you will find that position-adjusted Reb48 is indeed a very powerful predictor of WP48. (Or alternatively, you could use AdjP48 as your dependent variable, and then you will also see the dominant role of Reb48).

And I do hope you will find the time to run that simulation to estimate team variance in rebounding, using position-specific means and SD. Whatever else we may disagree about, I think we both understand that will give us a good estimate of how large diminishing returns on rebounds actually are. If it turns out they are relatively small, as Dr. Berri believes, then there’s no point in even arguing about whether it’s possible to apportion credit in a better way. But if the diminishing returns are on the order of 60-80%, as some believe, then I think you would agree it’s worth considering alternative ways of allocating credit for rebounds.

• Alex says:

Not even a ‘thanks for correcting my misconceptions on correlations’? Right back to work. It’s enough material that I’ll do a post.

3. Guy says:

Alex: I can’t tell if you are being serious here. What misconception do you think your post helped clear up?

On the team level, I had a hard time following what you did. But fortunately EvanZ has done two nice posts recently breaking down the relative importance of each factor in explaining team wins. He shows that rebounding (reb%) accounts for about 15% of the variance in team point differential, and is less than 1/3 as important as shooting. If you disagree with that (and I’m not sure if you do), I hope you will explain how your results differ.

Turning to the player level, you report several different findings but without a lot of numbers, and again I confess to being a bit lost. Let me tell you what I see, and you can tell me where your results diverge. Using Arturo’s 2009-2010 data (players with at least 1200 minutes), with WP48 as the dependent variable, these are the standardized coefficients I get for PPS (Points/(FGA + .44*FTA)), Reb48 (position-adjusted), and other relevant variables:
STL48 0.024
BLK48 0.014
TOV48 0.001
PF48 -0.029