So that was a long series on regularized regression, the technique behind RAPM. I covered what it is and how it tends to react to collinearity, sample size, noise, and the number of predictors. I thought a bit of a summary might be in order.
Just to summarize each part in short: regularization is a regression technique that penalizes the size of the beta weights, shrinking them toward zero. It does this somewhat as a side-effect of its main goal, which is avoiding a mathematical issue (a matrix that can't be reliably inverted) that is a common consequence of collinearity. This also means it provides biased beta weights: they are pulled away from the unbiased expected-value estimates that standard regression provides.
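In equation form, ridge regression (the regularization flavor behind RAPM) solves (X'X + λI)β = X'y instead of the standard (X'X)β = X'y; that λ added to the diagonal is what keeps the matrix invertible under collinearity, and it's also what shrinks the betas. A minimal sketch in Python with made-up numbers (the original analyses were run in R; this toy just shows the mechanics):

```python
# Ridge regression in closed form: beta = (X'X + lambda*I)^-1 X'y.
# Toy data: two predictors, three observations (numbers invented for illustration).

def solve2(a, b, c, d, e, f):
    """Solve the 2x2 system [[a, b], [c, d]] beta = [e, f] by Cramer's rule."""
    det = a * d - b * c
    return ((e * d - b * f) / det, (a * f - e * c) / det)

X = [[1, 0], [0, 1], [1, 1]]
y = [1, 1, 2]

# Build X'X and X'y by hand for the two-predictor case.
xtx = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(2)]

lam = 1.0  # the penalty; bigger lambda means more shrinkage

ols   = solve2(xtx[0][0], xtx[0][1], xtx[1][0], xtx[1][1], xty[0], xty[1])
ridge = solve2(xtx[0][0] + lam, xtx[0][1], xtx[1][0], xtx[1][1] + lam, xty[0], xty[1])

print(ols)    # (1.0, 1.0): standard regression fits this data exactly
print(ridge)  # (0.75, 0.75): ridge shrinks both betas toward zero
```

The only difference between the two calls is the λ added to the diagonal; everything else is ordinary least squares.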
I mocked up a series of data sets that are not exactly like NBA data, but were meant to examine how standard and regularized regression deal with issues that might matter for NBA data. The first one varied the strength of the correlation between the predictor variables. In general, the more correlated the predictors are with each other, the more 'regularized' your regression should be. Regularized regression did a better job of recovering the true beta weights and predicting new data than standard regression, particularly in the face of strong collinearity. However, due to the bias in the estimates, the error on any particular observation was larger for regularized regression.
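To see why collinearity hurts standard regression so much, here's a hand-built toy case (all numbers invented for illustration): two nearly identical predictors, true betas of 1 and 1, and a tiny bit of noise. Standard regression swings all the way to (0, 2); ridge stays near the truth:

```python
# Two nearly collinear predictors (x2 is x1 plus a tiny wiggle), true betas
# (1, 1), and noise that happens to point along the wiggle. Made-up numbers.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.1, 1.9, 3.1, 3.9]
y  = [2.2, 3.8, 6.2, 7.8]  # = x1 + x2 + small noise

def fit(lam):
    """Solve (X'X + lam*I) beta = X'y for the two-predictor case."""
    a = sum(v * v for v in x1) + lam
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + lam
    e = sum(u * v for u, v in zip(x1, y))
    f = sum(u * v for u, v in zip(x2, y))
    det = a * d - b * b
    return ((e * d - b * f) / det, (a * f - b * e) / det)

ols   = fit(0.0)  # standard regression
ridge = fit(1.0)  # ridge with a modest penalty

print(ols)    # (0.0, 2.0): tiny noise flips standard regression to a wild answer
print(ridge)  # roughly (0.96, 1.00): ridge stays near the true (1, 1)
```

The explained error tells the opposite story, as in the posts: the standard fit is perfect on this sample, but its betas are the ones you wouldn't want to pay players by.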
This general pattern (worse explanation, but better recovery of beta weights and better prediction) also held when I varied the number of observations in the data set, and with more observations I saw that more regularization was needed. It held yet again when I looked at different levels of noise (i.e., how well the predictors actually predict the Y values), although both regressions obviously performed worse when more noise was present. There was an interesting pattern with the number of predictors: as you add more predictors, regularized regression gets worse (albeit still better than standard regression in the ways described before).
So, what might we conclude about RAPM based on this general information about regularization? First of all, it should be preferred to APM in virtually all situations. While standard regression (the basis for APM) produces smaller errors, those errors are in explaining known Y values (for example, points scored while certain players were on the floor). We don't generally care about that; we care about what might happen in the future (where regularized regression has smaller errors) and what players are 'worth' (where regularized regression has smaller coefficient errors).
However, because the estimates are biased and standard regression does a better job of describing already-known events, RAPM may not be best for explaining what has already happened. RAPM may not be what you want to base an MVP vote on, for example; you would at least want to weight it less there than you would for a decision about who will be MVP next year, or playoff MVP. RAPM really shines looking forward, not so much looking backward.
Regularized regression reacted interestingly to a couple of choices that come up often in APM-style analysis. First, when more observations were used, more regularization occurred. This suggests that early in the season, RAPM and APM may not be so different (at least in terms of overall accuracy, not accuracy for any one player). But as more games are played, or if multiple seasons are used, RAPM should diverge more and more from APM (and presumably improve on it). However, RAPM didn't handle increasing numbers of predictors as well as standard regression did. It was still better, so it isn't as if you should suddenly switch to APM if you're running a multi-year analysis with 1,000 players, but it might be better to run single-year RAPM than multi-year.
A question worth asking is: how much like actual NBA data were my mock data? One potentially important difference is that my numbers were all drawn from the normal distribution. APM-style data is all 1s, 0s, and -1s, with the vast majority of the entries being 0 (in other words, it's very sparse). That could affect how the regressions respond in ways I haven't looked at here. Evan was generous enough to send me a file with the last two-plus years of player data set up in APM style. There are over 70,000 observations (each observation made up of at least one possession where the same 10 players are on the floor continuously), which is many more than I looked at. In the entire set there are over 600 predictors/players, which is about as high as I went. As mentioned in that post, I covered the noise range in the data.
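For the curious, one row of an APM-style design matrix looks like this (the player indices and margin here are invented for illustration; the real file has 70,000-plus such rows):

```python
# Build one APM-style observation: +1 for home players on the floor,
# -1 for away players, 0 for everyone else. With ~600 player columns and
# only 10 nonzero entries per row, the matrix is very sparse.
n_players = 600  # roughly the number of players in the two-plus-year file

def stint_row(home_ids, away_ids):
    row = [0] * n_players
    for p in home_ids:
        row[p] = 1   # home player on the floor
    for p in away_ids:
        row[p] = -1  # away player on the floor
    return row

# Hypothetical stint: five home players vs. five away players (made-up indices).
row = stint_row([3, 17, 42, 88, 311], [5, 23, 199, 402, 577])
margin = 2  # Y value: home-minus-away scoring over the stint (made-up number)

nonzero = sum(1 for v in row if v != 0)
print(nonzero)   # 10 nonzero entries out of 600
print(sum(row))  # 0: the five +1s cancel the five -1s
```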
And of course the amount of collinearity is important. Evan has been posting a couple of cool ways to look at the correlation in playing time between any two players, but the real issue is how well one player's playing time (being on or off the court) is predicted by all the other players. I used the VIF for that, but unfortunately neither of the VIF functions in R will give me values, because the actual NBA data is either too large (I get a memory error) or too collinear (I get a NaN output). So I calculated an R squared for the first 40 or so players in the data set, one at a time, to get an idea of how bad the collinearity is. Even across two-plus seasons of data, the average R squared for those players is .9. The average VIF is 17.5, above the cut-off of 10 that is usually suggested as a sign of 'too much' collinearity. And while some players aren't in too bad a shape on their own (the smallest R squared was .534), you have to look at the entire data set. If some players are too high, they all might as well be too high as far as the regression calculations are concerned.
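The VIF itself is simple once you have the R squared from regressing one player's column on everyone else's: VIF = 1 / (1 − R²). A sketch with three tiny made-up columns (the real calculation is the same idea across 600-plus columns):

```python
# VIF for column 0: regress it on the other columns and convert the
# resulting R^2. Three made-up on/off-style columns, four observations.
c0 = [1, 1, 2, 1]   # column being checked (invented values)
c1 = [1, 0, 1, 0]
c2 = [0, 1, 1, 0]

def centered(col):
    m = sum(col) / len(col)
    return [v - m for v in col]

y, x1, x2 = centered(c0), centered(c1), centered(c2)

# Normal equations for the two-predictor regression (the intercept is
# handled by centering everything first).
a = sum(v * v for v in x1)
b = sum(u * v for u, v in zip(x1, x2))
d = sum(v * v for v in x2)
e = sum(u * v for u, v in zip(x1, y))
f = sum(u * v for u, v in zip(x2, y))
det = a * d - b * b
b1, b2 = (e * d - b * f) / det, (a * f - b * e) / det

ss_res = sum((yi - b1 * u - b2 * v) ** 2 for yi, u, v in zip(y, x1, x2))
ss_tot = sum(yi ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot
vif = 1 / (1 - r2)

print(round(r2, 3), round(vif, 3))  # R^2 = 0.667, VIF = 3.0
```

An R squared of .9, as in the real data, works out to a VIF of 10 for that player; the average of 17.5 across players means many are well past the usual cut-off.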
Overall, I think the mock data do a good job of describing what actual NBA data are like. The actual data set is on a larger scale, but not ridiculously larger (just large enough to give my computer/version of R memory issues). A single season of data would probably look similar to some of what I presented across these posts. One thing that might be worth looking at in the future is the spread of the beta weights. Here I varied them with the number of predictors; at NBA-type sizes, the true 'player' values ranged from 1 to over 400. That obviously is not the case in the actual NBA, and it's possible that standard and regularized regression would respond differently if all 400-plus players fell in a range from, say, -10 to 10.

But it was fun (for me at least) to look at some simulated data again and to play with regularized regression. I think a decent amount of sports statistics is headed in the direction of being more and more complex, and the techniques used are more and more difficult for people to wrap their heads around. Even if you know how a technique works, you may or may not be familiar with its boundary conditions: when it works, when it breaks down, and how it responds to different kinds of data. If you're interested in sports statistics to become a better consumer of sports, you also have to become a better consumer of statistics.