## Regularization: The R in RAPM – Background

One of the more popular NBA metrics right now is RAPM.  The R stands for ‘regularized’, and the improvement over APM is that regularization helps with collinearity.  I’m not going to be looking at actual NBA data (sorry), but I wanted to look into exactly how helpful regularization is.  So across a couple of posts I’m going to lay out what regularized regression is and how well it deals with collinearity as a few other factors vary.  It’s going to be math-y, but I think it’s worthwhile for understanding the strengths and limitations of RAPM and the method in general.

To explain what regularized regression is, we have to start with plain ol’ linear regression.  In short, regression exists to tell us what the relationship is between one variable (Y, the dependent variable) and some number of other variables (the Xs, or independent variables).  The general equation is Y = B0 + B1*X1 + B2*X2 + … + e.  Y is whatever we want to explain; in the APM case for the NBA, it is the point differential for a portion of time where one set of ten players was on the floor.  The Xs are what we think might explain why different observations have different Y values; in the APM case each X is the presence or absence of a given player.  The idea is that a player brings a certain value to being on the court, and so his team’s score differential will be due in part to whether he’s in the game or not.  The Bs are the beta weights and tell you how much a change in X affects Y.  In the case of APM a player can be on the court or off, so X is either 1 or 0 (a slight simplification).  If a given player gets a beta weight of 2, that means his team’s point differential should increase by 2 when he’s on the court as opposed to off it.  Finally, e is the error; it’s just the difference between the prediction made by adding up all the B*X’s for a particular observation and the actual Y value.

EDIT: there are a couple of other basketball stats summaries that do a more thorough job of describing APM, which uses the same data set-up as RAPM.  Off the top of my head, you can take a look at Daniel’s and Evan’s.  I (humbly) suggest that I’m going to be a bit more thorough with RAPM over the course of the next few days though, so make sure you come back!
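To make the APM-style set-up concrete, here’s a minimal sketch in NumPy.  The lineups and point differentials are completely made up; each row of X marks which of three hypothetical players was on the floor, and the betas are the fitted player values.

```python
import numpy as np

# Hypothetical example: 6 "stints" and 3 players.  Each row of X marks
# which players were on the floor (1) or off (0); y is the point
# differential for that stint.  All numbers here are invented.
X = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
], dtype=float)
y = np.array([4.0, 1.0, -2.0, 3.0, 2.0, -1.0])

# Ordinary least squares: find the beta weights that minimize the sum
# of squared errors, via the normal equations (X'X) b = X'y.
betas = np.linalg.solve(X.T @ X, X.T @ y)

# e, the error, is just the gap between prediction and reality.
residuals = y - X @ betas
```

A beta of 2 for the first player would mean the model expects his team’s point differential to be 2 points better with him on the floor, exactly as described above.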

That’s a quick and dirty explanation of regression; hopefully you’re already familiar with it.  The goal of regression is to minimize the sum of the squared errors; we compare our predictions from B*X to the measured Y values and play with B until those errors (technically, the errors squared for each observation and then added up across observations) are as small as we can get them.  This is usually done with matrix algebra, as the notation is much simpler that way and computers are good at playing with matrices.  The issue for APM is that part of that matrix algebra involves taking the inverse of a matrix related to X.  If the X’s are collinear, meaning correlated with each other, the inverse becomes poorly defined and hard to estimate; in the extreme, the inverse may not exist.  That means that the beta weights you get out will be finicky, to be polite.  They become very sensitive to the details of your data; if you ran the regression again but added or removed a predictor, or added or subtracted some observations, the beta weights could change dramatically.  The standard errors on the beta weights, which are used for determining whether a weight is significantly different from 0, will be very large, meaning that you can’t be very certain about the value of a given player.  The beta weights themselves can even be inaccurate.  Collinearity, especially when there’s a lot of it, is a bad deal.
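You can see the finickiness for yourself with a small simulation.  This is a toy example with made-up data, not anything APM-specific: two predictors that are nearly copies of each other produce a badly conditioned X, and the individual betas become unstable even though their sum stays pinned down.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly identical to x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=1.0, size=n)    # true betas are [1, 0]

# The condition number of X is huge, which is the matrix-algebra
# symptom of collinearity: the inverse is poorly defined.
cond = np.linalg.cond(X)

# Fit on two overlapping subsets of the data; with collinear X the
# individual betas can swing wildly between fits, while their sum
# (the only well-identified quantity) stays near 1.
b_a, *_ = np.linalg.lstsq(X[:150], y[:150], rcond=None)
b_b, *_ = np.linalg.lstsq(X[50:], y[50:], rcond=None)
```

The regression can only pin down the combined effect of the two predictors; how that effect gets split between them is essentially arbitrary, which is why adding or removing a few observations can change each beta dramatically.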

This is bad news for APM, because NBA players are pretty collinear.  Players tend to play together, like starters for example.  They get many of their minutes with each other.  A player and his back-up will also be correlated, although negatively so; when one is on the court, the other is usually off.  As a side note, collinearity is strictly a property of the X matrix.  Regardless of what Y is, if you have the same X matrix your data will have the same collinearity.  So you can try to have adjusted point rating, or adjusted rebounding, or adjusted whatever, and the estimates you get out will all be equally untrustworthy.  So how can you fix a collinearity problem?  One way is to add more data.  If your new data lowers the correlation between X’s you’ll have made things better.  With APM, you can add additional seasons and the collinearity should drop due to players switching teams or simply having more time spent in different line-ups.  Alternatively, you can try regularized, or ridge, regression.
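A quick sketch of the starter/backup pattern, using invented lineups: two starters who always share the floor are perfectly correlated, a starter and his backup are perfectly negatively correlated, and adding stints where the starters play apart lowers the correlation.

```python
import numpy as np

# Five made-up stints: 1 = on the floor, 0 = off.
starter_a = np.array([1, 1, 1, 0, 0])
starter_b = np.array([1, 1, 1, 0, 0])  # always plays with starter_a
backup_a  = np.array([0, 0, 0, 1, 1])  # plays only when starter_a sits

r_starters = np.corrcoef(starter_a, starter_b)[0, 1]  # +1: perfectly collinear
r_backup = np.corrcoef(starter_a, backup_a)[0, 1]     # -1: perfectly (negatively) collinear

# New data: two stints where the starters are split up.  The
# correlation between them drops, easing the collinearity.
starter_a2 = np.append(starter_a, [1, 0])
starter_b2 = np.append(starter_b, [0, 1])
r_new = np.corrcoef(starter_a2, starter_b2)[0, 1]  # well below 1 now
```

Note that Y never enters these calculations, which is the point made above: swap in adjusted rebounding or adjusted anything for Y and the collinearity, and the resulting untrustworthiness, is unchanged.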

What regularization tries to do is fix the problem with the inverse matrix.  You add a new term to the inverse which makes it more stable.  This has the effect of changing what the regression minimizes.  Instead of just minimizing the sum of the squared errors, you minimize the sum of the squared errors plus the sum of the squared beta weights times some factor (called lambda in later posts).  If that factor were 0, you would just have the sum of the squared errors and it would be standard linear regression.  Otherwise, the larger that factor is, the smaller the beta weights will be; they all get shrunk towards 0 (you could also shrink them towards some other number; Jerry’s ratings use a player’s rating from the previous season).  The value of that factor is usually found by cross-validation, which means that you run the ridge regression on a portion of your data with some value of lambda, get the beta weights, and then see how well those weights predict the left-out portion of your data.  You wiggle the value around until you get the weights that best predict the left-out data.
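The whole recipe fits in a few lines of NumPy.  This is a bare-bones sketch on simulated data with a hypothetical grid of lambda values and a single train/test split (real cross-validation would rotate the held-out portion); the closed form adds lambda times the identity to X'X, which is the stabilizing term described above.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - Xb||^2 + lam * ||b||^2 in closed form."""
    p = X.shape[1]
    # lam * I is the new term that stabilizes the inverse; lam = 0
    # reduces to ordinary least squares.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data, not real NBA stints.
rng = np.random.default_rng(1)
n, p = 120, 4
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(size=n)

# Cross-validation in miniature: fit on one portion, score on the
# left-out portion, keep the lambda that predicts best.
train, test = slice(0, 80), slice(80, None)
best_lam, best_err = None, np.inf
for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    b = ridge_fit(X[train], y[train], lam)
    err = np.mean((y[test] - X[test] @ b) ** 2)
    if err < best_err:
        best_lam, best_err = lam, err
```

The larger lambda gets, the more the betas shrink towards 0; shrinking towards a prior-season rating instead, as mentioned above, just means replacing the ||b||^2 penalty with a penalty on the distance from those prior values.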

In general, ridge regression is an improvement over standard linear regression.  But how much, and under what circumstances?  In the next few posts, I’m going to look at a few different factors and simulate some data to compare the two.