## Regularization: The R in RAPM – Collinearity

As mentioned in the background, the main benefit of regularized regression over standard regression is how it deals with collinearity.  When your predictors are correlated with each other, standard regression becomes unreliable but ridge regression attempts to work around the problem.  Here I’m going to take a look at the role that collinearity plays.  This one is going to be a little long because it sets up the whole simulation, but it means the future posts will be shorter.

For the data, I’m going to mock up a data set so that I can control the collinearity.  I’m going to do this in R, and I’ll paste the code at the bottom so people can play with it if they want.  To keep things simple, I’m going to use four standardized variables, meaning they all have a mean of 0 and a covariance of 1.  We need to make a decision about how correlated my variables will be.  To examine how ridge regression is affected by the amount of collinearity, I’m going to vary the correlation from 0 to .99 in steps of .01.  To keep things simple, I’m going to make each X correlate with each other X at the same value.  When the program draws a sample the means, covariances, and correlations may not quite be 0, 1, and .99 (or whatever the value is for that simulation), but they will come from a population with those characteristics.  To make up for that uncertainty and get more observations, I’m simulating a data set at each correlation level five times.

Next we have to talk about measuring how collinear the data are.  When you talk about two variables, it’s easy to use the correlation.  However, when we have a number of X’s, the issue is worse because any given X could be a combination of any number of the other Xs.  That’s all you need to get that inverse problem.  To look at the collinearity in the X matrix, you have a few options but perhaps the most straightforward is the variance inflation factor, or VIF.  The VIF for a predictor X is related to the R squared of predicting that X from the other Xs; essentially you pretend that one of the X’s is now your dependent variable and you run a regression predicting it with all the other Xs.  If you have collinearity, the Xs will predict that X well and the R squared will be high.  This results in a high VIF for that X.  You do this for all of the Xs and you get a VIF for each one.  There is no set rule for what VIF is too high, but I often see 10; that corresponds to an R squared of .9.  If you have variables with a VIF over 10, you have collinearity issues.  In our simulation, I’m going to look at the mean VIF across the Xs, and we expect it to be higher when the correlation is higher.

Of course, we have to have something to predict.  Once I have my X matrix created, I’m going to create my Y variable by picking weights ahead of time and then adding some noise.  The more noise there is, the worse the regression will be.  But critically, since I pick the weights for creating Y, I know what the results of the regression should be.  A good regression will give me back the weights I used.  Again, to keep things simple, I picked weights of 1, 2, 3, and 4.  So a given Y value is found by multiplying the value for X1 by 1, adding it to 2 times the value for X2, adding 3 times X3, adding 4 times X4, and then adding some random noise value.  For example, if my first observation had X values of [-1,-1.5,-.5,0], the Y for that observation would be -1*1+-1.5*2+-.5*3+0*4 = -5.5 plus some random number.

Next is the actual regression part.  Since part of the appeal of regularized regression is that it is supposed to be better at predicting new data (remember that standard regression has issues with new data when there’s collinearity), I’m going to run a standard regression and a ridge regression on half of my simulated data.  The ridge regression will pick a lambda value (the factor that I mentioned last time) based on cross-validation within that half data set.  I can then see how good those regressions are by calculating the mean absolute error of their predictions within that data set; this isn’t quite the sum of squared error that regression minimizes, but it’s similar.  I can also look at how far off the beta weights are from their true values (1,2,3,4).  Finally, I can use the beta weights from the first half of the data to predict the Y values for the second half of the data, and again look at the mean absolute error.  This is where ridge regression should shine.

Now you start getting some pretty graphs.  Remember that I’m varying the correlation between the Xs here, but our real measure of collinearity is the VIF.  This is a scatterplot of correlation by mean VIF:

I added a horizontal line at VIF=10 and a vertical line at correlation=.9.  You can see that the VIF stays relatively low, at least below our guideline for bad collinearity, until we get to that correlation=.9 point, and then it takes off.  Basically, things go from bad to worse increasingly quickly.  You’ll also notice that the mean VIF topped out at about 80, which corresponds to a correlation of about .994, which fits with what I put into the simulation.

How about the R squared of the standard regression?  It actually goes up with collinearity.  I think this is because the correlations help to constrain the data; it’s hard to have an outlier or odd observation if the predictors have to fit into a tight arrangement with each other.How much of an influence does the ridge have with increasing collinearity?  If we use the size of the lambda as a measure of the amount of regularization (remember that a lambda of 0 corresponds to standard linear regression), we can look by plotting lambda against correlation size:Somewhat surprising, to me, is that the lambda increases but then comes back down at very high collinearity, at least in general.  But it certainly increases across most of the range, showing that as collinearity gets worse you need more of a contribution from that extra factor that regularization adds in.

Next, how do standard and ridge regression respond to collinearity when explaining a data set, i.e. ‘predicting’ the data it was given?  Here’s a plot of mean absolute error for standard (linear) regression versus correlation followed by the same plot for ridge regression.Both graphs show that error is pretty unrelated to collinearity, although ridge regression has a few hiccups at the high end.  The scale on the graphs implies that ridge regression has a worse error in general, but that could be due to those odd points at the top right.  We can see more clearly that ridge regression is worse by directly plotting the errors against each other:I added a line with slope 1, which is where you would expect to see if the points were mostly similar and ridge regression had the same error at any given level of collinearity.  Instead you see that most (95%) of the points are above that line, showing that ridge regression gives you larger errors.  This is an unfortunate consequence of adding that bias to the inverse matrix; standard regression guarantees you the smallest overall error it can manage, and by definition ridge regression moves away from that.  Hopefully you knew this point from my retrodiction contest.

How do the regressions do in recovering the correct beta weights?  Since there are five weights (an intercept and one for each X), I decided to measure error using the Euclidean distance, or the square root of the sum of squared error.  The actual weights should be (0,1,2,3,4); if the regression spit out (0,0,0,0,0), for example, the error would be 5.48.  Here’s the standard and ridge coefficient error by correlation.We see that the standard regression stays pretty constant but then falls apart for some of the highest collinearities while the ridge regression does the same but doesn’t fall apart quite as much.  You can again see the relationship more clearly by plotting the two errors against each other and adding the slope=1 line; this time the dots are mostly (75%) on the linear side, showing that it tends to have the larger errors (I didn’t bother with the graph, but the code for it is included below).  This is less true at lower levels of correlation; standard regression starts to have a lower error than ridge more often as you cut out the higher correlation simulations.

Ok, so finally how do the regressions do at predicting the other half of the data?  Here are graphs, as above, for standard and ridge regression mean absolute prediction error plotted against correlation and each other. Again, standard regression gets worse with increasing correlation and blows up somewhat at the end.  Ridge regression does a little better but does still increase with increasing collinearity and has a smaller blow-up.  And the two actually have pretty similar errors overall, but when there’s a bad mistake it happens for standard regression.

Conclusions!  More collinearity is bad, even when you run a ridge regression to try and account for it.  If you’re running that regression within your data set, the error will be worse for the ridge regression.  However, the estimated beta weights will be a little better in general, mostly when there’s more collinearity.  And, as it should be, ridge regression typically does better when applied to a new sample.

As promised, here’s my R code.  You’ll need the packages that I load at the beginning.  I don’t make any claims that this is the easiest or most elegant way to do what I did, but it works.  If you have a quick question about the code you can leave a comment, but if it’s more involved drop me an email.  And don’t try to run all the plots at the same time, because they’ll draw over each other.  Do one, take a look, then run the next.

library(car)
library(MASS)
library(parcor)

correl=c(rep(seq(.99,0,by=-.01),5))
obs=500
noise=10
linearerr=NULL
ridgeerr=NULL
vifs=NULL
linearprederr=NULL
ridgeprederr=NULL
linearcoeferr=NULL
ridgecoeferr=NULL
ridgelambda=NULL
rsquared=NULL
for (a in 1:length(correl)) {
means=c(0,0,0,0)
covar=matrix(c(1,correl[a],correl[a],correl[a],correl[a],1,correl[a],correl[a],correl[a],correl[a],1,correl[a],correl[a],correl[a],correl[a],1),4,4)
Xs=mvrnorm(obs,means,covar)
Xsframe=data.frame(Xs)
Y=Xsframe[,1]+2*Xsframe[,2]+3*Xsframe[,3]+4*Xsframe[,4]+rnorm(obs,0,noise)
linfit=lm(Y[1:(obs/2)]~X1[1:(obs/2)]+X2[1:(obs/2)]+X3[1:(obs/2)]+X4[1:(obs/2)],data=Xsframe)
ridge.object=ridge.cv(Xs[1:(obs/2),],Y[1:(obs/2)])
linearcoeferr2=sqrt(linfit\$coef[1]^2+(linfit\$coef[2]-1)^2+(linfit\$coef[3]-2)^2+(linfit\$coef[4]-3)^2+(linfit\$coef[5]-4)^2)
ridgecoeferr2=sqrt(ridge.object\$int^2+(ridge.object\$coef[1]-1)^2+(ridge.object\$coef[2]-2)^2+(ridge.object\$coef[3]-3)^2+(ridge.object\$coef[4]-4)^2)
linearerr2=mean(abs(linfit\$fit-Y[1:(obs/2)]))
ridgefitted=ridge.object\$int+rowSums(ridge.object\$coef*Xs[1:(obs/2),])
ridgeerr2=mean(abs(ridgefitted-Y[1:(obs/2)]))
linearpred=linfit\$coef[1]+rowSums(linfit\$coef[2:5]*Xs[(obs/2+1):obs,])
ridgepred=ridge.object\$int+rowSums(ridge.object\$coef*Xs[(obs/2+1):obs,])
linearprederr2=mean(abs(linearpred-Y[(obs/2+1):obs]))
ridgeprederr2=mean(abs(ridgepred-Y[(obs/2+1):obs]))
linearerr=c(linearerr,linearerr2)
ridgeerr=c(ridgeerr,ridgeerr2)
linearprederr=c(linearprederr,linearprederr2)
ridgeprederr=c(ridgeprederr,ridgeprederr2)
linearcoeferr=c(linearcoeferr,linearcoeferr2)
ridgecoeferr=c(ridgecoeferr,ridgecoeferr2)
vifs=c(vifs,mean(vif(lm(Y~X1+X2+X3+X4,data=Xsframe))))
ridgelambda=c(ridgelambda,ridge.object\$lam)
rsquared=c(rsquared,summary(linfit)\$r.sq) }

plot(correl,vifs)

plot(correl,rsquared)

plot(correl,ridgelambda)

plot(correl,linearerr)
plot(correl,ridgeerr)
plot(linearerr,ridgeerr)
abline(0,1)

plot(correl,linearcoeferr)
plot(correl,ridgecoeferr)
plot(linearcoeferr,ridgecoeferr)
abline(0,1)

plot(correl,linearprederr)
plot(correl,ridgeprederr)
plot(linearprederr,ridgeprederr)
abline(0,1)