The last thing I wanted to look at with regularized regression is the impact of the number of predictors. With NBA data, you would ideally get an estimate for every player who appears in a game. However, due to collinearity and sample size issues, APM is sometimes calculated with a group of players all set to the same value (referred to as replacement level players). Again, I’m not going to change the collinearity or sample size, at least not explicitly, but here I’ll manipulate how many predictor variables are in the data set.

The general procedure is as before, except now I’m incrementing the number of predictors (X variables) from 2 to 500 in steps of 25, with five simulations at each level. I had some issues with memory use; otherwise I would have gone to a higher number of predictors. I also ramped up the noise, because otherwise the linear model was perfect with more than 2 predictors, and I increased the sample size to compensate for the number of possible predictors. These two factors probably played off of each other, so R squared still ended up high in general.
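To make the setup concrete, here’s a minimal standalone sketch of one simulation cell. It assumes the same equicorrelated design and true coefficients used in the full script below, just at a fixed, small size (p and n here are illustrative values, not the ones from the sweep):

```r
library(MASS)  # mvrnorm()

set.seed(1)
p <- 10        # number of predictors (the full run sweeps 2 to 500)
n <- 1000      # observations
correl <- .99  # pairwise correlation among predictors
noise <- 500   # sd of the error term

# equicorrelated predictors: 1s on the diagonal, correl everywhere else
covar <- matrix(correl, p, p)
diag(covar) <- 1
Xs <- mvrnorm(n, rep(0, p), covar)

# true model: the coefficient on predictor z is z, with intercept 0
Y <- drop(Xs %*% (1:p)) + rnorm(n, 0, noise)
```

With correl at .99 every predictor carries nearly the same information, which is what drives the collinearity (and the huge VIFs) in the results.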

Speaking of R squared, here’s a graph of how it varied with increasing predictors. Despite strong collinearity (shown in the following graph of VIF by number of variables), R squared increases quickly and saturates at 1. This is because the overall fit of the model benefits from all that information and redundancy in the variables.

How about the lambda for the ridge regression? As more variables are added, the standard regression does better and better in terms of fit. So, as you might guess, lambda decreases: the ridge regression becomes more like standard regression.

How well do the regressions describe the data they were fit to? As we saw before, the standard regression is better than the ridge regression. There’s also an interesting pattern in that the standard regression gets better with more variables whereas the ridge regression gets worse.

How about measuring performance by recovering the coefficients? Because I’m varying the number of predictors, I’m also varying the number of coefficients, and because I’m measuring accuracy by distance from the actual coefficients, accuracy would necessarily get worse as more variables are included. To mitigate this a bit, I divided the distance error by the number of variables. By this measure the standard regression gets better with more variables, while the ridge regression stays about the same or maybe gets a tiny bit worse. In any event, ridge is always better than the standard regression.

Finally, we can look at how well the standard and ridge regressions generalize to new data (i.e., make predictions). The standard regression gets a bit worse with more variables, presumably because even though it achieves a decent fit, the mistakes it does make are very susceptible to overfitting. Interestingly, the same is true for the ridge regression. This is probably because, as we saw above, the ridge regression gets closer to standard regression with more variables, and so it picks up this bad trait. Ridge still handles it better, though, with a smaller error regardless of the number of variables.

So: more variables is generally going to improve the fit of your regression, even if they’re pretty collinear. The mean error and coefficient error decrease and the R squared goes up. This is also true for ridge regression, because as the standard regression did increasingly better, the ridge regression decreased its bias to become more standard-like. However, ridge continued to be worse at describing the data it was fit to while being better at predicting new observations.
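For concreteness, the scaled coefficient-recovery metric is just the Euclidean distance between the estimated and true coefficient vectors, divided by the number of predictors. A small sketch (coef_error, estcoef, and truecoef are hypothetical names for illustration, not from the script below):

```r
# per-variable coefficient error: Euclidean distance from the true
# coefficient vector, scaled by the number of predictors
coef_error <- function(estcoef, truecoef) {
  p <- length(truecoef) - 1  # don't count the intercept
  sqrt(sum((estcoef - truecoef)^2)) / p
}

# e.g. p = 3 predictors, true intercept 0 and coefficients 1, 2, 3:
coef_error(c(0, 1, 2, 3), c(0, 1, 2, 3))  # a perfect fit gives 0
coef_error(c(0, 1, 2, 6), c(0, 1, 2, 3))  # off by 3 on one coefficient: 1
```

Dividing by the number of predictors keeps the metric roughly comparable across models of different sizes, though it’s a rough correction rather than a principled one.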

Here’s the R code. It’s a bit cleaned up compared to previous versions because I had to make the covariance matrix and regression portions more flexible. Also note that it takes a while to run because of the increased number of observations (needed to accommodate the large number of variables), which really slows down the ridge regression cross-validation. You might want to increase R’s memory limit before running this one.

library(car)     # vif()
library(MASS)    # mvrnorm()
library(parcor)  # ridge.cv()

correl <- .99    # pairwise correlation among the predictors
obs    <- 10000  # observations; half for fitting, half for prediction
noise  <- 500    # sd of the error term
numvar <- rep(seq(2, 500, by = 25), 5)  # predictor counts, five runs each

linearerr <- ridgeerr <- NULL
linearprederr <- ridgeprederr <- NULL
linearcoeferr <- ridgecoeferr <- NULL
vifs <- ridgelambda <- rsquared <- NULL

for (a in seq_along(numvar)) {
  p <- numvar[a]

  # equicorrelated predictors: 1s on the diagonal, correl everywhere else
  covar <- matrix(correl, p, p)
  diag(covar) <- 1
  Xs <- mvrnorm(obs, rep(0, p), covar)
  Xsframe <- data.frame(Xs)

  # true model: the coefficient on predictor z is z, with intercept 0
  Y <- drop(Xs %*% (1:p)) + rnorm(obs, 0, noise)

  fitrows  <- 1:(obs / 2)
  predrows <- (obs / 2 + 1):obs

  linfit <- lm(Y ~ ., subset = fitrows, data = Xsframe)
  ridge.object <- ridge.cv(Xs[fitrows, ], Y[fitrows])

  # coefficient recovery: Euclidean distance from the true coefficient
  # vector (intercept 0, then 1..p), scaled by the number of predictors
  truecoef <- 0:p
  linearcoeferr <- c(linearcoeferr, sqrt(sum((linfit$coef - truecoef)^2)) / p)
  ridgecoeferr <- c(ridgecoeferr,
                    sqrt(sum((c(ridge.object$int, ridge.object$coef) - truecoef)^2)) / p)

  # in-sample error; note %*% for the linear predictor (an elementwise
  # coef * X recycles the coefficients down the columns and is wrong)
  ridgefitted <- ridge.object$int + drop(Xs[fitrows, ] %*% ridge.object$coef)
  linearerr <- c(linearerr, mean(abs(linfit$fit - Y[fitrows])))
  ridgeerr <- c(ridgeerr, mean(abs(ridgefitted - Y[fitrows])))

  # out-of-sample error on the held-out half
  linearpred <- linfit$coef[1] + drop(Xs[predrows, ] %*% linfit$coef[-1])
  ridgepred <- ridge.object$int + drop(Xs[predrows, ] %*% ridge.object$coef)
  linearprederr <- c(linearprederr, mean(abs(linearpred - Y[predrows])))
  ridgeprederr <- c(ridgeprederr, mean(abs(ridgepred - Y[predrows])))

  vifs <- c(vifs, mean(vif(lm(Y ~ ., data = Xsframe))))
  ridgelambda <- c(ridgelambda, ridge.object$lam)
  rsquared <- c(rsquared, summary(linfit)$r.squared)
}

plot(numvar, vifs)
plot(numvar, rsquared)
plot(numvar, ridgelambda)

plot(numvar, linearerr)
plot(numvar, ridgeerr)
plot(linearerr, ridgeerr)
abline(0, 1)

plot(numvar, linearcoeferr)
plot(numvar, ridgecoeferr)
plot(linearcoeferr, ridgecoeferr)
abline(0, 1)

plot(numvar, linearprederr)
plot(numvar, ridgeprederr)
plot(linearprederr, ridgeprederr)
abline(0, 1)
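One R pitfall worth flagging for code like this: an elementwise product of a coefficient vector and a data matrix, coef * X, recycles the vector down the columns rather than across the rows, so rowSums(coef * X) does not compute the linear predictor. Matrix multiplication with %*% does. A quick demonstration:

```r
X <- matrix(1:6, nrow = 2)  # 2 observations, 3 predictors
b <- c(1, 10, 100)

# correct linear predictor for each observation (row)
drop(X %*% b)    # 531 642

# elementwise product recycles b column-major: silently wrong here
rowSums(b * X)   # 351 624
```

The two agree only in special cases (e.g. when X has a single row), so it’s worth getting in the habit of using %*% for predictions.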
