Moving right along with my look at regularized, or ridge, regression, this time I'm going to look at the impact of having more data. I mentioned previously that we might expect better estimates when more data are included: as players change teams and appear in a wider variety of line-ups, the collinearity decreases and the regression becomes a bit more robust. So practically speaking, adding more seasons to APM or RAPM conflates two things: less collinearity and more observations. I'm going to pull them apart; last time I covered collinearity, and this time it's just increasing the number of observations. This is more akin to having more data within a single season, like checking after 20 games instead of 10. While you'll get some line-up changes within a season due to injuries, trades, and the like, it's presumably less than what you get across seasons, and so more similar to simply increasing the sample size. Even more precisely, it resembles running APM within a playoff series after one game, then two, and so on.

The process is going to be just like last time, except instead of looking at various levels of correlation between variables I’m going to keep that constant and change the N. The code will be at the bottom again, and obviously similar to last time. I decided to look at a range from only 25 observations up to 10,000 in steps of 100, and each level gets five simulations.

Here's a plot of mean VIF by number of observations. As with many things, increasing sample size serves to decrease the noise; since the correlation is roughly fixed, the VIF should be as well, but when you're at the mercy of small sample sizes odd things can happen. The effect on R squared is similar; for the particular parameters I chose, the R squared tended to be around .5.

Next we have the effect of sample size on lambda. Remember that this is an index of how much regularization is going on. As the sample size increases, the lambda increases as well.

That's really about it, because the rest of the plots turn out exactly as you would expect. The retrodiction error stabilizes but doesn't really improve (since I'm using mean error) with increasing sample size, and standard regression outperforms ridge. The beta weight error decreases with increasing sample size for both, and ridge generally outperforms standard; the same pattern holds for prediction error. So, not surprisingly, you want a larger sample size if possible, even if the collinearity in your data stays constant.
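As a sanity check on that first plot, the value the sample VIF should settle toward can be computed directly: when all predictors share a common pairwise correlation, the population VIF is just the diagonal of the inverse correlation matrix. A quick sketch, using the same .99 correlation and four predictors as the simulation below:

```r
# Population VIF for p equicorrelated predictors: with the correlation
# fixed, the sample VIF should converge to this value as n grows.
p <- 4
rho <- 0.99                   # same correlation used in the simulation
R <- matrix(rho, p, p)
diag(R) <- 1
pop_vif <- diag(solve(R))[1]  # diagonal of the inverse correlation matrix
pop_vif                       # about 75 for rho = .99 and four predictors
```

So the small-sample wobble in the plot is scatter around roughly 75, not a change in the underlying collinearity.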

Here's the R code. If you looked at the last post, it's pretty much the same except I increment observations (obs) instead of correlation. I also changed the graphs to plot the values of interest against observations instead of correlation.

library(car)     # vif()
library(MASS)    # mvrnorm()
library(parcor)  # ridge.cv(); note parcor has since been archived on CRAN

correl <- .99                           # fixed correlation between predictors
obs <- rep(seq(25, 10000, by=100), 5)   # sample sizes, five runs at each level
noise <- 10

linearerr <- NULL
ridgeerr <- NULL
vifs <- NULL
linearprederr <- NULL
ridgeprederr <- NULL
linearcoeferr <- NULL
ridgecoeferr <- NULL
ridgelambda <- NULL
rsquared <- NULL

for (a in 1:length(obs)) {
  # generate four predictors with the fixed correlation structure
  means <- c(0, 0, 0, 0)
  covar <- matrix(correl, 4, 4)
  diag(covar) <- 1
  Xs <- mvrnorm(obs[a], means, covar)
  Xsframe <- data.frame(Xs)

  # true model: intercept 0, betas 1 through 4
  Y <- Xsframe[,1] + 2*Xsframe[,2] + 3*Xsframe[,3] + 4*Xsframe[,4] + rnorm(obs[a], 0, noise)

  # fit on the first half, predict on the held-out second half
  # (floor() keeps the indices integer when obs[a] is odd)
  half <- floor(obs[a]/2)
  train <- 1:half
  test <- (half+1):obs[a]

  linfit <- lm(Y ~ X1 + X2 + X3 + X4, data=Xsframe, subset=train)
  # $int, $coef, and $lam partially match ridge.cv's intercept,
  # coefficients, and lambda.opt components
  ridge.object <- ridge.cv(Xs[train,], Y[train])

  # distance of the estimated coefficients from the true values
  linearcoeferr2 <- sqrt(linfit$coef[1]^2 + (linfit$coef[2]-1)^2 + (linfit$coef[3]-2)^2 + (linfit$coef[4]-3)^2 + (linfit$coef[5]-4)^2)
  ridgecoeferr2 <- sqrt(ridge.object$int^2 + (ridge.object$coef[1]-1)^2 + (ridge.object$coef[2]-2)^2 + (ridge.object$coef[3]-3)^2 + (ridge.object$coef[4]-4)^2)

  # retrodiction: mean absolute error on the training half
  # (use matrix multiplication; elementwise coef*Xs would recycle wrongly)
  linearerr2 <- mean(abs(linfit$fit - Y[train]))
  ridgefitted <- ridge.object$int + Xs[train,] %*% ridge.object$coef
  ridgeerr2 <- mean(abs(ridgefitted - Y[train]))

  # prediction: mean absolute error on the held-out half
  linearpred <- linfit$coef[1] + Xs[test,] %*% linfit$coef[2:5]
  ridgepred <- ridge.object$int + Xs[test,] %*% ridge.object$coef
  linearprederr2 <- mean(abs(linearpred - Y[test]))
  ridgeprederr2 <- mean(abs(ridgepred - Y[test]))

  linearerr <- c(linearerr, linearerr2)
  ridgeerr <- c(ridgeerr, ridgeerr2)
  linearprederr <- c(linearprederr, linearprederr2)
  ridgeprederr <- c(ridgeprederr, ridgeprederr2)
  linearcoeferr <- c(linearcoeferr, linearcoeferr2)
  ridgecoeferr <- c(ridgecoeferr, ridgecoeferr2)
  vifs <- c(vifs, mean(vif(lm(Y ~ X1 + X2 + X3 + X4, data=Xsframe))))
  ridgelambda <- c(ridgelambda, ridge.object$lam)
  rsquared <- c(rsquared, summary(linfit)$r.sq)
}

plot(obs, vifs)
plot(obs, rsquared)
plot(obs, ridgelambda)

plot(obs, linearerr)
plot(obs, ridgeerr)
plot(linearerr, ridgeerr)
abline(0, 1)

plot(obs, linearcoeferr)
plot(obs, ridgecoeferr)
plot(linearcoeferr, ridgecoeferr)
abline(0, 1)

plot(obs, linearprederr)
plot(obs, ridgeprederr)
plot(linearprederr, ridgeprederr)
abline(0, 1)