## Is Efficiency All You Need? Model Building Part 1

Advanced NFL Stats has both driven and pre-dated my interest in sports analysis.  After reading The Wages of Wins, I thought the use of regression in basketball was great, and then found that Brian Burke at ANFLS (that’s what all the cool kids call it.  Or me, with this being the first time ever) had been doing the same with football.  The model that I’ll be using over the course of the season (it’s coming soon, I promise) is very similar to his, but not quite the same.  Part of the reason for the difference will be explained here.

Brian has a well thought-out idea behind his analysis, which is that we should care about skilled, repeatable performance when analyzing teams or players.  For example, he doesn’t use special teams because return touchdowns are essentially random, and he doesn’t use fumbles lost in his predictive model because recovering a fumble is essentially random.  On the other hand, losing a fumble in the first place does seem to correlate across seasons, which is to say that the people who fumble a lot one year tend to fumble a lot the next year, too.  Additionally, Brian uses efficiency measures like passing yards per dropback (pass attempts plus sacks) instead of straight ‘count’ measures like total passing yards, because efficiency reflects a team’s ability whereas the count measures depend on how a game is going.  For example, teams that run a lot and get lots of rushing yards tend to win, but that could very well be because teams that get a lead one way or another run more to use up the clock, even if they aren’t good at running.  Thus running doesn’t cause winning, but is instead an effect of it.  I tend to agree with most of Brian’s thoughts on these kinds of things.
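To make the efficiency idea concrete, here’s a minimal sketch of the measure described above — passing yards per dropback, where a dropback is a pass attempt or a sack.  The team totals below are made up purely for illustration:

```python
def passing_efficiency(pass_yards, attempts, sacks):
    """Yards gained per dropback, where dropbacks = pass attempts + sacks."""
    return pass_yards / (attempts + sacks)

# A hypothetical team: 3,500 passing yards on 550 attempts and 30 sacks.
eff = passing_efficiency(3500, 550, 30)
print(round(eff, 2), "yards per dropback")
```

Unlike a raw count of passing yards, this number doesn’t inflate just because a trailing team throws 50 times in a game.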

However, I disagree with his analysis in some places.  For example, he recently updated his post showing that passing efficiency is more closely connected to winning than running efficiency, or anything else for that matter.  The analysis is both exploratory and explanatory: we’re trying to see what leads to winning after we know who won, and we’re trying to find out which of all the possible factors actually matter.  As you see in Brian’s post, offensive passing efficiency is the most important, but also important are defensive passing efficiency (i.e. how effective your opponent’s passing game is), offensive and defensive running efficiency, interceptions, fumbles lost, and the offense’s penalty rate.  The relative importance of these can be compared by standardizing (scaling or normalizing) each variable, which subtracts the mean from each observation and then divides by the standard deviation.  This creates variables where the mean is 0 and the standard deviation is 1; variables can thus be compared fairly instead of trying to equate passing yards to numbers of interceptions.  Above-average performances have positive values and below-average performances have negative ones.

For example, passing efficiency in my data set (2004 to 2009) has a mean of 6.05 yards/dropback and a standard deviation of .868.  In 2004, the Cowboys put up 6.16 passing yards/dropback, which converts to a standardized score of .119, or slightly above average.  However, their running efficiency has a standardized score of -.371, which is below average.  And because we use standardized scores, we can say that their passing is not as far above average as their rushing is below average.  However, since passing is more important, it turns out that their slightly above-average passing just about cancels out their more below-average rushing.  This is all great; my complaint comes with the choice of variables to include.
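The standardization step can be sketched in a few lines.  The mean and standard deviation here are the figures reported above for the 2004–2009 data set; note that plugging in the rounded published values gives a z-score slightly different from the .119 in the text, which presumably comes from unrounded underlying numbers:

```python
def standardize(x, mean, sd):
    """Convert a raw value to a z-score: (x - mean) / sd."""
    return (x - mean) / sd

# 2004 Cowboys: 6.16 passing yards/dropback, against a league mean of
# 6.05 and SD of .868 (figures from the post, rounded).
z = standardize(6.16, 6.05, 0.868)
print(round(z, 3))  # close to the .119 reported in the post
```

The same function applied to every efficiency stat puts them all on one scale, which is what lets us compare a passing z-score to a rushing z-score at all.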

Since we’re interested in an exploratory analysis, we shouldn’t start out tied to any theory.  For example, penalties committed by opponents are not in the model, and I don’t see any particular reason why not.  So here I’m going to grow a model out of my data set that best describes how many games a team wins in a season.  Here’s the background: my data set covers 2004 to 2009, while Brian’s starts in 2002.  If I run the regression he reports on my data, my regression weights are a little off from his, and I have an R-squared (variance accounted for) of .76 while Brian reports .82.  Given that he has two more years, or another third of my data set, I’m not concerned about the differences.  More data should lead to a higher R-squared, and his beta values (regression weights) are within the confidence intervals of what I get, so I think we’re working from pretty much the same place.  On to the analysis…
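For readers who want to see the mechanics, here is a sketch of the kind of regression being compared above: season wins regressed on standardized predictors, with R-squared computed as the share of variance accounted for.  The data are randomly generated stand-ins (192 rows, i.e. 32 teams times 6 seasons), so the particular weights and R-squared mean nothing — only the procedure is real:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake season-level data: 192 team-seasons with three made-up predictors
# (think offensive pass efficiency, defensive pass efficiency, int. rate).
n = 192
X_raw = rng.normal(size=(n, 3))

# Standardize each column: subtract the mean, divide by the SD.
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Fake wins, driven mostly by the first predictor, plus noise.
wins = 8 + 2.0 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=1.5, size=n)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, wins, rcond=None)

# R-squared: 1 minus residual variance over total variance in wins.
pred = A @ beta
r2 = 1 - ((wins - pred) ** 2).sum() / ((wins - wins.mean()) ** 2).sum()
print("betas:", beta.round(2), "R-squared:", round(r2, 2))
```

Because the predictors are standardized, the fitted betas are directly comparable to one another, which is exactly how the relative importance of passing versus running efficiency gets judged.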