## Predictivity Part 2

Brian Burke has a new post at Advanced NFL Stats taking another look at which variables are predictive of winning.  The idea is that some things are very predictive of having won games (like scoring a lot of points) but may not be repeatable.  To test this, he made a measure called ‘predictivity’: the correlation between a measure (like passing efficiency) and winning, multiplied by the correlation of the measure with itself when you split the season in half.  Thus a measure can score well by correlating with wins, by correlating with itself, or preferably both.  Take Brian’s example of defensive pass efficiency (how well teams pass against you): it correlates well with winning but not with itself; teams are not stable in their pass defense.  So Brian concludes that defensive pass efficiency may not be very useful in a predictive model.
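Burke’s measure is easy to compute.  Here’s a minimal sketch with made-up numbers — none of this is Brian’s actual data; the team count, noise levels, and win formula are all invented just to show the arithmetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical season for 32 teams: a per-team stat measured in each
# half of the season, plus a win total driven by a latent team quality.
n_teams = 32
skill = rng.normal(0, 1, n_teams)              # latent team quality
stat_first_half = skill + rng.normal(0, 1, n_teams)
stat_second_half = skill + rng.normal(0, 1, n_teams)
season_stat = (stat_first_half + stat_second_half) / 2
wins = 8 + 3 * skill + rng.normal(0, 1.5, n_teams)

# Burke's 'predictivity': (correlation of the stat with winning)
# times (split-half self-correlation of the stat).
corr_with_wins = np.corrcoef(season_stat, wins)[0, 1]
self_corr = np.corrcoef(stat_first_half, stat_second_half)[0, 1]
predictivity = corr_with_wins * self_corr
print(f"corr with wins: {corr_with_wins:.2f}")
print(f"self-corr:      {self_corr:.2f}")
print(f"predictivity:   {predictivity:.2f}")
```

A stat like defensive pass efficiency would show a decent `corr_with_wins` but a `self_corr` near zero, so its product — its predictivity — would be small.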

On the other hand, I have a predictive model.  So I can cut out the middle man and look directly at which variables predict winning.  Brian has one as well, so I’m not sure why he didn’t do this (perhaps he’s planning to); in any case, it’s cool to see the data.  I don’t think many people know that interceptions are fairly random on both sides of the ball, and the self-correlations are a good reminder.

My model predicts winners game by game, so it uses a logistic regression (see the link in the banner for a description of the model).  Thus I can’t look at a correlation per se; instead I can look at how well individual variables do at predicting winning.  And this is truly predictive, because the dependent ‘winner’ variable is whether the team wins its next game (more accurately, whether the home team will win the next game; home field is built directly into the model instead of being included as a parameter).  This regression naturally combines Brian’s two correlations: if a variable doesn’t correlate with winning in the past, it is unlikely to correlate with future winning, and if a variable is unreliable, it should not vary with future winning (its value could be anything, so it should be irrelevant to winning).
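For concreteness, a single-variable version of such a model looks like this (my notation, not necessarily the model’s exact specification):

```latex
\Pr(\text{home team wins next game}) \;=\; \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
```

where $x$ stands in for a team’s past value of a variable like pass efficiency; the real model has one such term per variable.  Because every row is oriented so that the outcome is “home team wins,” home field enters through the structure of the data rather than as an explicit predictor.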

The measure of predictivity here is AIC, which lets you compare different models describing the same dependent variable.  Lower is better.  Points scored, for example, is actually a very good predictor of future winning: the parameter estimate is highly significant at p = 2.35×10⁻¹⁴, and the AIC is 2167.1 (that number is meaningless until we compare it to another model; just keep it in mind for now).  But we tend to pooh-pooh points scored because we think it’s the outcome of other variables, like passing and running well.  Without further ado, here’s a chart with some of the same variables Brian looked at and their AICs:
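To show how the comparison works, here’s a sketch that fits single-variable logistic regressions and computes their AICs.  The data are invented for illustration — these are not the post’s numbers, and the predictor names are just placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)

def logit_aic(X, y, iters=25):
    """Fit a logistic regression by Newton-Raphson and return its AIC,
    where AIC = 2k - 2*log-likelihood and k counts the fitted
    parameters (intercept included).  Lower AIC means a better fit
    after penalizing for model size."""
    X = np.column_stack([np.ones(len(y)), X])   # prepend an intercept
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1 - p)                          # IRLS weights
        beta += np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return 2 * X.shape[1] - 2 * loglik

# Hypothetical game-level data: two standardized past-performance
# predictors and whether the home team won its next game.
n = 500
points = rng.normal(size=n)      # past points scored (standardized)
pass_eff = rng.normal(size=n)    # past pass efficiency (standardized)
won_next = (0.8 * points + 0.5 * pass_eff + rng.logistic(size=n) > 0).astype(float)

aic_points = logit_aic(points, won_next)
aic_pass = logit_aic(pass_eff, won_next)
aic_both = logit_aic(np.column_stack([points, pass_eff]), won_next)
print(f"points only:   AIC = {aic_points:.1f}")
print(f"pass eff only: AIC = {aic_pass:.1f}")
print(f"both:          AIC = {aic_both:.1f}")
```

The combined model should post the lowest AIC here, since both predictors carry independent information about the outcome — the same logic behind the multi-variable rows at the bottom of the table below.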

(note that offensive fumble and interception rates mean turnovers by your team, and the defensive values are turnovers by your opponent)

What you can see is that, out of all the single-variable models, points is actually the most effective.  That’s because points scored is largely a stand-in for most of the other variables (perhaps not the defensive ones; we could add points against for those).  We can also see that offensive pass efficiency is the most predictive of the non-points variables, as Brian often says, but that fumble rate isn’t as bad as his measure makes it look.  Defensive fumble rate, in particular, appears to be the most predictive variable after offensive and defensive pass efficiency: if your opponents have fumbled a lot in the past, you will win more often in the future.

The last two lines in the table remind us that, when it comes to predicting winners, it pays to think at a more complex level.  The ‘exclude points’ line is the AIC for a model containing all of the individual variables above except points, and it fits better than any single variable by a fair margin, as you’d expect: we know more about winning when we combine the available information (although points for and points against together fit nearly as well as all the other variables combined).  The last line is the same model with points scored and points allowed included, and it is more predictive still.  So the score carries predictive information that does not appear in the other variables listed.  Perhaps Brian’s success rate captures some of it; I don’t have that in my data set, so I can’t say.  But if I did, I would evaluate it directly by using it to predict future winning, and skip the middle man of reliability and correlation with the past.