This post follows directly from my last; you should go read that now. In terms of improving our NFL winning model, let's start with the opponent's penalty rate, since I already mentioned it. If I add that to the model I started with (the same one Brian reports), the R-squared moves up to .777. That's higher than the .76 I started with, but is it higher enough? That question is, in essence, what statistics was invented to answer, and it's one many people ignore. If one number is higher than another, then by golly it's better, right? Not necessarily. In the case of R-squared, adding a variable can never produce a lower number; it can only stay the same or go up. And if it doesn't go up much, we say that the new variable doesn't explain significantly more than our previous model did. Maybe it explains an extra half a percent of the variance, but that isn't 'enough' in the statistical sense. We test this by running an ANOVA between the two models (at least, that's one way to do it, and it's how I do it). In the case of opponent's penalty rate, the added explanatory power is indeed significant; this is supported by the very small significance value on the opponent penalty rate beta. It also isn't necessary, but it's nice to see that this beta is roughly the opposite of the team's own penalty rate, -.456 versus .47, meaning a penalty on the other team helps about as much as a penalty on my team hurts. Symmetry isn't required, but it seems plausible here.
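For readers who want to try this at home, here's a minimal sketch of the nested-model ANOVA in Python with statsmodels. The data is synthetic and the column names are made up for illustration; the real model uses the stats discussed in the post.

```python
# Sketch: does adding a variable raise R-squared *significantly*?
# Synthetic stand-in data; column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 256  # pretend: one row per team-season
df = pd.DataFrame({
    "pass_eff": rng.normal(size=n),
    "rush_eff": rng.normal(size=n),
    "opp_pen_rate": rng.normal(size=n),
})
# Build a fake outcome where opponent penalty rate genuinely matters.
df["win_pct"] = (0.5 * df.pass_eff + 0.2 * df.rush_eff
                 + 0.3 * df.opp_pen_rate + rng.normal(scale=0.5, size=n))

base = smf.ols("win_pct ~ pass_eff + rush_eff", data=df).fit()
full = smf.ols("win_pct ~ pass_eff + rush_eff + opp_pen_rate", data=df).fit()

# The F-test asks whether the jump in R-squared is bigger than chance
# would produce just from spending one more degree of freedom.
print(base.rsquared, full.rsquared)  # R-squared can only stay put or rise
print(anova_lm(base, full))          # small Pr(>F) => the new variable earns its keep
```

The same `anova_lm` comparison works for any pair of nested models, which is how each candidate variable below gets vetted.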
Working in this fashion, we can add variables to my model of winning one at a time and see whether each significantly adds to what we know about what leads to winning. So what matters in the end? In addition to what Brian lists and opponent penalty rate, average punt distance, number of interceptions, times sacked, first downs, and points scored (each for both a team and its opponents; everything new is a per-game average, not per-play efficiency) all improve the model, up to an R-squared of .88. However, what we have to watch for now is collinearity. Collinearity is when some of your explanatory variables also explain each other (in addition to explaining your dependent variable, which is what we want). This can lead to two things: making previously significant variables non-significant, and 'bouncing betas', or unreliable regression weights. Adding first downs to the model, for example, leaves fumbles-lost rate and running efficiency no longer significant, and cuts the effect of passing efficiency in half, although it remains significant. Putting average points scored in the model makes virtually everything else insignificant. Brian has argued, and reasonably so, that first downs and points are consequences of the other events, such as passing efficiently, throwing interceptions, and so on. So in this case we will lean on theory a little in choosing our model and refrain from keeping first downs or points scored.
However, in going through this process I've noticed that adding some variables affects others. For example, adding the average number of interceptions thrown improves the model, but makes some other variables non-significant. What happens if, instead of the efficiency variables, we use the count variables? It turns out that this model does not fit as well, so we should, in general, prefer the efficiency variables. But, improving on Brian's model, we should include opponent penalty rate and (team and opponent) average punt distance. It's possible (perhaps likely) that other special teams information would be useful, but I don't happen to have anything besides number of punts and punt distance. As it stands, the model has improved from an R-squared of .76 to .785, which is highly significant. In terms of symmetry, it is slightly more costly for a team to throw an interception than it is helpful to intercept the opponent, and the same is true of penalties, but passing and rushing offense are more helpful than defense. Passing offense and defense are still the most important things, followed by rushing and penalties, then interceptions, and roughly equal contributions from fumbles lost and punting.
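A wrinkle worth noting: the efficiency and count versions of the same statistic give non-nested models, so the ANOVA comparison between them doesn't strictly apply. AIC (or adjusted R-squared, when the models are the same size) is one way to pick between them. Here's a toy version where the count variable is the efficiency variable plus pace/garbage-time noise, so the efficiency model should win; the setup and numbers are invented.

```python
# Sketch: comparing non-nested models (efficiency vs. count predictors).
# Synthetic data; "pace noise" in the count variable is an assumption
# made for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 256
pass_eff = rng.normal(size=n)                  # per-play efficiency
# A per-game count mixes efficiency with how many plays a team ran.
pass_yds_pg = 7.0 * pass_eff + rng.normal(scale=3.0, size=n)
df = pd.DataFrame({"pass_eff": pass_eff, "pass_yds_pg": pass_yds_pg})
df["win_pct"] = 0.6 * df.pass_eff + rng.normal(scale=0.4, size=n)

m_eff = smf.ols("win_pct ~ pass_eff", data=df).fit()
m_cnt = smf.ols("win_pct ~ pass_yds_pg", data=df).fit()
print(m_eff.rsquared, m_cnt.rsquared)
print(m_eff.aic, m_cnt.aic)  # lower AIC = better fit-for-complexity trade
```

In this toy setup the efficiency model fits better on both measures, mirroring the pattern in the real data.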
While this makes for a neat story, and avoids having to worry about collinearity in the data even when it might improve the model, it should be noted that this isn't the whole story. Using only points scored and points given up, we can predict wins with an R-squared of .845, and the biggest model, with its collinearity issues, has an R-squared of .88. In fact, we can keep an R-squared of .872, which is better than points alone and only slightly worse than the kitchen sink, using points scored, sacks, interceptions, interception rate, and passing efficiency (offense only). This implies that there are things not in the model that lead to scoring points (the most likely contributors are special teams scoring and turnovers returned for touchdowns; perhaps also the field position where turnovers occur), and additionally that even average points scored and given up per game does not fully account for winning. This should be expected; even if a team outscores its opponents on average, it will be outscored in one game or another and lose.
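That last point is easy to convince yourself of with a quick simulation: give a team a healthy positive average scoring margin, add realistic game-to-game spread, and losses still show up. The +3.5 average margin and 13-point standard deviation below are assumed figures, not estimates from the data.

```python
# Sketch: a positive *average* point margin still produces losses.
# The margin mean and spread are assumed numbers, chosen for illustration.
import numpy as np

rng = np.random.default_rng(3)
# Simulate a long run of games for a team that outscores opponents
# by 3.5 points on average, with a 13-point game-to-game spread.
margins = rng.normal(loc=3.5, scale=13.0, size=10_000)

print(margins.mean())          # comfortably positive on average...
print((margins < 0).mean())    # ...yet a large share of games are still losses
```

With these (made-up) numbers, a team that wins on average still loses well over a third of its games, which is why per-game averages can't fully pin down the win column.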
So, now that we've looked at how to choose a model that shows us what led to a team winning games after the season is over, the next thing to do (and my big project for the fall) is to choose a model that predicts whether a team will win its next game, before that game happens. Yes, it's time to enter the wide world of sports betting. Stay tuned!