There’s an interesting post over at the Harvard Sports Analysis Collective on the accuracy of the power rankings from ESPN versus Football Outsiders. I mostly just wanted to point to it, but I have a couple of thoughts below.
First, I think it’s interesting that the ESPN ratings are pretty much as good as the FBO ratings. Yes, FBO’s accuracy is numerically higher (by about 1.5 percentage points), but even across nearly 1000 games the difference isn’t statistically significant. It’s hard to say what methods the ESPN guys use to rank their teams, but apparently in the end they do just as well as the fancy stats that FBO uses.
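As a quick sanity check on that claim, here’s a sketch of a pooled two-proportion z-test using the post’s figures (61.8% versus 63.3% accuracy; treating it as exactly 1000 games and rounding to whole win counts are my assumptions):

```python
# Rough significance check on ESPN vs. FBO accuracy, assuming the
# post's figures: 618 correct picks in 1000 games for ESPN, 633 for FBO.
from math import erfc, sqrt

espn_wins, fbo_wins, n = 618, 633, 1000

# Pooled two-proportion z-test for a difference in accuracy.
p_pool = (espn_wins + fbo_wins) / (2 * n)
se = sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (fbo_wins - espn_wins) / n / se
p_value = erfc(abs(z) / sqrt(2))  # two-sided p from the normal tail

print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

The p-value comes out near 0.5, so a gap of 1.5 percentage points over 1000 games is nowhere near significant.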
Thought 1a is that this is a good illustration of how many observations you need to tell things apart when they’re really close. If I asked you how many times you’d have to flip a coin to tell whether it comes up heads 61% of the time or 63%, I don’t think many people would say ‘about 5000 times’. Translated to the power rankings issue, if we assume that ESPN is truly 61% accurate while FBO is truly 63% accurate, a typical statistical test wouldn’t say FBO is significantly better until about 20 seasons of data were collected.
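To put a number on the coin-flip version, here’s a standard normal-approximation sample-size calculation (a one-sample test; alpha = 0.05 and 80% power are my assumed conventional choices, not anything from the post):

```python
# How many flips to tell a 61% coin from a 63% coin? Back-of-the-
# envelope sample size via the normal approximation.
from math import sqrt
from statistics import NormalDist

p0, p1 = 0.61, 0.63          # the two candidate biases
alpha, power = 0.05, 0.80    # assumed conventional test settings

z_a = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96
z_b = NormalDist().inv_cdf(power)           # ~0.84

n = ((z_a * sqrt(p0 * (1 - p0)) + z_b * sqrt(p1 * (1 - p1)))
     / (p1 - p0)) ** 2
print(round(n))   # on the order of 5000 flips
```

Dividing by the 256 games in an NFL regular season puts that at nearly 20 seasons of data, which matches the figure above.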
Second, this makes an important point about the uncertainty (or noise) in football. If you take the mental shortcut that ESPN’s rankings are ‘common sense’ and FBO’s are ‘fancy stats’, then you would think that between them they would have a pretty good sense of who should win. Yet the better team only wins 63% of the time. In other words, games that could be construed as upsets happen over a third of the time. That’s a ton of upsets.
Third, the first comment on the post mentions home field advantage (I believe it could be translated to “I think this post would be greatly improved by taking home field advantage into consideration”). Over a decent chunk of time, the home team in the NFL wins about 56% of games. So these ratings obviously do better than just picking the home team to win every game. But it’s also true that taking home field into account would help; a team with home field advantage should beat a slightly better opponent, which would not be predicted by that post’s analysis. My Luigi model has picked the correct winner about 67% of the time, I believe, using a rating along with home field. So you can do better, but it isn’t a giant improvement.
Finally, the second comment is an interesting one. It says that one would imagine that ESPN and FBO agree a fair amount of the time, and being right or wrong in those games isn’t a big deal; no one predicted (or would reasonably predict) the Cardinals-Patriots upset, for instance. So wouldn’t it be better, or more informative, to look at which method does better when they disagree? I’ve seen suggestions like that before in various circumstances, and it sounds like a reasonable one. But today I realized that it’s essentially uninformative: the records in the disagreed games are pinned down by the overall records, so you can’t find out anything beyond what you already know.
Let’s say for the sake of illustration that the commenter is right and ESPN and FBO agree about 80% of the time. And let’s say those games are generally a bit ‘easier’ than other games, so each method is right 67% of the time. ESPN and FBO have to have the same record in those games since they agree. So out of 1000 games, they agree on 800 and both go 536-264. That leaves 200 games where they disagree. But we know that overall ESPN was 61.8% and FBO was 63.3%. In 1000 games total, that means ESPN went 618-382 while FBO went 633-367. If we subtract their record in the 800 agreed-upon games from their record in the 1000 total games, we see that ESPN went 82-118 while FBO went 97-103 in the 200 disagreed games. Or, ESPN was 41% while FBO was 48.5%. In other words, FBO was better, but we already knew that from the overall record. And while the difference looks larger now than it did before (FBO is ahead by 7.5 percentage points instead of 1.5), it’s in a smaller batch of games (200 versus 1000), so we still can’t be sure that FBO is significantly better.
The size of that difference will also depend on how many games they actually agree on and how accurate the methods are in those agreed-upon games. If they actually only agree 60% of the time and are 63% accurate in those games, then ESPN is 240-160 in disagreed games while FBO is 255-145, or 60% versus 64%. The difference is smaller now, and still not significant.
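The bookkeeping in both scenarios can be sketched in a few lines. The agreement rate and the accuracy when agreeing are the assumed inputs; the disagreed-game records then follow by subtraction:

```python
# Given overall records plus an assumed agreement rate and accuracy
# when the two methods agree, the disagreed-game records are forced.
def disagreed_records(total, agree_frac, acc_when_agree, acc_espn, acc_fbo):
    agreed = round(total * agree_frac)
    agreed_wins = round(agreed * acc_when_agree)  # same for both methods
    espn_wins = round(total * acc_espn) - agreed_wins
    fbo_wins = round(total * acc_fbo) - agreed_wins
    disagreed = total - agreed
    return (espn_wins, disagreed - espn_wins), (fbo_wins, disagreed - fbo_wins)

# Scenario 1: agree on 80% of games, 67% accurate when they agree.
print(disagreed_records(1000, 0.80, 0.67, 0.618, 0.633))
# -> ((82, 118), (97, 103))

# Scenario 2: agree on 60% of games, 63% accurate when they agree.
print(disagreed_records(1000, 0.60, 0.63, 0.618, 0.633))
# -> ((240, 160), (255, 145))
```

Either way, the disagreed-game split is just the overall records with the shared wins subtracted out; no new information comes in.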
So, commenter Chase was sort of on to something by saying we should look at the games where the two methods disagree; doing so will likely make the better method look better. But does that tell us anything we didn’t already know? FBO came out a tiny bit better overall, and if we limited ourselves to the disagreed games, FBO would still look a bit better. In either case we would need far more games than we currently have to confidently say that FBO creates better rankings than ESPN. And once we had those games, what would we conclude? FBO does a little bit better, which, again, we already knew.