## Predicting the Playoffs: NBA Edition, Part One

A few years ago, Henry Abbott at the TrueHoop blog on ESPN.com started what he calls the Stat Geek Smackdown, where people known as stats guys in the NBA community pick the outcome of the various match-ups throughout the playoffs. I thought this was kind of neat, so I wanted to see if I could come up with a model for making the picks. Following on things that Henry mentioned, and that participant (and 2009 champ) Dave Berri said on his website, I decided to use point differential and home court advantage as my predictors. Point differential is the difference between points scored and points given up per game, and can be found pretty easily, such as on ESPN’s standings page. At first I didn’t think that home court advantage would matter much in the playoffs (being skeptical of the general wisdom and all), but you never know until you look. So I put together a spreadsheet with playoff series info going back to 2003, when the Spurs beat the Nets, up through the season that just ended. An entry looks like this:

| games won | point difference | opp. difference | difference | home court | team | winner |
|---|---|---|---|---|---|---|
| 4 | 7.6 | 2.7 | 4.9 | 1 | Lakers | 1 |

That describes the Lakers’ first round series against Utah in 2009. The Lakers won 4 games (i.e. won the series); they had a point differential of 7.6 through the regular season while their opponents (Utah) had a 2.7, for a difference of 4.9. The Lakers had home court advantage, and won the series. The first thing I did was run a logistic regression predicting winning a series from the point difference. It looks like this:

This graph shows the typical probability curve that I described in my earlier post; probability increases slowly at the bottom, speeds up to almost linear through the middle, then slows down again as you get closer to 1 (or 100%).  Obviously there is a relationship between point difference and winning; the more points you typically won by in the regular season compared to your opponent, the more likely you are to win the series against that opponent.  That graph ignores home court advantage; if we take that into account, we get this graph:
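For readers who want to see where the S-shaped curve comes from, here is a minimal sketch of a one-predictor logistic regression in plain Python. It is not the fit from my spreadsheet: the data are synthetic, the "true" slope of 0.4 is an assumption picked for illustration, and `fit_logistic` is a hypothetical helper that uses simple gradient ascent rather than whatever a stats package would do.

```python
import math
import random


def logistic(z):
    """Map a log-odds value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.05, steps=2000):
    """Fit P(win) = logistic(b0 + b1 * x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - logistic(b0 + b1 * x)  # observed outcome minus predicted probability
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Synthetic series, NOT the real playoff data: each "series" has a point
# difference and a win/loss drawn from an assumed logistic relationship.
random.seed(0)
TRUE_SLOPE = 0.4  # assumption for illustration only
diffs = [random.uniform(-8, 8) for _ in range(200)]
wins = [1 if random.random() < logistic(TRUE_SLOPE * d) else 0 for d in diffs]

b0, b1 = fit_logistic(diffs, wins)
curve = [(d, logistic(b0 + b1 * d)) for d in (-6, -3, 0, 3, 6)]
for d, p in curve:
    print(f"difference {d:+d}: P(win) = {p:.2f}")
```

The printed probabilities change slowly at the extremes and fastest near the middle, which is the shape described above.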

The top curve is for the team with home court, the bottom is for the team without. This looks like a pretty big difference, but the effect of home court is only at trend significance (p = .07) in the model. What that means is that even though the home team should do better numerically, there’s too much noise to be sure of that (at least at the typical .05 significance level). And while I’ve seen some other blogs that don’t seem to fully grasp the importance of the p value and pooh-pooh it, this might be a situation where the p value isn’t the final word. After all, we’re interested in the best description of winning a playoff series, so instead maybe we should look at categorization accuracy. What we can do is use the model to predict a winner; for example, if the home team has a 50% or better chance according to the model, we would pick them. If we pick the home team and they win, that is called a hit; if we pick them and they lose, it’s a false alarm. If we pick the away team and the home team wins anyway, it’s a miss; if the away team does win, it’s a correct rejection. The hit rate is the percentage of the series the home team actually won in which we picked them, and the false alarm rate is the percentage of the series the home team actually lost in which we picked them. If the model is better than guessing, the hit rate should be above the false alarm rate. We can get these two numbers for a lot of criterion values (this time it was 50%; we can calculate the hit rate and false alarm rate at each cut-off from 0 to 100%) and plot what is called an ROC curve, which shows the corresponding hit and false alarm rates at all of these criteria. The ROCs for the point difference-only and point difference plus home court models are below.
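To make the ROC idea concrete, here is a small sketch that computes the points of an ROC curve by hand. It uses the standard signal-detection definitions: the hit rate is the share of series the home team actually won in which we picked them, and the false alarm rate is the share of series the home team actually lost in which we picked them anyway. The probabilities and outcomes are made up for illustration (they are not from my spreadsheet), and `roc_points` is a hypothetical helper name.

```python
def roc_points(probs, outcomes, cutoffs):
    """Return a list of (cutoff, hit_rate, false_alarm_rate) tuples.

    probs    -- model probability that the home team wins each series
    outcomes -- 1 if the home team actually won that series, else 0
    Assumes the outcomes contain at least one home win and one home loss.
    """
    n_home_wins = sum(outcomes)
    n_home_losses = len(outcomes) - n_home_wins
    points = []
    for c in cutoffs:
        # Pick the home team whenever the model's probability reaches the cutoff.
        picks = [p >= c for p in probs]
        hits = sum(1 for pick, y in zip(picks, outcomes) if pick and y == 1)
        false_alarms = sum(1 for pick, y in zip(picks, outcomes) if pick and y == 0)
        points.append((c, hits / n_home_wins, false_alarms / n_home_losses))
    return points


# Made-up model probabilities and series outcomes, for illustration only.
probs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
outcomes = [1, 1, 1, 0, 1, 0, 1, 0]

points = roc_points(probs, outcomes, [0.0, 0.25, 0.5, 0.75, 1.0])
for cutoff, hit_rate, fa_rate in points:
    print(f"cutoff {cutoff:.2f}: hit rate {hit_rate:.2f}, false alarm rate {fa_rate:.2f}")
```

Plotting false alarm rate on the x-axis against hit rate on the y-axis gives the ROC curve. A cutoff of 0 picks the home team every time (both rates are 1), a cutoff of 1 never picks them here (both rates are 0), and a useful model bows above the diagonal in between.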

The models are about the same most of the time, but the home court model (solid line) does pop out at some places, indicating that it has a better categorization accuracy (the diagonal line indicates chance, so both models are definitely better than guessing).  So between the ROC and the trend significance value, I feel pretty good about including home court.

So what’s the upshot here? If you didn’t think that home court matters, you would just take the team with the better differential. If you do think home court matters, then the model says that the home team can be almost a point worse (point difference = -1) and be at 50/50. Conversely, the away team can be almost a point better and only be a coin flip. So the benefit of home court advantage, gained by winning more games than your opponent, is that you can actually be a worse team and have a higher chance of winning (although you can’t be too much worse). Many people would question why a team with fewer wins would be the better team, but it’s pretty well established that point differential is a better indicator of team quality than wins.
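The 50/50 point falls out of the model algebraically: the probability is 50% exactly when the log-odds are zero, so you can solve for the point difference where that happens. Here is a sketch of that calculation. The coefficients are hypothetical placeholders, not the values from my fitted model, so the break-even numbers it prints are only an illustration of the method.

```python
import math


def win_probability(diff, home, b0, b_diff, b_home):
    """P(win series) from a logistic model with point difference and home court."""
    log_odds = b0 + b_diff * diff + b_home * home
    return 1.0 / (1.0 + math.exp(-log_odds))


def break_even_difference(home, b0, b_diff, b_home):
    """Point difference at which the model gives exactly 50% (log-odds of zero)."""
    return -(b0 + b_home * home) / b_diff


# Hypothetical coefficients for illustration only; NOT the fitted values.
B0, B_DIFF, B_HOME = -0.2, 0.4, 0.4

home_even = break_even_difference(1, B0, B_DIFF, B_HOME)
away_even = break_even_difference(0, B0, B_DIFF, B_HOME)
print(f"home team is a coin flip at a point difference of {home_even:+.2f}")
print(f"away team is a coin flip at a point difference of {away_even:+.2f}")
```

With a positive home court coefficient, the home team’s break-even difference is negative (they can be a bit worse and still be even money), and the away team’s is positive.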