As I’ve been thinking about my football models and their less-than-stellar season of predictions, I was reminded of the distinction between accuracy and precision. Depending on the circumstances, those two words can mean a lot of different things. If you’re watching a football game and the announcer says the QB has an accurate arm, and then two quarters later says the guy is very precise with his throws, he probably meant the same thing each time and happened to pick different words. But in at least some science/math circles, accuracy and precision mean different things. In the interests of stats literacy, I’ll talk a bit about what they mean and how they apply to sports. Heads up: this is a long one, so grab a snack or something.
The Wikipedia article on accuracy and precision does a pretty decent job of describing the two in the way I think of them. In short, accuracy describes whether your method (a math model, a chemical test for how much of some substance is in a sample, whatever) produces values that tend to be close to the true value. Precision describes the reproducibility of that method; if you were to run it on the same sample again, would it give you the same result?
A common example is dart throwing (also illustrated in the wiki article). Imagine the game is to hit the bulls-eye. An accurate dart player would have his darts centered on the bulls-eye. He might not hit it every time, but he’s basically as likely to hit high as low or left as right. An inaccurate thrower would be biased in some way, maybe tending to throw high or to the right. Separate from that, a thrower could be precise, meaning that he generates tight clusters of throws, or imprecise, meaning that the throws tend to be spread out from each other. A precise, yet inaccurate, thrower might put all of his darts in the 20 while aiming for the bulls-eye. An imprecise, yet accurate, thrower might put his darts in a circle around the bulls-eye (again, when each throw was supposed to go dead center). A precise and accurate thrower would be Bullseye while an imprecise and inaccurate thrower would be an accident waiting to happen.
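If it helps to see the dart example in numbers, here is a minimal R sketch. Everything in it is made up for illustration: the throw() helper, the one-dimensional "distance above the bulls-eye" scale, and the bias and spread values are all my assumptions, not measurements of real dart players.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
n <- 1000    # number of simulated throws per player

# Hypothetical helper: 'bias' shifts the center of the throws (accuracy),
# 'spread' is the standard deviation around that center (precision).
# Positions are vertical distances from the bulls-eye in made-up units.
throw <- function(n, bias, spread) rnorm(n, mean = bias, sd = spread)

accurate_imprecise <- throw(n, bias = 0,  spread = 5)  # centered but scattered
precise_inaccurate <- throw(n, bias = 10, spread = 1)  # tight cluster, wrong spot

# The mean shows the bias (accuracy); the SD shows the scatter (precision)
c(mean = mean(accurate_imprecise), sd = sd(accurate_imprecise))
c(mean = mean(precise_inaccurate), sd = sd(precise_inaccurate))
```

The first thrower should average close to 0 with a large SD, while the second should average close to 10 with a small SD.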
Outside of being able to describe dart players, what do accuracy and precision have to do with sports? Well, a lot of sports analysis is based on regression these days. Regression is interesting from this perspective because, assuming that the basic statistical assumptions for regression are met, the regression equation you get out is guaranteed to be accurate. Let’s say I run a regression using people’s height to predict their weight. If you did everything right and the equation says that 6 foot tall people weigh 190 pounds (just making up numbers here), then by golly 6 foot tall people do indeed weigh 190 pounds… on average.
The real issue with a regression is that while it’s guaranteed to be accurate, there are no guarantees as to its precision. You should know going in that not every 6 foot person is going to weigh 190 pounds. There will be some spread to the data, and thus your prediction will be wrong for some people. The question is only how much. Will they tend to be pretty close to 190, or could they be anywhere from 120 to 400 pounds? One way to describe that spread would be the r squared value for the regression, which I’ve talked about before.
Let’s move to a sports example. The one I like to use when talking about this kind of topic is the pioneering work on the usage-efficiency trade-off done by Eli Witus. The article can be found here. In short, he took the offensive ratings and usage rates for individual players and used them to calculate a predicted offensive rating for a line-up made of five of those players. He then found the actual offensive rating that line-up generated when it played in actual NBA games. What Eli found is that the regression equation predicting actual line-up efficiency from line-up usage has a slope that is positive but less than one. This means that line-ups consisting of players with below-average usages have offensive ratings below the sum of their individual ratings while line-ups of players with above-average usages have ratings above what you would expect from the individual players.
Given what I said earlier about regression and accuracy, this means that you should expect a usage-efficiency trade-off. If you create a line-up out of five players whose usage adds up to less than 100%, their offensive rating as a group should be less than the sum of their individual ratings – they’ll have to increase their usage and they will be less efficient. But what we don’t know is how precise this model is. Eli’s post has some relevant info in his R output, but I wanted to be able to visualize it as well. I don’t have the actual data so I made a fake data set with the aim of getting it to line up with his fourth analysis, where he runs the regression on the 244 line-ups that had at least 100 possessions together.
I started by making a line-up usage variable. In his chart it appears to range from about -.12 to .12. I don’t know if it’s normally distributed or not, but let’s assume it is; we do know that it’s centered at 0 because he centered all the usages. We know we want 244 observations. In R, I used the code usage=rnorm(244,0,.045) to create a variable called ‘usage’; it has 244 entries (one for each fake line-up) with a mean of 0 and a standard deviation of .045. That SD seemed to put me in about the range I wanted to be; my data set has a range from -.109 to .117. Close enough to the real thing (note that rnorm will pull different random numbers each time, so if you run this or I run it again, the numbers will change a bit).
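As a rough check on that choice of SD, the sketch below draws one fake usage variable and then repeats the draw many times to see where the minimum and maximum typically land. The seed and the 1000 repetitions are arbitrary choices of mine, so the exact numbers will not match the ones quoted above.

```r
set.seed(123)  # arbitrary seed; not the one behind the numbers in the post

# One fake set of 244 centered line-up usages
usage <- rnorm(244, mean = 0, sd = 0.045)
range(usage)
sd(usage)

# Repeat the draw 1000 times and average the minimum and maximum,
# to see what range an SD of .045 tends to produce with 244 line-ups
ranges <- replicate(1000, range(rnorm(244, mean = 0, sd = 0.045)))
typical_range <- rowMeans(ranges)  # first entry: typical min, second: typical max
typical_range
```

With 244 draws, the typical extremes should sit a bit under three SDs from 0, which is in the neighborhood of the -.12 to .12 range in the chart.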
Next I need to create the efficiency difference variable. We want it to have a certain relationship with usage; namely, it should fit the equation that Eli found, efficiency difference = -.01 +.27*usage. To illustrate what the noise might look like and to see how much noise I need to get the other regression values that Eli reported, I made a few versions of the efficiency difference variable, which you can see below.
First I made one with no noise. That’s accomplished by the R code effdif=-.01+.27*usage. You just use the equation and assume it’s perfectly precise. Then you fit a regression to the data with the code fit1=lm(effdif~usage). I plotted the data using plot(usage,effdif) and put the regression line on it with abline(fit1$coef,lwd=2). Here’s the graph:
As you can see, every data point (the circles) falls directly on the line described by the regression equation. If we look at the details for the regression (summary(fit1)), we see that the errors for the regression coefficients (the intercept and the slope for usage) are extremely tiny and the R squared value is 1. That means we can perfectly predict line-up efficiency from line-up usage: the model is perfectly accurate and precise. This is obviously not the case in real life, so we need to add some noise to the data. I do this by creating a second set of efficiency differentials; for example, I can use the code effdif2=-.01+.27*usage+rnorm(244,0,.01). This creates a set of efficiencies that follow the regression equation but each individual observation has random noise added to it; that noise has a mean of 0 (meaning it should leave the mean efficiency difference at -.01) and a standard deviation of .01. In short, some line-ups will be more efficient than expected by the regression equation and others will be less efficient. The data and regression line I got produce the graph below.
We see that the data don’t all fall on the same line, but the regression equation itself is pretty much the same (-.009+.28*usage). What changes are the error (or precision) numbers associated with the regression. The r squared is now about .6 and the standard error for the regression slope is about .015. Eli’s numbers were .04 for r squared and .09 for the slope error; let’s see if I can get that. I had to move the noise SD up to .065, and a run of that gave an r squared of .048 and a slope error of .088; close enough. As you can see, the data points are spread out from the regression line pretty well. 
More importantly, you can see that there are plenty of dots to the right of 0 (high-usage line-ups) where the efficiency differential is below 0 (they became less efficient) and also plenty of dots to the left where the differential is above 0 (low-usage line-ups that became more efficient). These dots describe a reverse usage-efficiency trade-off: increasing your usage is associated with being more efficient.
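For anyone who wants to play with the noise level, here is a small self-contained sketch of the two noisy versions described above. The fit_with_noise() helper and the seed are mine, added for illustration; because the data are random, a run will land near, not exactly on, the r squared and slope error values quoted in the text.

```r
set.seed(42)  # arbitrary seed; results shift a little with a different one
usage <- rnorm(244, mean = 0, sd = 0.045)

# Hypothetical helper: build the efficiency difference with a given noise SD,
# refit the regression, and return the slope, its standard error, and r squared
fit_with_noise <- function(noise_sd) {
  effdif <- -0.01 + 0.27 * usage + rnorm(length(usage), mean = 0, sd = noise_sd)
  s <- summary(lm(effdif ~ usage))
  c(slope     = unname(coef(s)["usage", "Estimate"]),
    slope_se  = unname(coef(s)["usage", "Std. Error"]),
    r_squared = s$r.squared)
}

low_noise  <- fit_with_noise(0.01)   # the first noisy version
high_noise <- fit_with_noise(0.065)  # the version tuned toward Eli's numbers
round(rbind(low_noise, high_noise), 3)
```

The slope estimate should stay in the same general area in both rows; what moves is the standard error and the r squared, which is the precision side of the story.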
So what gives? As it turns out, the relationship between usage and efficiency isn’t very precise. Part of this is described by the r squared; check out that post I linked to earlier for a longer discussion. Part of it is also illustrated by the standard error on the slope. In Eli’s data, it’s .09. We can use that along with the slope (.27) to create confidence intervals for the slope; the 95% confidence interval should be about .09 to .45. What does that mean? Loosely, it means the data are consistent with a true slope anywhere between .09 and .45; if we collected new sets of data (say, the usages and efficiencies from last year instead of 2007) and built an interval the same way each time, about 95% of those intervals would contain the true slope. That’s quite a range, huh? But you’re probably thinking that the range is all above 0, so we should still have the usage-efficiency trade-off that we know and love.
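The interval arithmetic can be reproduced from the slope, its standard error, and the number of line-ups. This is a sketch that assumes the usual t-based interval with 244 - 2 degrees of freedom; the variable names are mine.

```r
slope    <- 0.27     # slope from Eli's regression
slope_se <- 0.09     # its standard error
df       <- 244 - 2  # line-ups minus the two estimated coefficients

# 95% confidence interval: estimate plus or minus the t multiplier times the SE
ci <- slope + c(-1, 1) * qt(0.975, df) * slope_se
round(ci, 2)  # 0.09 0.45
```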
Unfortunately it gets a bit worse if you were a coach and wanted to know what you could expect from a particular line-up. Note that the range I just mentioned covers the possible values using all the line-ups in an entire season. A particular line-up in that season would have its own noise compared to that slope, the same way that the points in the graphs above don’t all line up perfectly with our regression line. To get intervals for a particular observation (what is typically called a prediction interval as opposed to a confidence interval) you have to add additional uncertainty to that range. I took that last graph from above and added some confidence interval points (triangles) and prediction interval points (plus signs) to give you an idea of the spread involved.
The confidence intervals essentially say that we can be 95% sure that future group averages will fall in that range while the prediction intervals say that we can be 95% sure that individual line-ups will fall in that range. If I take every line-up where the individual players’ usages add up to 100% (an average line-up), I can be 95% sure that the average efficiency of all those line-ups will be within about .9 points of their expected offensive ratings. If I picked one group of guys whose individual usages add up to 100%, I can be 95% sure that their offensive rating will be within about 12 points of their expected rating, above or below. For reference, there have been 87 line-ups so far this year that have played at least 100 minutes together. That 24 point swing would cover 72 of them, moving you from Chicago’s top unit to their worst. It’s more than the difference between when the Thunder play Durant, Ibaka, Martin, Sefolosha, and Westbrook together (122.6 points per 100 possessions) and when they play Collison, Martin, Maynor, Sefolosha, and Thabeet together (106.1 points per 100 possessions). 
But while you might expect that second group to be worse than the first, knowing who they are, remember that’s the range of possibilities for a single line-up with an average usage rate.
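To see the gap between the two kinds of interval, here is a sketch using predict() on fake data built like the high-noise version. I'm assuming the efficiency difference is in points per possession, so multiplying by 100 gives points per 100 possessions; the seed and variable names are mine, and the widths will wobble a bit from run to run.

```r
set.seed(7)  # arbitrary seed
usage  <- rnorm(244, mean = 0, sd = 0.045)
effdif <- -0.01 + 0.27 * usage + rnorm(244, mean = 0, sd = 0.065)
fit    <- lm(effdif ~ usage)

# An average line-up: centered usage of 0 (individual usages add up to 100%)
avg_lineup <- data.frame(usage = 0)

conf_int <- predict(fit, avg_lineup, interval = "confidence", level = 0.95)
pred_int <- predict(fit, avg_lineup, interval = "prediction", level = 0.95)

# Half-widths in points per 100 possessions (assumes effdif is per possession)
conf_half <- 100 * (conf_int[, "upr"] - conf_int[, "lwr"]) / 2
pred_half <- 100 * (pred_int[, "upr"] - pred_int[, "lwr"]) / 2
c(confidence = unname(conf_half), prediction = unname(pred_half))
```

The confidence half-width should come out a little under 1 point and the prediction half-width somewhere around 12 to 13 points, the same ballpark as the .9 and 12 quoted above.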
To be a little more concrete and topical: the Pistons just traded for Jose Calderon and got rid of Tayshaun Prince and Austin Daye. What should they expect if they ran out a line-up of Calderon (18.2% usage), Stuckey (21.5%), Singler (15%), Monroe (25%) and Drummond (16.4%)? The equation says their 96.1% usage should take their projected offensive efficiency of 109.5 down to 107.5, but it could really be anywhere from 95 to 119.
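The Pistons arithmetic can be laid out step by step. This sketch assumes the line-up usage is centered by subtracting 100%, that the equation's output is in points per possession, and that the 12-point prediction half-width from earlier applies; the variable names are mine.

```r
# Individual usage rates (percent) for the hypothetical line-up
usages <- c(Calderon = 18.2, Stuckey = 21.5, Singler = 15,
            Monroe = 25, Drummond = 16.4)

lineup_usage <- sum(usages) / 100 - 1          # centered usage, about -.039
effdif       <- -0.01 + 0.27 * lineup_usage    # Eli's equation, per possession

projected <- 109.5                             # projected offensive efficiency
expected  <- projected + 100 * effdif          # adjusted, per 100 possessions

# Rough 95% prediction range, using the 12-point half-width from the text
pred_range <- expected + c(-12, 12)
round(c(expected = expected, low = pred_range[1], high = pred_range[2]), 1)
```

The adjusted rating lands just under 107.5, and the range runs from the mid 90s to the high 110s, in line with the rounded figures above.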
Going back to where I started, I believe that my football model is accurate the same way people believe that there’s a usage-efficiency trade-off. Obviously it can only be so precise, however. That’s how a given model can predict below chance even over an entire season. The noise in the predictions can get you. So one of my goals as I tweak the models is to see if there’s anything I can do to tighten the predictions up a bit. There might not be, but hopefully there’s something to be done.