## A Word on Probability

Part of what I want to do with this blog is spread a little knowledge around.  This will be the first post like that.  Also, it gives background on the technique I’ll use in the next post.  I’m going to try to find the right spot between detail and accessibility, so any feedback would be great.

One of the more powerful techniques in statistics is linear regression.  What it attempts to do is take a bunch of data from, say, two variables and find the line that best describes the relationship between them.  So if I got 100 people and measured their heights and weights and made a graph of them, linear regression would find the equation of the line that best connects height and weight.  If you remember back to grade school, the equation of a line is y = a*x+b, where y and x are the values of your two variables (y is the one you’re predicting from x), a is the slope (how much y changes when x changes by 1), and b is the intercept (what value y has if x is zero).
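To make the line-fitting idea concrete, here is a minimal sketch of ordinary least squares, the usual way of picking the "best" line, for a single x variable. The heights and weights are made up for illustration, and `fit_line` is a name I chose for this example rather than anything from a particular library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: chosen so the line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


# Hypothetical heights (inches) and weights (pounds), not real measurements.
heights = [60, 64, 66, 68, 70, 72, 75]
weights = [115, 130, 145, 155, 165, 180, 200]

a, b = fit_line(heights, weights)
print(f"weight = {a:.2f} * height + {b:.2f}")
```

The slope `a` reads as "pounds gained per extra inch," and `b` is the (not very meaningful) predicted weight at a height of zero.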

If you remember another thing about lines, it’s that they’re infinite.  Since there’s no cap on x, there’s no cap on y.  With something like height and weight this is sort of a problem because in theory no one can have a height (or weight) below 0, but it doesn’t matter in practice because no one has a weight (or height; whichever way you’re predicting) of 0.  Put another way, all the data is far enough away from the endpoint that you don’t have to worry about ridiculous numbers as long as you have enough data for a good estimate.  But one place where this is a real problem is when you want to predict something binary, something that only happens or doesn’t happen.  In this case all of your actual y values are 0s or 1s, and the predicted y values come out as proportions (.25, .5, .72, etc).  But since x can take on any value, and y should be close to 0 and 1 sometimes, there are often situations where you predict a value below 0 or above 1, which is impossible.
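Here is a small sketch of that problem in action: fitting a straight line to 0/1 outcomes and checking the predictions. The data are invented, and `rating_gap` is a hypothetical stand-in for whatever x variable you might use to predict a win.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x


# Hypothetical data: x is some rating gap between two teams,
# y is 1 if the first team won and 0 if it lost.
rating_gap = [-10, -6, -3, -1, 1, 3, 6, 10]
won = [0, 0, 0, 1, 0, 1, 1, 1]

slope, intercept = fit_line(rating_gap, won)
predicted = [slope * x + intercept for x in rating_gap]

for x, p in zip(rating_gap, predicted):
    # Flag any "probability" that falls outside the possible 0-to-1 range.
    flag = "" if 0.0 <= p <= 1.0 else "  <-- impossible"
    print(f"gap {x:+3d}: predicted {p:.3f}{flag}")
```

Even inside the range of the data, the predictions at the two extremes fall below 0 and above 1.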

This sounds sort of abstract, but it becomes important any time you want to predict if a team will win or not.  This is a binary outcome, and our actual prediction will be the probability of a team winning (a percentage is just 100 times a proportion, as in .25 = 25%, and I’ll probably slip back and forth between the two).  So you obviously don’t want to predict that team A has a 105% chance of winning this game, even if the players tell you they’re giving 110%.  The other issue with probability is that even within the 0 to 1 range, it very rarely changes linearly with some other variable.  Instead it increases slowly at the bottom, becomes somewhat linear through the middle, and then increases slowly again at the top as it approaches 1.  That is to say, a one unit change in x at the top or bottom changes the probability less than the same one unit change in the middle.
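A quick numerical sketch of that S shape follows. I am assuming the standard logistic curve, 1 / (1 + e^(-x)), as the example here; it is one common S-shaped curve, not the only possible one, and the x values are arbitrary.

```python
import math


def s_curve(x):
    """Standard logistic curve, used as an assumed example of an S shape."""
    return 1.0 / (1.0 + math.exp(-x))


def one_unit_change(x):
    """How much the probability moves when x goes from x to x + 1."""
    return s_curve(x + 1) - s_curve(x)


# Compare the same one-unit step at the bottom, middle, and top of the curve.
for label, x in (("bottom", -5.0), ("middle", -0.5), ("top", 4.0)):
    print(f"{label:>6}: x {x:+.1f} -> {x + 1:+.1f} "
          f"changes probability by {one_unit_change(x):.4f}")
```

The step through the middle moves the probability far more than the same step at either end.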

So what are we to do if we want to use our excellent linear regression methods with non-linear probability?  We move to what’s called logistic regression.  Skipping over some of the details, we want to find a way to make probability suitable for linear regression.  If you take the probability p and change it into odds, we’re partway there.  The odds are p/(1-p), so they can vary from 0 (when p=0) to infinity (when p=1).  Lines can go all the way to negative infinity as well though, so we’ll take the natural logarithm (natural log) of the odds, and that does the trick: the log of the odds runs from negative infinity (when p=0) through 0 (when p=.5) to positive infinity (when p=1).  So now we can take variables of interest and draw a straight line to describe how the log of the odds moves with those variables.  In practice it isn’t as bad as it sounds since most statistics are run on computers now, so you can give your stats program your 0s and 1s and whatever other variables and say ‘run a logistic regression!’, and it’ll do the heavy lifting.
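For the curious, here is a rough sketch of what the computer is doing. It converts between probability, odds, and log odds, then fits a line to the log odds by simple gradient ascent on the likelihood. Real statistics packages use faster and more careful methods; the function names, the learning rate, the number of steps, and the data are all my own assumptions for illustration.

```python
import math


def odds(p):
    """Probability to odds: p / (1 - p)."""
    return p / (1.0 - p)


def log_odds(p):
    """Natural log of the odds; runs from -infinity to +infinity."""
    return math.log(odds(p))


def probability_from_log_odds(z):
    """Undo the log-odds transform; the result is always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.01, steps=5000):
    """Fit log odds = a*x + b by plain gradient ascent (a teaching sketch)."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = probability_from_log_odds(a * x + b)
            # (y - p) is how far the current prediction is from the outcome.
            grad_a += (y - p) * x
            grad_b += (y - p)
        a += learning_rate * grad_a
        b += learning_rate * grad_b
    return a, b


# Hypothetical data: some rating gap between two teams, and whether
# the first team won (1) or lost (0).
rating_gap = [-10, -6, -3, -1, 1, 3, 6, 10]
won = [0, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_logistic(rating_gap, won)
for x in (-14, 0, 14):
    z = a * x + b
    p = probability_from_log_odds(z)
    print(f"gap {x:+3d}: log odds {z:+.2f}, win probability {p:.3f}")
```

The line lives on the log-odds scale, where it is free to run off to infinity in either direction; converting back to probability is what keeps every prediction between 0 and 1.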

The next post I write will be a bit more conceptual, using these ideas in response to a post on Advanced NFL Stats, a site I really like but happen to disagree with on a particular point.  After that I’ll have something on winning playoff series in the NBA and NHL that also uses logistic regression.  Hopefully seeing it in action will make this a bit clearer.