During the off-season last year I was thinking about some discussions I had had with another person in the computer rankings world during the prior off-season. Basically my belief has been that the Vegas line is unbiased, and therefore which team covers the spread is essentially a random event. So it shouldn't matter whether a computer system is 1 point away from the line or 7 points away; either pick would be equally likely (roughly 50%) to be correct against the spread. People write to me and ask me to run the numbers against the spread for subsets of games, for example games where the computer prediction is 3 points or more from the line. My observation over the years has been that these cut points don't really improve the numbers, so they are not as useful as people think they would be.
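To make the cut point idea concrete, here is a rough sketch in Python of the kind of check people ask for. The arrays are made-up placeholders, not real games or real results.

```python
import numpy as np

# Placeholder data: model prediction, closing line, and actual outcome for
# each game, all expressed as (home score - road score).
pred   = np.array([ 4.0, -2.0, 10.0, -6.0,  1.0,  8.0])
line   = np.array([ 3.0,  1.0,  2.5, -7.0, -2.0,  6.5])
actual = np.array([ 7.0, -3.0, -1.0, -3.0, 10.0,  3.0])

def ats_record(min_gap):
    """ATS win rate on games where the model is at least min_gap from the line."""
    picks = np.sign(pred - line)       # +1: take the home side, -1: take the road side
    covered = np.sign(actual - line)   # which side actually covered (0 is a push)
    mask = (np.abs(pred - line) >= min_gap) & (covered != 0)
    return (picks[mask] == covered[mask]).mean()

for gap in (0, 3, 7):
    print(f"cut point {gap}: {ats_record(gap):.0%} against the spread")
```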
So I was thinking about ways to test this belief. Pondering that led me to look at the line itself. The line is approximately unbiased: if you look at the bias numbers, they are historically centered around zero. That in itself goes a long way towards explaining why it is hard to do much better than 50% betting against the line. But if you look at the absolute error, there appears to be a disconnect. On average the bias is close to zero, while the average absolute error is 10-11 points. So even though the difference between the line and the actual score is close to zero on average, the line is off by 10 points or more in more than 50% of games. That is mind blowing when you think about it. Shouldn't that imply that there are a lot of opportunities to beat the spread? If the line is off by so much so often, then why can't someone or some system consistently find those holes? Personally, I think it goes back to my original theory: the difference between the line and the actual score is random and centered close to zero.
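For anyone who wants to check this on their own data, the bias and the absolute error are easy to compute. The sketch below uses a few made-up games just to show the calculation; the numbers themselves mean nothing.

```python
import numpy as np

# Placeholder data: closing line and actual score differential for each game,
# both as (home score - road score).
line   = np.array([-3.0, 7.0, 2.5, -6.5, 3.0])
actual = np.array([-10.0, 3.0, 14.0, -7.0, 17.0])

error = actual - line
bias = error.mean()                        # near zero if the line is unbiased
mae = np.abs(error).mean()                 # the average miss, historically 10-11 points
share_off_by_10 = (np.abs(error) >= 10).mean()

print(f"bias: {bias:.2f}, MAE: {mae:.2f}, off by 10+: {share_off_by_10:.0%}")
```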
So I started to think about ways to reduce the amount of this somewhat extreme variability, or random error. It seemed clear that the most likely candidate was turnovers, which are generally considered to be random bounces of the ball. Hmm, random error, random turnovers, sounds like there could be a connection. The problem was that I don't collect individual game statistics, so I couldn't investigate this idea. I eventually found a source of data so that I could look at the 2010 season. What I found was that the turnover margin in a game explained roughly 40% of the difference between the line and the actual outcome. I got very excited. The 2011 preseason was about to start, so I needed to come up with a system that incorporated the turnovers along with the scores. What I did was very simple: I just added turnover margin as a new variable in the least squares regression model that I have been running for years. For these models the outcome is the score differential. The predictors are a matrix of the games: for each game the variable for the home team is 1, the variable for the road team is -1, and all other team variables are zero. To this I added the turnover margin for the game. I knew the results early in the season would be meaningless because this system is based only on games of the current year, so it could take some time before it became stable. When it did kick in, it really kicked in. As you can see from the NFL prediction tracker results page, this new system came in first place in 3 out of the 5 categories over the second half of the season.
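For the curious, here is a rough sketch of that design in Python with a handful of made-up games. The teams, scores, and turnover margins are placeholders; the point is just the +1/-1 team matrix plus the turnover column.

```python
import numpy as np

# Made-up games: (home team, road team, home score - road score, home turnover margin)
games = [
    ("A", "B",  10, +1),
    ("C", "D",  -3, -2),
    ("B", "C",   7,  0),
    ("D", "A", -14, -1),
    ("A", "C",   3, +2),
    ("B", "D",  -6, -1),
]
teams = sorted({t for g in games for t in g[:2]})
idx = {t: i for i, t in enumerate(teams)}
n, p = len(games), len(teams)

# Team columns: +1 for the home team, -1 for the road team, 0 otherwise,
# plus one extra column holding the turnover margin of the game.
X = np.zeros((n, p + 1))
y = np.zeros(n)
for g, (home, road, diff, to_margin) in enumerate(games):
    X[g, idx[home]] = 1.0
    X[g, idx[road]] = -1.0
    X[g, p] = to_margin
    y[g] = diff

# Ordinary least squares; team ratings are only identified up to an additive constant.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
ratings, pts_per_turnover = coef[:p], coef[p]
print({t: round(r, 2) for t, r in zip(teams, ratings)}, round(pts_per_turnover, 2))
```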
I'm digging into the numbers a little bit more here now that the season is over. Looking at the actual regression models for this season, the turnovers explained a little less than 40% of the error in the line, and it was consistently in the 35-40% range all season long. Comparing the regression models with and without the turnover variable, the R-square of the model with turnovers is 0.60 and the R-square of the model without it is 0.33. That is a very large difference for adding only one variable to a model. But of course predicting future games is very different from fitting prior games, so I was never expecting the mean error in predictions to drop by 40%. That would mean reducing it from 10 points a game down to about 6. So how much did it improve the original least squares predictions? In straight-up game winners it was 4 games better. Against the spread it was 11 games better. For absolute error it was about a 6.8% improvement, and for mean square error it was an 11% improvement. So all in all I think the results were very good. Now it will be interesting to see whether the results are repeatable year to year.
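As a rough illustration of the with-versus-without comparison, here is a small simulation in Python. The schedule, ratings, turnover value, and noise level are all invented, chosen only so the gap in R-square is visible; this is not my actual model or data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teams, n_games = 32, 256

# Invented season: random matchups, a true rating per team, a turnover effect
# of about 4.5 points, and roughly 10 points of game-to-game noise.
true_rating = rng.normal(0, 4, n_teams)
X_teams = np.zeros((n_games, n_teams))
to_margin = rng.integers(-3, 4, n_games).astype(float)
for g in range(n_games):
    home, road = rng.choice(n_teams, size=2, replace=False)
    X_teams[g, home], X_teams[g, road] = 1.0, -1.0
y = X_teams @ true_rating + 4.5 * to_margin + rng.normal(0, 10, n_games)

def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print("without turnovers:", round(r_squared(X_teams, y), 2))
print("with turnovers:   ", round(r_squared(np.column_stack([X_teams, to_margin]), y), 2))
```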
My original thought was to go another step: first try to predict the turnover margin in a game, and then plug that into the model to see if it further improved the predictions. The problem was that I wasn't able to find a way to reliably predict the turnover margin between two teams. That does appear to be pretty random. So for now the predictions from this model assume that the turnover margin will be zero. If you are curious, a turnover was worth an average of 4.53 points this past season. So if a team was favored to win by 3 points and they were +1 in turnovers in the game, they averaged a win by about 7.5 points. If they were a 3 point favorite and -1 in turnovers, they averaged losing by about 1.5 points. So you can see why the favorites don't always win: a 3 point favorite that loses the turnover battle loses the game on average, and a touchdown favorite can lose a game by being -2 in turnovers. The average turnover margin was +0.21 in favor of the home team, so I could have possibly tried adding 0.21*4.53 = 0.95 points to each home team's prediction.
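Here is that same back-of-the-envelope arithmetic written out as a tiny sketch, using the 4.53 points-per-turnover figure from this season.

```python
POINTS_PER_TURNOVER = 4.53   # average value of one turnover this past season

def adjusted_margin(predicted_margin, turnover_margin):
    """Expected margin if the favored team ends up at the given turnover margin."""
    return predicted_margin + POINTS_PER_TURNOVER * turnover_margin

print(adjusted_margin(3, +1))   # ~7.5: 3-point favorite that wins the turnover battle
print(adjusted_margin(3, -1))   # ~-1.5: 3-point favorite that loses it
print(adjusted_margin(7, -2))   # ~-2.1: touchdown favorite at -2 in turnovers
print(round(0.21 * POINTS_PER_TURNOVER, 2))  # ~0.95 point bump from the average home TO margin
```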
I'd be interested in hearing anyone's thoughts for or against my theory that the winner against the spread is random, or any other ideas that explain even more of the error in the line.