Thursday, July 8, 2010

Building The Super Smart Model

In a previous blog we investigated the information content of bookie prices in the context of predicting the victory margin of a game.


There I introduced this Bookie Model:

Predicted Victory Margin = 20.7981*ln(Prob) + 56.8537*Prob^2

where 'Prob' is the relevant team's probability of victory as implied by the bookie's price (and ln is the natural log).
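To get a feel for what the Bookie Model predicts at different probabilities, here is a minimal Python sketch of the formula. The function name and the sample probabilities are mine, chosen purely for illustration; only the two coefficients come from the model itself.

```python
import math


def bookie_model_margin(prob: float) -> float:
    """Predicted victory margin from the Bookie Model formula.

    prob is the team's probability and must lie strictly between 0 and 1,
    since ln(prob) is undefined at 0.
    """
    if not 0.0 < prob < 1.0:
        raise ValueError("prob must be strictly between 0 and 1")
    return 20.7981 * math.log(prob) + 56.8537 * prob ** 2


# Illustrative probabilities only, not taken from any actual game.
for p in (0.25, 0.50, 0.75):
    print(f"Prob = {p:.2f} -> predicted margin = {bookie_model_margin(p):+.1f}")
```

Note that a team rated a 50% chance comes out with a predicted margin very close to zero, which is what we'd hope for.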

For comparison, I created what I called the Smart Model, constructed using only the information about a team's recent form and the location of the game under consideration; importantly, in constructing this model, I did not use any information about the team's market price.

This was the Smart Model:

Predicted Victory Margin = 0.091119*Ave_Res_Last_2 + 672.347*Own_MARS - 672.483*Opp_MARS + 13.9791*Interstate_Clash

where,
Ave_Res_Last_2 is the averaged result of the team's last 2 games,
Own_MARS is the team's own MARS Rating prior to the game divided by 1000,
Opp_MARS is the team's opponent's MARS Rating prior to the game divided by 1000, and
Interstate_Clash = +1 if the team is playing in its home state and its opponent is playing interstate, -1 if the reverse is true, and 0 if neither or both teams are playing interstate.
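The variable definitions above translate fairly directly into code. What follows is a sketch only: the function and argument names are hypothetical, and the ratings and recent results in the example are made up to show the arithmetic rather than drawn from any real game.

```python
def interstate_clash(team_interstate: bool, opp_interstate: bool) -> int:
    """+1 if only the opponent is playing interstate, -1 if only the team is,
    and 0 if neither or both are playing interstate."""
    return int(opp_interstate) - int(team_interstate)


def smart_model_margin(ave_res_last_2: float,
                       own_mars_rating: float,
                       opp_mars_rating: float,
                       clash: int) -> float:
    """Predicted victory margin from the Smart Model formula.

    Ratings are passed in raw form and divided by 1000 here, as the
    variable definitions require.
    """
    own_mars = own_mars_rating / 1000
    opp_mars = opp_mars_rating / 1000
    return (0.091119 * ave_res_last_2
            + 672.347 * own_mars
            - 672.483 * opp_mars
            + 13.9791 * clash)


# Made-up inputs: a team rated 1010 at home to an interstate team rated 990,
# having won its last 2 games by an average of 12 points.
clash = interstate_clash(team_interstate=False, opp_interstate=True)
margin = smart_model_margin(12, 1010, 990, clash)
print(f"Interstate_Clash = {clash:+d}, predicted margin = {margin:+.1f}")
```

Because the Own_MARS and Opp_MARS coefficients are almost equal and opposite, it is essentially the difference in the two teams' ratings that drives the prediction.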

The Smart Model performed roughly as well as - on some metrics slightly better than and on others slightly worse than - the Bookie Model. This was true both in-sample, which was the period from Round 13 of 1999 to the end of 2009, and post-sample, which was the first 12 rounds of season 2010.

Because some of the bookie data I used in the construction of the Bookie Model was not data I'd collected myself, I did wonder if I'd unfairly handicapped the bookies - about time, I hear some of you say - in making this comparison. It turns out that I hadn't.

I've since rerun the entire model-building exercise using only the data from 2006 to 2009 - which is the data I've collected myself - and found that a Smart Model built only on that data also performs roughly as well as a Bookie Model built only on that same data.

I'm therefore now far more confident in stating that bookmaker market prices contain little if any information that predicts victory margins better than that which can be statistically extracted from recent results and from knowledge of the impact of game venue.

Still, that doesn't mean that the bookie data has nothing unique to offer in predicting victory margins. In the prediction caper it's common to find out that the combined wisdom of a diverse set of informed and motivated opinion-holders is superior to the individual wisdom of any single opinion-holder.

Here then is the Super Smart Model, which is the best model I've found so far that uses all the data that the Smart Model used, but also uses the bookie's probability:

Predicted Victory Margin = 207.811 + 73.9587*Prob + 0.124386*Ave_Res_Last_2 + 8.54291*Interstate_Clash - 245.292*Opp_MARS
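As a quick sanity check on the formula, here is a small Python sketch of it. The function and argument names are hypothetical, and the example inputs are invented to show the arithmetic; only the coefficients are from the model.

```python
def super_smart_model_margin(prob: float,
                             ave_res_last_2: float,
                             clash: int,
                             opp_mars_rating: float) -> float:
    """Predicted victory margin from the Super Smart Model formula.

    opp_mars_rating is the opponent's raw MARS Rating; it is divided by
    1000 here to give Opp_MARS as the model expects.
    """
    opp_mars = opp_mars_rating / 1000
    return (207.811
            + 73.9587 * prob
            + 0.124386 * ave_res_last_2
            + 8.54291 * clash
            - 245.292 * opp_mars)


# Invented inputs: an even-money team with a zero average recent result,
# no interstate advantage either way, facing an opponent rated 1000.
print(f"{super_smart_model_margin(0.5, 0.0, 0, 1000):+.2f}")
```

For those inputs the large intercept and the Opp_MARS term almost cancel, leaving a predicted margin of about half a point either side of zero.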

It's not a pretty model but it is predictive.

For the period 1999 (Round 13) to end 2009 its Mean APE is 29.61, its Median APE is 24.91 and its R-squared is 26.7%. These are all superior to the equivalent Bookie Model and Smart Model figures.

Across the first 12 rounds of 2010, its Mean APE is 28.21, its Median APE is 24.03, and its R-squared is 34%. The Smart Model has a superior Median APE (23.64) but the Super Smart Model has the best Mean APE and R-squared.
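For anyone wanting to reproduce these sorts of figures for their own models, here is one way the three metrics might be computed. Two assumptions are baked in: that APE means the absolute difference between the actual and the predicted margin, and that R-squared is calculated as 1 minus the ratio of the residual sum of squares to the total sum of squares. The margins in the example are made up.

```python
from statistics import mean, median


def prediction_metrics(actual, predicted):
    """Return Mean APE, Median APE and R-squared for paired margins.

    Assumes APE = |actual - predicted| and R-squared = 1 - SS_res / SS_tot.
    """
    errors = [a - p for a, p in zip(actual, predicted)]
    abs_errors = [abs(e) for e in errors]
    actual_mean = mean(actual)
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((a - actual_mean) ** 2 for a in actual)
    return {
        "mean_ape": mean(abs_errors),
        "median_ape": median(abs_errors),
        "r_squared": 1 - ss_res / ss_tot,
    }


# Made-up actual and predicted margins for five games.
actual = [30, -12, 45, -60, 5]
predicted = [20, 4, 25, -35, -10]
for name, value in prediction_metrics(actual, predicted).items():
    print(f"{name}: {value:.3f}")
```

Lower is better for both APE figures, and higher is better for R-squared.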

For the more visual amongst us, here's a graph of the Smart Model victory margin predictions against the actual results.



Well, it's one thing to be smart, but another thing entirely to be profitable.

The most obvious way to test the commercial exploitability of the Super Smart Model is to see how it would have fared making line bets on the strength of its predictions. To do this we need to account for the fact that the model makes two victory margin predictions for each game, one from each team's point of view, and that these predictions will generally not agree. Let's do what's simplest and average them.
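One plausible reading of "average them" is sketched below in Python. It assumes that each team's prediction is expressed from that team's own point of view, so the opponent's prediction has to be sign-flipped before averaging, and that we then back whichever team the averaged margin favours once the start is taken into account. The function names, the predictions and the starts are all hypothetical.

```python
def combined_margin(team_a_prediction: float, team_b_prediction: float) -> float:
    """Average the two predictions, expressed from team A's point of view.

    Team B's prediction is from B's point of view, so it is negated first.
    """
    return (team_a_prediction - team_b_prediction) / 2


def line_bet(team_a_prediction: float,
             team_b_prediction: float,
             team_a_start: float):
    """Return "A" or "B" for the team to back on the line, or None if level.

    team_a_start is positive when team A receives start and negative when
    team A gives start.
    """
    adjusted = combined_margin(team_a_prediction, team_b_prediction) + team_a_start
    if adjusted > 0:
        return "A"
    if adjusted < 0:
        return "B"
    return None


# Hypothetical game: the model has A winning by 20 and B losing by 16,
# so the averaged prediction is A by 18.
print(line_bet(20, -16, team_a_start=-12.5))  # A giving 12.5 points start
print(line_bet(20, -16, team_a_start=-24.5))  # A giving 24.5 points start
```

In the first case the averaged margin of 18 points more than covers the 12.5 points start, so the bet is on A; in the second it doesn't cover the 24.5 points, so the bet is on B.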

A few examples might help show how all this works, so here's how we might use the Super Smart Model for the current Round:



Had we adopted an approach similar to that depicted in the examples above and made line bets accordingly from the start of season 2006 through Round 12 of the current season, we'd have correctly predicted 54.1% of line winners, which is a high enough proportion to turn a small profit across that entire period. In addition, we'd have made money in every individual season except 2006, when we'd have recorded an ROI of about -4%. For the profitable seasons, ROIs would have ranged from 0.9% (this year to Round 12) to 10.1% (last year).
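To see why a strike rate of 54.1% is enough, it helps to work out the break-even point. The sketch below assumes level stakes and a fixed line-bet price of $1.90 on every wager; that price is an assumption for illustration and is not stated anywhere above, so the resulting ROI is indicative only.

```python
def break_even_win_rate(price: float) -> float:
    """Proportion of bets that must win to break even at a fixed decimal price."""
    return 1 / price


def level_stake_roi(win_rate: float, price: float) -> float:
    """Return on investment per unit staked at a fixed decimal price."""
    return win_rate * price - 1


ASSUMED_LINE_PRICE = 1.90  # assumption: not taken from the results above

print(f"Break-even win rate: {break_even_win_rate(ASSUMED_LINE_PRICE):.1%}")
print(f"ROI at 54.1% winners: {level_stake_roi(0.541, ASSUMED_LINE_PRICE):.1%}")
```

At that assumed price, anything above roughly 52.6% winners is profitable, which is consistent with 54.1% producing a small positive return.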

So, at this stage, the Super Smart Model is a candidate model for season 2011.
