MAFL Online: Modelling AFL Team Scoring : Part II

This is the second in a series of blogs about modelling the scoring of AFL teams.

In the previous blog on this topic I introduced what I called the Score Equation, which represents a way of thinking about a team's score in any game, and is as follows:

Score = Number of Scoring Shots x Conversion Rate x 6 + Number of Scoring Shots x (1 - Conversion Rate)

Then I used empirical data for seasons 2006 to 2009 to show that the bookmakers' starting prices could be used to predict with reasonable accuracy the number of scoring shots that a team will produce in a given game, and that teams, regardless of the number of scoring shots they produce, generally convert about 53.64% of them.

This allowed me to take the Score Equation and turn it into the Expected Score Equation, which allows the prediction of a team's expected score based on its starting price (and that of its opponent) and whether or not it is playing at home.

Expected Score = 69.1047 + 49.116 x Team Probability + 2.7176 x Home Ground Status

Over the period 2006 to 2009 this equation predicted the average scores of teams with an acceptably high level of accuracy.
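As a rough check on how the Expected Score Equation relates to the Score Equation, the sketch below feeds the expected number of scoring shots (from the equation fitted in the previous blog) and the 53.64% conversion rate into the Score Equation, then compares the result with the Expected Score Equation. The function names, the coding of Home Ground Status as 1 for a home team and 0 otherwise, and the example of a home team with a 60% chance of victory are all assumptions made for illustration.

```python
CONVERSION_RATE = 0.5364  # average share of scoring shots that become goals


def expected_scoring_shots(team_probability, is_home):
    """Expected scoring shots, using the coefficients quoted in the previous blog."""
    home = 1 if is_home else 0  # assumed coding of Home Ground Status
    return 18.7683 + 13.3395 * team_probability + 0.7381 * home


def score(scoring_shots, conversion_rate):
    """Score Equation: 6 points per goal plus 1 point per behind."""
    goals = scoring_shots * conversion_rate
    behinds = scoring_shots * (1 - conversion_rate)
    return 6 * goals + behinds


def expected_score(team_probability, is_home):
    """Expected Score Equation with the coefficients quoted in the text."""
    home = 1 if is_home else 0  # assumed coding of Home Ground Status
    return 69.1047 + 49.116 * team_probability + 2.7176 * home


# Hypothetical example: a home team rated a 60% chance of winning.
shots = expected_scoring_shots(0.60, True)
via_score_equation = score(shots, CONVERSION_RATE)
via_expected_score = expected_score(0.60, True)
print(round(shots, 2), round(via_score_equation, 2), round(via_expected_score, 2))
```

The two routes agree because each scoring shot is worth, on average, 6 x 0.5364 + 0.4636 = 3.682 points, and each coefficient in the Expected Score Equation is the corresponding scoring shot coefficient multiplied by that figure.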

The purpose of creating the Expected Score Equation was to model teams' expected, or average, scores. If we want to make probability estimates about the result of a game, however, we need more than just expected scores: we need to model the full statistical distribution of each team's scores - we need to know how their scores vary around the expected values.

For this purpose we turn our attention back to the Score Equation and to the empirical relationship between scoring shots and conversion rates, a graphic of which appeared in the earlier blog.

That graphic demonstrates that the variability in conversion rate declines as the number of scoring shots increases (the dots fan out less as we go higher in the graphic) and the average conversion rate remains near-constant as the number of scoring shots increases (the dots fan out roughly equally around a value of about 53.64% as we go higher in the graphic). This pattern suggests that a team's goal scoring might reasonably be modelled as what's called a binomial process.

It's as if teams 'generate' a number of scoring shots - more if they're playing a weaker team or playing particularly well, fewer if the opposite is true - and then convert these shots into points as though they were drawing as many balls as they have scoring shots from a vast urn with about 53.64% of the balls marked "Goal" and the remainder marked "Behind".
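The urn analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the seed, the number of trials and the three scoring shot counts are all hypothetical choices, and each shot is treated as an independent draw with a 53.64% chance of being a goal.

```python
import random
import statistics

CONVERSION_RATE = 0.5364  # share of balls in the urn marked "Goal"


def simulate_conversion(scoring_shots, rng, conversion_rate=CONVERSION_RATE):
    """Draw one ball per scoring shot and return (goals, behinds)."""
    goals = sum(1 for _ in range(scoring_shots) if rng.random() < conversion_rate)
    return goals, scoring_shots - goals


rng = random.Random(2010)  # arbitrary seed so the run is repeatable
average = {}
spread = {}
for shots in (10, 25, 40):
    # Observed conversion rate in each of 5,000 simulated games
    rates = [simulate_conversion(shots, rng)[0] / shots for _ in range(5000)]
    average[shots] = statistics.mean(rates)
    spread[shots] = statistics.pstdev(rates)
    print(shots, round(average[shots], 3), round(spread[shots], 3))
```

In runs of this sketch the average conversion rate stays close to 53.64% for every shot count while its standard deviation shrinks as the number of scoring shots grows, which is the fanning pattern described for the graphic.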

Okay, so we know how to model the conversion of scoring shots but, statistically speaking, how do teams generate those scoring shots?

In the previous blog we found that the average number of scoring shots they produce is given by:

Expected Number of Scoring Shots = 18.7683 + 13.3395 x Team Probability + 0.7381 x Home Ground Status

Further empirical analysis of the now much-analysed 2006 to 2009 period shows that the variability with which they produce scoring shots around that average follows what is called a lognormal distribution. Two characteristics of this distribution make it particularly well-suited to modelling scoring shots: it's never negative (and nor can the number of scoring shots be negative) and it's right-skewed (so its tail stretches further above the mean than below it, a characteristic shared by team scoring shot behaviour).

To perform this empirical analysis I formed subsets of the historical scoring shot data, with each subset based on the associated team's probability of victory for the match in question. In this way I formed 9 subsets of the data, one containing data for those teams with a victory probability near 10%, another for those teams with a victory probability near 20%, and so on, up to 90%.

The actual scoring shot behaviour of the teams in the games in each such subset tends to follow a lognormal distribution with a standard deviation of around 5.5 scoring shots.
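A lognormal distribution is usually specified by the mean and standard deviation of the logarithm of the variable rather than of the variable itself, so a small conversion is needed to work with a mean and standard deviation expressed in scoring shots. The sketch below uses the standard identities for that conversion; the function name, the seed and the example mean of 27.51 scoring shots are assumptions for illustration, not part of the original analysis.

```python
import math
import random
import statistics


def lognormal_params(mean, sd):
    """Convert a mean and SD in scoring shots to the log-scale (mu, sigma)."""
    sigma_sq = math.log(1 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma_sq / 2
    return mu, math.sqrt(sigma_sq)


target_mean = 27.51  # hypothetical expected scoring shots for one team
target_sd = 5.5      # standard deviation quoted in the text

mu, sigma = lognormal_params(target_mean, target_sd)

rng = random.Random(2010)  # arbitrary seed so the run is repeatable
draws = [rng.lognormvariate(mu, sigma) for _ in range(20000)]

sample_mean = statistics.mean(draws)
sample_sd = statistics.pstdev(draws)
print(round(sample_mean, 2), round(sample_sd, 2), round(min(draws), 2))
```

The sample mean and standard deviation should land close to the targets, and every draw is positive. Note that the draws are continuous, so a simulation of actual games would still need some rule for turning them into whole numbers of scoring shots.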

The graphic below shows the distribution of scoring shots for teams within the 9 buckets (the bars) overlaid with a lognormal distribution with the same mean as the actual data and with a standard deviation of 5.5 (the lines). Visually, the adequacy of the fit is apparent.

Okay, so now we're ready to start using our model to draw some conclusions in the next blog.

To summarise, here's what we have:

(1) Team scoring can be modelled by the Score Equation

Score = Number of Scoring Shots x Conversion Rate x 6 + Number of Scoring Shots x (1 - Conversion Rate)

(2) A team's number of scoring shots can be modelled by a lognormal distribution with mean given by

Expected Number of Scoring Shots = 18.7683 + 13.3395 x Team Probability + 0.7381 x Home Ground Status

and with a standard deviation of 5.5.

(3) Teams will convert the scoring shots they produce into goals as if drawing from a binomial distribution with probability of success equal to 53.64%.

Easy really - should have thought of it sooner.
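For readers who want to experiment, the three ingredients summarised above can be combined into a small simulation. This is a minimal sketch rather than the method used for the analysis: the function names, the seed, the coding of Home Ground Status as 1 or 0, the rounding of the lognormal draw to a whole number of scoring shots, and the example of a home team with a 60% chance of victory are all assumptions.

```python
import math
import random
import statistics

CONVERSION_RATE = 0.5364  # probability that a scoring shot is a goal
SHOTS_SD = 5.5            # standard deviation of scoring shots


def expected_scoring_shots(team_probability, is_home):
    """Mean scoring shots from the equation in item (2)."""
    home = 1 if is_home else 0  # assumed coding of Home Ground Status
    return 18.7683 + 13.3395 * team_probability + 0.7381 * home


def simulate_score(team_probability, is_home, rng):
    """Simulate one team's score in one game."""
    mean = expected_scoring_shots(team_probability, is_home)
    # Log-scale parameters that give the required mean and SD
    sigma_sq = math.log(1 + (SHOTS_SD / mean) ** 2)
    mu = math.log(mean) - sigma_sq / 2
    # Assumption: round the continuous draw to a whole number of shots
    shots = round(rng.lognormvariate(mu, math.sqrt(sigma_sq)))
    # Binomial conversion: each shot is a goal with fixed probability
    goals = sum(1 for _ in range(shots) if rng.random() < CONVERSION_RATE)
    behinds = shots - goals
    return 6 * goals + behinds


rng = random.Random(2010)  # arbitrary seed so the run is repeatable
scores = [simulate_score(0.60, True, rng) for _ in range(20000)]
mean_score = statistics.mean(scores)
sd_score = statistics.pstdev(scores)
print(round(mean_score, 1), round(sd_score, 1))
```

The average simulated score should sit near the roughly 101.3 points that the Expected Score Equation gives for such a team, while the standard deviation gives a first feel for how much scores vary around that average.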

Sunday, April 25, 2010
