Monday, August 16, 2010

Letting the Computer Do (Most of) the Work

Around this time of year it's traditional to work through the remaining matches for each team and attempt to codify what each needs to do in order to secure a particular finish - minor premiership, top 4, top 8 or Spoon.

This year, rather than work through all the combinations manually, I've decided to be lazy - purely for instructional purposes, I should add - and enlist the help of rule induction, a mathematical technique for deducing from a dataset statements in the form If A and B then C that describe key variables in that data.

So, for example, if you were to apply the technique to help describe the use of heating and cooling appliances by a household over the course of a few years you might collect information several times each day about who was home, what the outside temperature was, what day of the week and time of day it was, and whether or not a heating or a cooling appliance was turned on.

Using a rule induction algorithm, you'd be able to come up with statements such as this one:
If Number of People Home is greater than 0 AND Outside Temperature is less than 15 degrees AND Time of Day is between 5:30pm and 11:30pm AND Day of Week is not Saturday or Sunday then Heating = ON (Probability 92%)

For this blog I provided a rule induction algorithm (the JRip Weka algorithm running in R, if you're curious) with the outputs from 10,000 of the simulations I used in my earlier blog, which included for each simulation:
  • The results of each of the remaining 16 games
  • The final ladder positions of each team if these were the actual results of each game
To simplify matters a little, and recognising that the main interest is not in exact ladder position finishes, I summarised each team's finishing position as either "1st","2nd to 4th","5th to 8th","9th to 15th", or "16th".

The goal was that the rule induction algorithm would output rules of the form:
If X beats Y AND X beats Z AND ... then X finishes 5th to 8th

Rule induction worked remarkably well. Here are a few real examples of the rules that the algorithm offered up for Collingwood's fate:

Rule 1: (Collingwood..v..Adelaide <= 0) and (Hawthorn..v..Collingwood >= 0) and (Carlton..v..Geelong <= 0) => Collingwood = 2nd to 4th (168.0/2.0)

Rule 2: => Collingwood =1st (9832.0/0.0)

Rule 1 can be interpreted as follows:
If Collingwood loses to or draws with Adelaide (ie the margin in that game, couched in terms of Collingwood is less than or equal to zero) AND Collingwood loses to or draws with Hawthorn AND Geelong beats or draws with Carlton then Collingwood finish 2nd to 4th.

What's implicit here is that Geelong also beats West Coast but since, in the simulations, this always occurred when the other conditions in the rule were met, the algorithm didn't realise that this was an additional required condition.

As well, Collingwood can't be allowed to draw both its games otherwise Geelong can't overhaul them. Again, this situation didn't occur in the simulations I provided the algorithm, and not even the smartest algorithm can intuit instances that it's never seen.

I could probably have fixed both of these shortcomings by providing the algorithm with more than 10,000 simulations, though I'd pay a price in terms of computation time. Note though the (168.0 / 2.0) annotation at the end of this rule. That tells you that the rule could be applied to 168 of the simulations, but that it was wrong for 2 of them. Maybe the two simulations for which the rule applied but was incorrect included a Geelong loss to the Eagles or two draws for Collingwood.

Rule creation algorithms include what's called a "stopping rule" to prevent them from creating a unique rule for every simulation result, which might make the rules highly accurate but also makes them completely impractical.

Rule 2 is the "otherwise" rule and is interpreted as the predicted outcome if none of the earlier rules' full set of conditions are met. For Collingwood, "otherwise" is that they finish 1st.

The rules provided for other teams were generally quite similar, although they became more complex for teams when percentages were required to determine crucial ladder positions. Here, for example, are a few of the rules where the algorithm is attempting to model Hawthorn getting bumped into 9th by Melbourne:

(Hawthorn..v..Fremantle <= -7) and (Port.Adelaide..v..Melbourne
<
= -14) and (Melbourne..v..Kangaroos >= 20) and (Hawthorn..v..Collingwood
<
= -39) and (Melbourne..v..Kangaroos
>
= 44) =
>
Hawthorn = 9th to 15th (54.0/2.0)

(Hawthorn..v..Fremantle <= -4) and (Port.Adelaide..v..Melbourne
<
= -7) and (Melbourne..v..Kangaroos
>
= 11) and (Hawthorn..v..Collingwood
<
= -59) and (Port.Adelaide..v..Melbourne
<
= -32) =
>
Hawthorn = 9th to 15th (41.0/3.0)

Granted that's a mite convoluted, but nothing that a human can't recognise fairly quickly, which nicely illustrates my experience with this type of algorithm: their outputs almost always contain some useful insights but the extraction of this insight requires human interpretation.

What follows then are the rules that man and machine have crafted for each team (note that I've chosen to ignore the possibility of draws to reduce complexity)

Collingwood
Finish 2nd to 4th if Collingwood lose to Adelaide and Hawthorn AND Geelong beat Carlton and West Coast
Otherwise finish 1st

Geelong
Finish 1st if Collingwood lose to Adelaide and Hawthorn AND Geelong beat Carlton and West Coast
Otherwise finish 2nd to 4th

St Kilda
Finish 2nd to 4th

Western Bulldogs
Finish 5th to 8th if Dogs lose to Essendon and to Sydney AND Fremantle beat Hawthorn and Carlton
Otherwise finish 2nd to 4th

Fremantle
Finish 2nd to 4th if Dogs lose to Essendon and to Sydney AND Fremantle beat Hawthorn and Carlton
Otherwise finish 5th to 8th

Carlton and Sydney
Finish 5th to 8th

Hawthorn
Finish 9th to 15th if Hawthorn lose to Fremantle and Collingwood AND Roos beat West Coast and Melbourne
Also Finish 9th to 15th if Hawthorn lose to Fremantle and Collingwood AND Melbourne beat Port and Roos sufficient to raise Melbourne's percentage above Hawthorn's
Otherwise finish 5th to 8th

Kangaroos
Finish 5th to 8th if Hawthorn lose to Fremantle and Collingwood AND Roos beat West Coast and Melbourne
Otherwise finish 9th to 15th

Melbourne
Finish 5th to 8th if Hawthorn lose to Fremantle and Collingwood AND Melbourne beat Port and Roos sufficient to raise Melbourne's percentage above Hawthorn's
Otherwise finish 9th to 15th

Adelaide, Port Adelaide and Essendon
Finish 9th to 15th

Brisbane Lions
Finish 16th if Lions lose to Essendon and Sydney AND West Coast beat Geelong and Roos sufficient to lift West Coast's percentage above the Lions' AND Richmond beat St Kilda or Port (or both)
Otherwise finish 9th to 15th

Richmond
Finish 16th if West Coast beat Geelong and Roos AND Richmond lose to St Kilda and Port Otherwise finish 9th to 15th

West Coast
Finish 9th to 15th if West Coast beat Geelong and Roos AND Richmond lose to St Kilda and Port Finish 9th to 15th if West Coast beat Geelong and Roos AND Lions lose to Essendon and Sydney sufficient to lift West Coast's percentage above the Lions'
Otherwise finish 16th

As a final comment I'll note that the rules don't allow for the possibility of Sydney or Carlton slipping into 4th. Although this is mathematically possible, it's so unlikely that it didn't occur in the simulations provided to the algorithm. (Actually, it didn't occur in any of the 100,000 simulations from which the 10,000 were chosen either.)

A quick bit of probability shows why.

Consider what's needed for Sydney to finish fourth.
1. The Dogs lose to Essendon and Sydney
2. Sydney also beat the Lions
3. Fremantle don't win both their games

Furthermore, combined, Sydney and the Dogs' results have to close the percentage gap between the two teams, which currently stands at over 25 percentage points.

Now 1, 2 and 3 are independent results, so we can multiply their probabilities. I'd estimate 1 as having about a 15% probability, 2 about a 60% probability and 3 about an 80% probability, so our initial probability estimate is 15% x 60% x 80% which is about 7%.

But the 15% and 60% figures just relate to the probability of the required result, not the probability that the wins and losses will be big enough to lift Sydney's percentage above the Dogs'. If Sydney were to trounce the Lions by 100 points and Essendon were to do likewise to the Dogs, then Sydney would still need to beat the Dogs by about 91 points to achieve such a lift.

So let's revise the probability of 1 down to 0.01% (which is probably generous) and the probability of 2 down to 5% (which is also generous). Then the overall probability is 0.01% x 5% x 80%, or about 1 in 250,000. Not gonna happen.

(For similar reasons there are also no rules for Fremantle dropping a game but still grabbing 4th from the Dogs on the basis of a superior percentage.)

No comments: