August 7, 2012

Hardy: The man behind the Power Rankings Machine

The CFL bye weeks are upon us earlier than normal this year, which allows us to take a step back and assess the performance of “The Machine’s” CFL Power Rankings.

You may recall that The Machine made a triumphant debut last year: its mathematical formulation for predicting team power was incredibly accurate in 2011. And so far, The Machine looks to be doing even better in 2012! Read on for a quick overview of how The Machine determines its rankings, and a summary of results to this point in 2012.

How does The Machine determine the Power Rankings?

The Machine determines the rank of a team, in part, by using a statistical method called “regression analysis”.

What is Regression Analysis?

Regression analysis is a statistical technique that creates a mathematical equation that describes the relationship between two or more variables.  That sounds like a lot, so a simple example might help us understand it a bit better:

Let’s say you’re in the market to buy a new house, and are interested in the impact certain attributes have on the price of a house.  Let’s assume that you believe that house size and the number of bathrooms are two of the most important attributes.  Now, what you could do is select a sample of houses (let’s say you find 15) that are for sale in the area.  For each of these 15 houses you could identify the price, the size and the number of bathrooms.  Once all this information is compiled, you could then run a “regression” (using a computer software program – you would not want to do this on the back of an envelope!) which would provide you with a regression equation.

What is the Regression Equation?

The regression equation is one of the outputs of a regression analysis.  In our housing example above, let’s assume that our regression equation turned out to be:

House Price ($) = $50,000 + $175 x square feet + $25,000 x # of bathrooms

What this equation is saying is that, based on our sample of 15 houses, the price of a house is equal to $50,000 plus the size of the house, multiplied by $175, plus the number of bathrooms, multiplied by $25,000.
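To give a feel for what the software does behind the scenes, here is a minimal Python sketch of a regression fit. The 15 houses are made up, and their prices are deliberately built from the example equation so that we can check the fit recovers it. The helper names (solve_linear_system, fit_regression) are our own, and real statistical software uses more robust numerical methods than this.

```python
# Illustrative only: the 15 houses below are invented, and their prices are
# constructed to follow the example equation exactly (no random noise).

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so each row operation applies to both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_regression(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 is the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# (square feet, bathrooms) for 15 hypothetical houses
houses = [
    (1200, 1), (1500, 2), (1800, 2), (2000, 2), (2200, 3),
    (2500, 3), (1400, 1), (1600, 1), (3000, 4), (2800, 3),
    (1900, 1), (2100, 2), (2400, 2), (2600, 4), (1700, 3),
]
prices = [50_000 + 175 * sqft + 25_000 * baths for sqft, baths in houses]

intercept, per_sqft, per_bath = fit_regression(houses, prices)
print(round(intercept), round(per_sqft), round(per_bath))
```

With real listings the prices would not sit exactly on the equation, so the fitted numbers would only approximate the underlying relationship.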

Isn’t the Regression Equation just an estimate?  If so, what is the purpose of doing it?

Yes, regression equations are just estimates.  But these estimates can be quite helpful.  For example, in our housing example above, we might find a property we really like that is 2000 square feet, has two bathrooms and costs $500,000.  Is the property a good buy?

Well, based on our regression model, a 2000 square foot property with two bathrooms has an estimated price of $450,000 (that is, $50,000 + $175 x 2000 + $25,000 x 2). So, comparing this with the actual list price of $500,000 might prompt us to ask additional questions as to what factors are causing the list price of the house to be higher than our estimate.
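Plugging the example property into the regression equation is simple arithmetic. A minimal Python sketch (the function name is ours, purely for illustration):

```python
def estimate_house_price(square_feet, bathrooms):
    """Apply the example equation: $50,000 + $175 per sq ft + $25,000 per bathroom."""
    return 50_000 + 175 * square_feet + 25_000 * bathrooms


list_price = 500_000
estimate = estimate_house_price(2000, 2)

print(estimate)               # 450000
print(list_price - estimate)  # 50000
```

The gap between the list price and the estimate is the part the equation cannot account for, which is what should prompt the follow-up questions.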

No regression equation can ever account for all the different variables that exist.  Right?

Probably. But the great thing about performing a regression analysis is that, aside from generating the regression equation, we’re actually given information on the usefulness (or “predictability”) of the equation. This is given to us in the form of a percentage (it’s called “r-squared”), where a higher percentage indicates a higher level of predictability. For example, let’s assume that our regression analysis for our house example above provides us with an r-squared of 25%. What this means is that 25% of the variation in house prices across our sample is explained by the size of the house and the number of bathrooms. The remaining 75% is therefore due to other factors that we have not measured. As a statistician, you may choose to identify more variables (say, size of yard, location, age) to see if they increase the predictability of the equation.
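For the curious, r-squared is commonly computed as one minus the ratio of the unexplained variation to the total variation. Here is a minimal Python sketch of that standard definition. The three prices are made up, and the function name is ours.

```python
def r_squared(actual, estimated):
    """Share of the variation in `actual` that the estimates account for."""
    mean_actual = sum(actual) / len(actual)
    # Total variation: how far the actual values sit from their own average.
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    # Unexplained variation: how far the estimates miss the actual values.
    ss_residual = sum((a - e) ** 2 for a, e in zip(actual, estimated))
    return 1 - ss_residual / ss_total


# Hypothetical prices for three houses.
actual_prices = [300_000, 400_000, 500_000]

# Estimates that match every price exactly explain all of the variation.
print(r_squared(actual_prices, actual_prices))  # 1.0

# Guessing the average price for every house explains none of it.
print(r_squared(actual_prices, [400_000] * 3))  # 0.0
```

An r-squared of 25% sits between these two extremes: better than guessing the average every time, but with most of the variation left unexplained.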

Another great thing about regression analysis is that it also tells us whether an individual variable contributes to the predictability of the model.  In our house example, we might find that including the age of the house does nothing to improve the predictability of our equation.  If so, we would exclude it from any equation we use.

What does all of this have to do with CFL football?

Well, we can apply the concepts discussed above to the results of each CFL game.  That is, the score of a CFL game is a function of a whole bunch of things that happen during a game, much like the price of a house is a function of a whole bunch of things about that house.

Can you tell me about how The Machine determined the Power Rankings formula?

The regression equation was determined by analyzing CFL game results from 2002 to 2011.  All official CFL statistics were included in the analysis, from Penalty Yards, to Quarterback efficiency rating, to Field Goals made (and missed).  All of this data was input into a software program and, after a number of iterations, it was determined that the number of a team’s points scored in any given game can be estimated using the following formula:

2.99 x number of kickoffs, plus
0.125 x Quarterback Rating, plus
0.0246 x Rushing Yards, plus
0.0238 x Punt Return Yards, plus
0.467 x Kick-Off Returns, plus
8.38 x Time of Possession (%), less
1.54 x Turnover on Downs, less
0.537 x Sacks Taken, less
0.314 x Punts, less
0.562 x Fumbles, less
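As a rough sketch, the terms in the list above can be written as a small Python function. It uses only the ten terms printed in the list. The stat names are our own labels, the sample stat line is invented, and we are assuming that Time of Possession is entered as a fraction of the game (0.5 for 50%) rather than as a whole-number percentage.

```python
# Coefficients copied from the list above: "plus" terms are positive,
# "less" terms are negative. Key names are illustrative labels only.
COEFFICIENTS = {
    "kickoffs": 2.99,
    "quarterback_rating": 0.125,
    "rushing_yards": 0.0246,
    "punt_return_yards": 0.0238,
    "kickoff_returns": 0.467,
    "time_of_possession": 8.38,  # ASSUMPTION: a fraction, e.g. 0.5 for 50%
    "turnovers_on_downs": -1.54,
    "sacks_taken": -0.537,
    "punts": -0.314,
    "fumbles": -0.562,
}


def estimate_points(stats):
    """Weighted sum of one team's game statistics; missing stats count as 0."""
    return sum(coef * stats.get(name, 0) for name, coef in COEFFICIENTS.items())


# Hypothetical stat line, not taken from any real game.
example_stats = {
    "kickoffs": 6,
    "quarterback_rating": 95.0,
    "rushing_yards": 110,
    "punt_return_yards": 40,
    "kickoff_returns": 4,
    "time_of_possession": 0.52,
    "turnovers_on_downs": 1,
    "sacks_taken": 2,
    "punts": 6,
    "fumbles": 1,
}

print(round(estimate_points(example_stats), 1))
```

Each coefficient reads as the estimated change in points for one more unit of that statistic, holding the others fixed.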

How do we know if the formula above is any good?

Well, the first thing we can do is look at the r-squared statistic, as we discussed above. Before any 2012 games were played, we were reasonably confident that the Machine would be using a strong model, as the r-squared of the formula was 75%. In other words, the variables we’ve included above are able to explain 75% of the variation in the points a team scores in a game.

Why weren’t penalty yards included in the formula?

We’ve only included those variables shown – statistically – to have an impact on the model’s r-squared (that we discussed above).  Penalty yards do not have any meaningful statistical impact on the number of points a CFL team scores, based on our analysis.

Why don’t you continue to add variables until the model is 100% predictive?

We’ve added as many variables as we can.  The reason the Machine only has 75% predictability is that there are certain things that even math can’t explain.  That is, it’s quite likely that much of the 25% remaining is made up of the “human” element of the game that either we do not, or cannot, measure.

Enough already! How has the Machine done this year?

Using the formula above, we present the estimated scores alongside the actual scores for each game played so far. You’ll see that the Machine has been “perfect” in identifying the winner of each game, and has been particularly strong in highlighting how close some games have actually been (see game number 14, which was the overtime thriller between Calgary and Saskatchewan). We encourage you to play along with the Machine the rest of the year!


                 Visitor                  Home
Game #  Date     Club  Actual  Estimated  Club  Actual  Estimated
1       June 29  SSK   43      40.7       HAM   16      27.6
2       June 29  WPG   16      18.9       BC    33      38.3
3       June 30  TOR   15      19.5       EDM   19      23.9
4       July 1   MTL   10      12.9       CGY   38      33.7
5       July 6   WPG   30      35.1       MTL   41      46.0
6       July 6   HAM   36      34.4       BC    39      38.9
7       July 7   CGY   36      31.9       TOR   39      40.5
8       July 8   EDM   1       2.3        SSK   17      15.7
9       July 12  CGY   32      26.0       MTL   33      34.2
10      July 13  WPG   10      8.8        EDM   42      31.5
11      July 14  BC    20      26.2       SSK   23      27.4
12      July 14  TOR   27      25.6       HAM   36      26.7
13      July 18  WPG   22      22.7       TOR   25      27.9
14      July 19  SSK   38      33.1       CGY   41      33.7
15      July 20  EDM   27      26.6       BC    14      21.3
16      July 21  MTL   24      25.3       HAM   39      39.6
17      July 26  EDM   22      19.6       WPG   23      20.2
18      July 27  TOR   23      31.9       MTL   20      30.5
19      July 28  HAM   35      36.1       SSK   34      34.8
20      July 28  BC    34      41.7       CGY   8       13.7
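
If you would like to check the Machine’s work yourself, the short Python sketch below re-enters the 20 games from the table and counts how often the team with the higher estimated score was also the actual winner. The variable and function names are ours. It also prints the average gap between estimated and actual points as one simple measure of closeness.

```python
# Each row: (game, visitor, visitor actual, visitor estimated,
#            home, home actual, home estimated), copied from the table.
GAMES = [
    (1, "SSK", 43, 40.7, "HAM", 16, 27.6),
    (2, "WPG", 16, 18.9, "BC", 33, 38.3),
    (3, "TOR", 15, 19.5, "EDM", 19, 23.9),
    (4, "MTL", 10, 12.9, "CGY", 38, 33.7),
    (5, "WPG", 30, 35.1, "MTL", 41, 46.0),
    (6, "HAM", 36, 34.4, "BC", 39, 38.9),
    (7, "CGY", 36, 31.9, "TOR", 39, 40.5),
    (8, "EDM", 1, 2.3, "SSK", 17, 15.7),
    (9, "CGY", 32, 26.0, "MTL", 33, 34.2),
    (10, "WPG", 10, 8.8, "EDM", 42, 31.5),
    (11, "BC", 20, 26.2, "SSK", 23, 27.4),
    (12, "TOR", 27, 25.6, "HAM", 36, 26.7),
    (13, "WPG", 22, 22.7, "TOR", 25, 27.9),
    (14, "SSK", 38, 33.1, "CGY", 41, 33.7),
    (15, "EDM", 27, 26.6, "BC", 14, 21.3),
    (16, "MTL", 24, 25.3, "HAM", 39, 39.6),
    (17, "EDM", 22, 19.6, "WPG", 23, 20.2),
    (18, "TOR", 23, 31.9, "MTL", 20, 30.5),
    (19, "HAM", 35, 36.1, "SSK", 34, 34.8),
    (20, "BC", 34, 41.7, "CGY", 8, 13.7),
]


def winner(visitor, visitor_points, home, home_points):
    """Return the club with more points (no game in the table is tied)."""
    return visitor if visitor_points > home_points else home


correct = sum(
    winner(v, v_act, h, h_act) == winner(v, v_est, h, h_est)
    for _, v, v_act, v_est, h, h_act, h_est in GAMES
)
print(f"{correct} of {len(GAMES)} winners matched")  # 20 of 20 winners matched

# Average absolute gap between estimated and actual points, per team per game.
gaps = [abs(act - est) for g in GAMES for act, est in ((g[2], g[3]), (g[5], g[6]))]
print(round(sum(gaps) / len(gaps), 1))
```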