The CFL bye weeks are upon us earlier than normal this year, which allows us to take a step back and assess the performance of “The Machine’s” CFL Power Rankings.
You may recall that The Machine made a triumphant debut last year – its mathematical formulation for predicting team power was incredibly accurate in 2011. And so far, The Machine looks to be doing even better in 2012! Read on for a quick overview of how The Machine determines its rankings, and a summary of results to this point in 2012.
How does The Machine determine the Power Rankings?
The Machine determines the rank of a team, in part, by using a statistical method called “regression analysis”.
What is Regression Analysis?
Regression analysis is a statistical technique that creates a mathematical equation that describes the relationship between two or more variables. That sounds like a lot, so a simple example might help us understand it a bit better:
Let’s say you’re in the market to buy a new house, and are interested in the impact certain attributes have on the price of a house. Let’s assume that you believe that house size and the number of bathrooms are two of the most important attributes. Now, what you could do is select a sample of houses (let’s say you find 15) that are for sale in the area. For each of these 15 houses you could identify the price, the size and the number of bathrooms. Once all this information is compiled, you could then run a “regression” (using a computer software program – you would not want to do this on the back of an envelope!) which would provide you with a regression equation.
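To make the idea of "running a regression" a little more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the 15 houses are invented, their prices are constructed to follow one fixed pricing rule exactly (real listings would be noisy), and the helper names `solve` and `fit_linear_regression` are hypothetical. In practice you would hand this job to a statistics package rather than write it yourself.

```python
# Hypothetical sample of 15 houses: (square feet, bathrooms).
# Prices are generated from a fixed rule so the fit has a known answer.
sample = [(1000, 1), (1200, 1), (1400, 2), (1500, 1), (1600, 2),
          (1800, 2), (1900, 3), (2000, 2), (2100, 3), (2200, 2),
          (2400, 3), (2500, 4), (2600, 3), (2800, 3), (3000, 4)]
houses = [(sqft, baths, 50_000 + 175 * sqft + 25_000 * baths)
          for sqft, baths in sample]


def solve(a, b):
    """Solve the small linear system a * x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_linear_regression(rows):
    """Least-squares fit. rows are (x1, ..., xk, y); returns [intercept, b1, ..., bk]."""
    X = [[1.0, *row[:-1]] for row in rows]  # leading 1.0 gives the intercept
    y = [row[-1] for row in rows]
    k = len(X[0])
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


intercept, per_square_foot, per_bathroom = fit_linear_regression(houses)
print(round(intercept), round(per_square_foot, 2), round(per_bathroom))
```

Because the made-up prices follow the rule exactly, the fitted numbers should land very close to the ones used to build the data. With real listings the fit would only approximate the prices.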
What is the Regression Equation?
The regression equation is one of the outputs of a regression analysis. In our housing example above, let’s assume that our regression equation turned out to be:
House Price ($) = $50,000 + $175 x square feet + $25,000 x # of bathrooms
What this equation is saying is that, based on our sample of 15 houses, the price of a house is equal to $50,000 plus the size of the house, multiplied by $175, plus the number of bathrooms, multiplied by $25,000.
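As a quick illustration, the equation can be written as a small Python function. The function name and the example house (1,500 square feet, one bathroom) are made up for this sketch; only the three numbers in the formula come from the example equation.

```python
def estimated_house_price(square_feet, bathrooms):
    """Apply the example regression equation for house prices."""
    return 50_000 + 175 * square_feet + 25_000 * bathrooms


# A hypothetical 1,500 square foot house with one bathroom:
# 50,000 + (175 x 1,500) + (25,000 x 1)
print(estimated_house_price(1500, 1))  # 337500
```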
Isn’t the Regression Equation just an estimate? If so, what is the purpose of doing it?
Yes, regression equations are just estimates. But these estimates can be quite helpful. For example, in our housing example above, we might find a property we really like that is 2000 square feet, has two bathrooms and costs $500,000. Is the property a good buy?
Well, based on our regression model, a 2000 square foot property with two bathrooms has an estimated price of $450,000 ($50,000 + $175 x 2000 + $25,000 x 2). So, comparing this with the actual list price of $500,000 might prompt us to ask additional questions as to what factors are causing the list price of the house to be higher than our estimate.
No regression equation can ever account for all the different variables that exist. Right?
Probably. But the great thing about performing a regression analysis is that, aside from generating the regression equation, we’re actually given information on the usefulness (or “predictability”) of the equation. This is given to us in the form of a percentage (it’s called “rsquared”), where a higher percentage indicates a higher level of predictability. For example, let’s assume that our regression analysis for our house example above provides us with an rsquared of 25%. What this means is that 25% of the variation in house prices across our sample is explained by the size of the house and the number of bathrooms. The remaining 75% is therefore due to other factors that we have not measured. As a statistician, you may choose to identify more variables (say, size of yard, location, age) to see if they increase the predictability of the equation.
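For readers who like to see the arithmetic, here is a minimal sketch of how rsquared is commonly calculated: one minus the ratio of the squared estimation errors to the squared differences from the average. The function name and the sample prices below are invented for illustration.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that is explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    # Squared errors left over after using the regression estimates.
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared differences from simply guessing the average every time.
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


# Made-up list prices and made-up regression estimates for four houses.
actual_prices = [400_000, 450_000, 500_000, 550_000]
estimated_prices = [410_000, 440_000, 520_000, 530_000]
print(round(r_squared(actual_prices, estimated_prices), 2))
```

An rsquared of 1.0 (100%) would mean the estimates match the actual values exactly, while 0.0 would mean the equation does no better than guessing the average price for every house.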
Another great thing about regression analysis is that it also tells us whether an individual variable contributes to the predictability of the model. In our house example, we might find that including the age of the house does nothing to improve the predictability of our equation. If so, we would exclude it from any equation we use.
What does all of this have to do with CFL football?
Well, we can apply the concepts discussed above to the results of each CFL game. That is, the score of a CFL game is a function of a whole bunch of things that happen during a game, much like the price of a house is a function of a whole bunch of things about that house.
Can you tell me about how The Machine determined the Power Rankings formula?
The regression equation was determined by analyzing CFL game results from 2002 to 2011. All official CFL statistics were included in the analysis, from Penalty Yards, to Quarterback efficiency rating, to Field Goals made (and missed). All of this data was input into a software program and, after a number of iterations, it was determined that the number of a team’s points scored in any given game can be estimated using the following formula:
2.99 x number of kickoffs, plus
0.125 x Quarterback Rating, plus
0.0246 x Rushing Yards, plus
0.0238 x Punt Return Yards, plus
0.467 x KickOff Returns, plus
8.38 x Time of Possession (%), less
1.54 x Turnover on Downs, less
0.537 x Sacks Taken, less
0.314 x Punts, less
0.562 x Fumbles, less
4.46
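The formula above can be sketched as a short Python function. The coefficients are copied from the list; everything else is an assumption for illustration. In particular, the dictionary keys are hypothetical names, the sample stat line is invented, and Time of Possession is assumed to be entered as a fraction (0.5 for 50%), since the formula does not spell out its units.

```python
# Coefficients copied from the formula; a negative sign means "less".
COEFFICIENTS = {
    "kickoffs": 2.99,
    "quarterback_rating": 0.125,
    "rushing_yards": 0.0246,
    "punt_return_yards": 0.0238,
    "kickoff_returns": 0.467,
    "time_of_possession": 8.38,   # assumed to be a fraction, e.g. 0.5 for 50%
    "turnovers_on_downs": -1.54,
    "sacks_taken": -0.537,
    "punts": -0.314,
    "fumbles": -0.562,
}
CONSTANT = -4.46


def estimated_points(stats):
    """Estimate a team's points in one game from its box-score statistics."""
    return CONSTANT + sum(coef * stats[name] for name, coef in COEFFICIENTS.items())


# An invented stat line for one team in one game (not real CFL data).
example_stats = {
    "kickoffs": 6,
    "quarterback_rating": 95.0,
    "rushing_yards": 110,
    "punt_return_yards": 45,
    "kickoff_returns": 4,
    "time_of_possession": 0.52,
    "turnovers_on_downs": 1,
    "sacks_taken": 2,
    "punts": 6,
    "fumbles": 1,
}
print(round(estimated_points(example_stats), 1))
```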
How do we know if the formula above is any good?
Well, the first thing we can do is look at the rsquared statistic, as we discussed above. Before any 2012 games were played, we were reasonably confident that the Machine would be using a strong model, as the rsquared of the formula was 75%. In other words, the variables we’ve included above account for 75% of the game-to-game variation in the number of points a team scores.
Why weren’t penalty yards included in the formula?
We’ve only included those variables shown – statistically – to have an impact on the model’s rsquared (that we discussed above). Penalty yards do not have any meaningful statistical impact on the number of points a CFL team scores, based on our analysis.
Why don’t you continue to add variables until the model is 100% predictive?
We’ve added as many variables as we can. The reason the Machine only has 75% predictability is that there are certain things that even math can’t explain. That is, it’s quite likely that much of the 25% remaining is made up of the “human” element of the game that either we do not, or cannot, measure.
Enough already! How has the Machine done this year?
Using the formula above, we present the estimated scores alongside the actual scores for each game played so far. You’ll see that the Machine has been “perfect” in predicting the correct winners of each game, and has been particularly strong in highlighting how close some games have actually been (see game number 14, which was the overtime thriller between Calgary and Saskatchewan). We encourage you to play along with the Machine the rest of the year!
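| Game # | Date | Visitor | Visitor Actual | Visitor Estimated | Home | Home Actual | Home Estimated |
|---|---|---|---|---|---|---|---|
| 1 | June 29 | SSK | 43 | 40.7 | HAM | 16 | 27.6 |
| 2 | June 29 | WPG | 16 | 18.9 | BC | 33 | 38.3 |
| 3 | June 30 | TOR | 15 | 19.5 | EDM | 19 | 23.9 |
| 4 | July 1 | MTL | 10 | 12.9 | CGY | 38 | 33.7 |
| 5 | July 6 | WPG | 30 | 35.1 | MTL | 41 | 46.0 |
| 6 | July 6 | HAM | 36 | 34.4 | BC | 39 | 38.9 |
| 7 | July 7 | CGY | 36 | 31.9 | TOR | 39 | 40.5 |
| 8 | July 8 | EDM | 1 | 2.3 | SSK | 17 | 15.7 |
| 9 | July 12 | CGY | 32 | 26.0 | MTL | 33 | 34.2 |
| 10 | July 13 | WPG | 10 | 8.8 | EDM | 42 | 31.5 |
| 11 | July 14 | BC | 20 | 26.2 | SSK | 23 | 27.4 |
| 12 | July 14 | TOR | 27 | 25.6 | HAM | 36 | 26.7 |
| 13 | July 18 | WPG | 22 | 22.7 | TOR | 25 | 27.9 |
| 14 | July 19 | SSK | 38 | 33.1 | CGY | 41 | 33.7 |
| 15 | July 20 | EDM | 27 | 26.6 | BC | 14 | 21.3 |
| 16 | July 21 | MTL | 24 | 25.3 | HAM | 39 | 39.6 |
| 17 | July 26 | EDM | 22 | 19.6 | WPG | 23 | 20.2 |
| 18 | July 27 | TOR | 23 | 31.9 | MTL | 20 | 30.5 |
| 19 | July 28 | HAM | 35 | 36.1 | SSK | 34 | 34.8 |
| 20 | July 28 | BC | 34 | 41.7 | CGY | 8 | 13.7 |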
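If you would like to check the results above for yourself, the sketch below counts how often the estimated scores pick the same winner as the actual scores, and how far off the estimates are on average. The scores are copied from the results listed above; the variable names and the choice of average absolute difference as the yardstick are assumptions made for this illustration, not part of The Machine's method.

```python
# (game, visitor, visitor actual, visitor estimated, home, home actual, home estimated)
games = [
    (1, "SSK", 43, 40.7, "HAM", 16, 27.6), (2, "WPG", 16, 18.9, "BC", 33, 38.3),
    (3, "TOR", 15, 19.5, "EDM", 19, 23.9), (4, "MTL", 10, 12.9, "CGY", 38, 33.7),
    (5, "WPG", 30, 35.1, "MTL", 41, 46.0), (6, "HAM", 36, 34.4, "BC", 39, 38.9),
    (7, "CGY", 36, 31.9, "TOR", 39, 40.5), (8, "EDM", 1, 2.3, "SSK", 17, 15.7),
    (9, "CGY", 32, 26.0, "MTL", 33, 34.2), (10, "WPG", 10, 8.8, "EDM", 42, 31.5),
    (11, "BC", 20, 26.2, "SSK", 23, 27.4), (12, "TOR", 27, 25.6, "HAM", 36, 26.7),
    (13, "WPG", 22, 22.7, "TOR", 25, 27.9), (14, "SSK", 38, 33.1, "CGY", 41, 33.7),
    (15, "EDM", 27, 26.6, "BC", 14, 21.3), (16, "MTL", 24, 25.3, "HAM", 39, 39.6),
    (17, "EDM", 22, 19.6, "WPG", 23, 20.2), (18, "TOR", 23, 31.9, "MTL", 20, 30.5),
    (19, "HAM", 35, 36.1, "SSK", 34, 34.8), (20, "BC", 34, 41.7, "CGY", 8, 13.7),
]

correct_winners = 0
absolute_errors = []
for _, _, v_actual, v_est, _, h_actual, h_est in games:
    # The estimate "picks" the visitor when the visitor's estimated score is higher.
    if (v_actual > h_actual) == (v_est > h_est):
        correct_winners += 1
    absolute_errors.append(abs(v_actual - v_est))
    absolute_errors.append(abs(h_actual - h_est))

average_miss = sum(absolute_errors) / len(absolute_errors)
print(f"Correct winners: {correct_winners} of {len(games)}")
print(f"Average difference between estimated and actual score: {average_miss:.1f} points")
```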