## Odds Assessment on Football Matches

Student thesis: Master's thesis (including HD graduation project)

- Tobias Christensen
- Rasmus Dencker Hansen

4th semester, Computer Science, Master (Master's programme)

This thesis presents models for automatically assessing the probabilities of football match outcomes. The motivation is the enormous turnover in betting; the focus is therefore on creating a probability assessor that performs as well as, or better than, the bookmakers when assessing match outcomes.

Two datasets are used: a detailed one containing match events from the last three years and a results-only one covering the last six years. The detailed match data is quantified so that it consists of category counts, e.g. the number of shots on target. The models are based only on data available before a match starts, hence primarily data about previous matches.

The probability assessors are built using k-Nearest Neighbor, decision trees, regression on goals, and combinations of these produced by ensemble methods. To find the best settings for k-Nearest Neighbor, a wrapper approach was used to search for the best feature combination, and the best k value was found for each feature subset. Decision trees were modified to output probability assessments instead of classifications, and different lower bounds on leaf node sizes were tested. The regression on goals was done by assuming a Poisson distribution of goals and then estimating an offensive strength and a defensive weakness for each team by maximum likelihood. Ensemble methods were used to create random forests based on the decision tree algorithm and to combine other previously created probability assessors.
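
As a rough illustration of the goal-regression idea, the following minimal sketch turns two Poisson scoring rates into probabilities for home win, draw, and away win. It is not the thesis implementation: the function names, the multiplicative form `offence * defence` for a scoring rate, the independence of the two teams' goal counts, and the example strength values are all assumptions made for illustration, and the maximum likelihood fitting step is omitted.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def outcome_probabilities(home_rate, away_rate, max_goals=15):
    """Return (home win, draw, away win) probabilities.

    Assumes the two goal counts are independent Poisson variables
    (an assumption of this sketch). Scores above max_goals are ignored
    and the result is renormalised to sum to one.
    """
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    total = home_win + draw + away_win
    return home_win / total, draw / total, away_win / total


# Hypothetical strengths; in practice they would be estimated from results.
offence = {"A": 1.4, "B": 1.0}   # offensive strength per team
defence = {"A": 0.9, "B": 1.2}   # defensive weakness per team

# Assumed parametrisation: scoring rate = own offence * opponent's weakness.
home_rate = offence["A"] * defence["B"]
away_rate = offence["B"] * defence["A"]
probabilities = outcome_probabilities(home_rate, away_rate)
```

In this sketch a team with a higher scoring rate than its opponent receives a higher win probability, and equal rates give equal win probabilities for both sides.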

The evaluation of the probability assessors is twofold: first they are evaluated as decision support for bookmakers, and then they are evaluated for gambling purposes. The evaluation for bookmaking was made indirectly by applying the logarithmic and quadratic scoring rules to the probability assessments. In addition, a pairwise, domain-specific scoring rule was created, the Bookmaker-Gambler scoring rule, which lets two probability assessors compete by having them alternate between the roles of bookmaker and gambler. The evaluation for gambling was made by letting each probability assessor bet on outcomes where it expected a gain.
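
The two standard scoring rules and the expected-gain betting criterion can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, the quadratic rule is written in one common form (the thesis may use a different normalisation), decimal odds are assumed, and the Bookmaker-Gambler scoring rule is not shown because its definition is not given here.

```python
import math


def logarithmic_score(probs, actual):
    """Log of the probability assigned to the outcome that occurred."""
    return math.log(probs[actual])


def quadratic_score(probs, actual):
    """Quadratic rule in the form 2 * p_actual - sum of squared probabilities."""
    return 2 * probs[actual] - sum(p * p for p in probs)


def value_bets(probs, decimal_odds):
    """Indices of outcomes with positive expected gain per unit stake.

    With decimal odds o and assessed probability p, the expected gain
    of a unit bet is p * o - 1, so a bet is placed when p * o > 1.
    """
    return [i for i, (p, o) in enumerate(zip(probs, decimal_odds)) if p * o > 1.0]


# Hypothetical assessment for (home win, draw, away win) and bookmaker odds.
assessment = [0.5, 0.3, 0.2]
odds = [1.9, 3.6, 4.5]

log_score = logarithmic_score(assessment, actual=0)
quad_score = quadratic_score(assessment, actual=0)
bets = value_bets(assessment, odds)
```

Both scoring rules reward placing high probability on the outcome that actually occurs; the betting criterion only uses the assessor's own probabilities together with the offered odds.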

The results indicate that it is difficult to build models that are better than the bookmakers, but that it is possible to almost match them. The regression model is the best of the created probability assessors, and it performs almost as well as the bookmakers in most tests. The tests indicate that a much larger dataset is needed for the decision tree algorithm to perform well. The regression model is based only on goals, so there is room for further improvement by including other factors.

Language | English |
---|---|
Publication date | Jun. 2007 |