Euro 2020(1) — Chapter 4: QF Predictions

Soumya Basu
6 min read · Jul 2, 2021

Chapter 4: QF Predictions — Time to put my money where my mouth is!

We’re into the Quarter Finals and so far my Algo team hasn’t performed too badly. Sure, in some cases it may have wrongly predicted players to score highly (Burak Yilmaz and Nikola Vlasic come to mind), but then again, the data and the reasoning behind the whole experiment came with the caveat that I was translating club form onto an international stage. In other cases, I would argue that my model did predict the high-scoring players correctly (e.g. Gareth Bale in GW3 and Haris Seferovic in GW4, more on that in the next article), but introducing personal bias into the team selection prevented me from capitalising on the performance. Based on these, I have gained enough trust in the model, and so I’m taking the next step of predicting the outcome of each match.

Method to the madness:

With xG and shot% data available at team level, my first preference was a linear regression model to predict the number of goals scored by a team. That could have formed a basic model to build upon. However, I wanted to bring the ‘Team Difficulty Factor’ index into the calculation to get a better read on how a team is performing in the current Euros. Since the number of goals scored is discrete, it would be difficult to overlook the normality assumption that linear regression relies on (in my opinion there were also not many distinct variables from which to build an equation). Given the discrete nature of the response variable and the lack of features and samples (i.e., matches), I approached the prediction problem with a Generalised Linear Model from the Poisson family.
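To make the idea concrete, here is a minimal sketch of a Poisson regression with a log link, fitted by Newton's method using only the standard library. The data points, the single `opp_difficulty` covariate and the `fit_poisson` helper are all made up for illustration; this is not the model or the data used in this article.

```python
import math

# Made-up sample: a difficulty covariate (0 to 1) and goals scored in a match.
opp_difficulty = [0.1, 0.3, 0.5, 0.7, 0.9, 0.2, 0.6, 0.8]
goals = [0, 1, 1, 2, 3, 1, 2, 2]


def fit_poisson(x, y, max_iter=50, tol=1e-10):
    """Fit log(E[y]) = b0 + b1 * x by Newton's method (hypothetical helper)."""
    b0 = math.log(sum(y) / len(y))  # start from the log of the mean goal count
    b1 = 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Fisher information (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system directly.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


b0, b1 = fit_poisson(opp_difficulty, goals)
fitted = [math.exp(b0 + b1 * xi) for xi in opp_difficulty]
print(f"intercept={b0:.3f}, slope={b1:.3f}")
```

The fitted values are always positive and can be read as expected goal counts, which is the main reason a Poisson model suits this kind of target better than a plain straight-line fit.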

Data:

Similar to the Fantasy team experiment, I obtained the data for all matches played during this Euro from FootyStats (a subscription-based service). It contains xG (expected goals per team at match level), actual goals scored, shots on and off target and a host of related match data. To this I added my own ‘Team Difficulty Factor’ index (as derived for the Fantasy team) to get a baseline of match difficulty from the perspective of both teams.

Team Strength:

As with the earlier Team Difficulty Factor, I have gone with a rolling index, updated for the Round of 16 and for the current Quarter Finals based on performances in the group stage and the Round of 16 respectively. The Difficulty Factor is essentially made up of each team’s performances at various stages of the competition and in the lead-up to it, with each component assigned a weight according to its importance.

Fig: Group Stage Team Strength

For example, for the Group Stage Team Difficulty Factor I placed a 50% importance on the total valuation of the team in Fantasy League (with the underlying assumption that UEFA assigns valuations using an algorithm that is correlated with the team’s strength and chances), a 35% importance on the recently conducted UEFA Nations League, a 10% importance on the Euro Qualifiers (less importance because the qualifiers were played longer ago) and finally a 5% importance on Home advantage for teams playing in front of their own fans. For the Round of 16, these weights were updated to 40%, 10%, 8% and 2% for total valuation, Nations League performance, Euro Qualifiers and Home advantage respectively. In addition, I introduced a 40% importance factor for Group stage performance, which is calculated as a function of points earned in the group stage and the rank of the opponents faced (in line with the index calculation for Nations League & Qualifiers).

Fig: Round of 16 Team Strength

However, for the Quarter Finals, I further amended the scoring criteria and entirely discontinued factoring in the Nations League and Euro Qualifier performances as well as the home advantage. For the Quarters, the importance factors were updated to 40% each on Group Stage and Round of 16 performance, and 20% on Fantasy League valuation.
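The stage-by-stage weighting above boils down to a weighted sum. A rough sketch, with the weights taken from the text and everything else (the `difficulty_factor` helper, the component names and the example scores) made up for illustration:

```python
# Stage weights as described in the text; each set sums to 1.0.
STAGE_WEIGHTS = {
    "group": {
        "valuation": 0.50, "nations_league": 0.35,
        "qualifiers": 0.10, "home": 0.05,
    },
    "round_of_16": {
        "valuation": 0.40, "nations_league": 0.10, "qualifiers": 0.08,
        "home": 0.02, "group_stage": 0.40,
    },
    "quarter_final": {
        "group_stage": 0.40, "round_of_16": 0.40, "valuation": 0.20,
    },
}


def difficulty_factor(component_scores, stage):
    """Weighted sum of a team's component scores for the given stage."""
    weights = STAGE_WEIGHTS[stage]
    return sum(w * component_scores[name] for name, w in weights.items())


# Hypothetical component scores on a 0-1 scale for one team.
example_team = {"group_stage": 0.8, "round_of_16": 0.6, "valuation": 0.9}
qf_score = difficulty_factor(example_team, "quarter_final")
print(round(qf_score, 2))  # 0.4*0.8 + 0.4*0.6 + 0.2*0.9 = 0.74
```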

Once the indices were gathered, they were normalised using MinMax scaling to nullify the differences in scale and arrive at the final team strength. The complement of the opposition’s strength was then assigned to each team to factor in their respective chances of progressing.
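A small sketch of this step, assuming "complement" means one minus the opponent's scaled strength (that reading, the team names and the raw index values are all assumptions for illustration):

```python
def min_max_scale(values):
    """Rescale a dict of raw index values to the 0-1 range."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:  # avoid dividing by zero when all values are equal
        return {team: 0.0 for team in values}
    return {team: (v - lo) / (hi - lo) for team, v in values.items()}


# Hypothetical raw index values before scaling.
raw_index = {"Team A": 0.74, "Team B": 0.52, "Team C": 0.61, "Team D": 0.30}
strength = min_max_scale(raw_index)

# Hypothetical fixtures; each team gets the complement of its opponent's strength.
fixtures = [("Team A", "Team D"), ("Team B", "Team C")]
opp_difficulty = {}
for team_1, team_2 in fixtures:
    opp_difficulty[team_1] = 1.0 - strength[team_2]
    opp_difficulty[team_2] = 1.0 - strength[team_1]

print(strength)
print(opp_difficulty)
```

Under this reading, a team facing the weakest side gets a value of 1.0 and a team facing the strongest side gets 0.0.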

Fig: Quarter Finals Team Strength

Model:

Fig: GLM Model Summary using Poisson family

To fit the Generalised Linear Model with the Poisson family, the dependent variable, i.e. the goals scored, was passed into the model as a function of the team and the opposition difficulty. So essentially, what I expect the model to spit out is the expected average number of goals scored by a team when playing against an opponent with the difficulty rating passed to the model. A higher coefficient means the respective team is expected to score more goals on average. (With the usual log link of a Poisson GLM, the coefficient is not a straight-line slope; it shifts the log of the expected goals.) For example, Germany with a coefficient of 0.0679 is expected to score more on average than England (coefficient of -0.1960); in other words, Germany are better scorers than England on average. I’ve built the model only around the goals scored by each team and have not considered each team’s defensive capacity; I may look to incorporate that in the future.
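To see what those two coefficients imply, here is a rough worked example. It assumes a log link and an additive team effect; the team coefficients are the ones quoted above, while `INTERCEPT` and `DIFFICULTY_SLOPE` are placeholder values invented purely so the code runs.

```python
import math

# Team coefficients quoted in the text.
team_coef = {"Germany": 0.0679, "England": -0.1960}

# Placeholder values, NOT taken from the model summary.
INTERCEPT = 0.5
DIFFICULTY_SLOPE = 0.8


def expected_goals(team, opp_difficulty):
    """Expected goals under a log link: exp(linear predictor)."""
    linear = INTERCEPT + team_coef[team] + DIFFICULTY_SLOPE * opp_difficulty
    return math.exp(linear)


# Same opposition difficulty for both teams, so only the team effect differs.
ger = expected_goals("Germany", 0.5)
eng = expected_goals("England", 0.5)
ratio = ger / eng
print(f"Germany / England expected goals ratio: {ratio:.2f}")
```

The ratio does not depend on the placeholder intercept or slope: it equals exp(0.0679 + 0.1960), roughly 1.30. Under these assumptions, Germany would be expected to score about 30% more goals than England against the same opposition.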

Results:

Originally, I had planned on calculating the odds of each team winning and the odds of a draw, but since the model I settled on uses the opposition strength rating, it doesn’t per se calculate the head-to-head chances in a match.

One point to note here is that the following results show the expected number of goals scored by each team.
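An expected number of goals is an average, not a scoreline. As a hedged illustration, the sketch below spreads one prediction over possible goal counts using the Poisson probability mass function. The value 3.3 is Denmark's figure quoted in the takeaways; the `poisson_pmf` helper is just for illustration.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k goals when the expected count is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


expected = 3.3  # Denmark's predicted value, as quoted in the takeaways
probs = {k: poisson_pmf(k, expected) for k in range(8)}
for k, p in probs.items():
    print(f"P({k} goals) = {p:.3f}")

most_likely = max(probs, key=probs.get)
```

Even with a mean of 3.3, there is a meaningful chance of a team scoring one goal or fewer, so the predictions are best read as averages over many hypothetical replays of the match.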

Fig: Predictions

From the looks of it, my predictions do indeed show the Quarter Finals to be quite high scoring! If this comes anywhere close to being true, then it should be a treat to watch for all football fans.

Takeaways:

  • Calculate and check the goodness of fit for the model to better derive predicted outcomes.
  • Factor in goals conceded and build a separate model for expected goals against, then combine the two to arrive at a better scoreline prediction.
  • Lastly, yes, I am confident in the model for predicting expected goals scored by each team, and so I am putting my money where my mouth is and calling Denmark, England, Italy, and Spain to make up the Semi Final lineup (Denmark edging Czech Republic because the predictions shown were rounded to the nearest integer, whereas the unrounded values are Denmark xG = 3.3 and Czech Republic xG = 2.7).

Soumya Basu

Data Science student and wannabe Fantasy Football expert / hipster