I’ve decided to write a post on the Poisson distribution for football and how it can be used in R. I think this is a good starting point for familiarizing yourself with probability distributions in football.

The Poisson distribution is a discrete probability distribution which expresses the probability of a given number of events occurring in a fixed interval, e.g. the probability that Liverpool score 2 goals in a fixed period of 90 minutes. The probabilities are calculated using a rate parameter λ, which describes the number of goals that we would expect to occur in this fixed interval.

This is the probability mass function for the Poisson, which allows us to calculate the probability of exactly x events occurring in a fixed time interval, given the rate parameter λ:

$$P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \dots$$

If we are analysing a game, we can calculate the probability of a team scoring X goals, given their expected goals value λ. Substituting into this formula with λ = 1.5 expected goals, the probability that Liverpool score exactly 1 goal is 1.5 × e^(−1.5) ≈ 33%. Repeating this calculation for each number of goals from 0 up to about 10 gets us close to a full probability distribution.
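As a quick check of the arithmetic above, here is a minimal sketch in R using the built-in dpois function. The variable names (lambda, goals, probs) are just for illustration.

```r
# Expected goals for the team (the 1.5 used in the example above)
lambda <- 1.5

# Probability of exactly 1 goal, written out from the formula ...
p_formula <- lambda^1 * exp(-lambda) / factorial(1)

# ... and from R's built-in Poisson probability mass function
p_dpois <- dpois(1, lambda)

# Probabilities of 0 to 10 goals
goals <- 0:10
probs <- dpois(goals, lambda)

round(p_dpois, 3)   # about 0.335
sum(probs)          # very close to 1, so stopping at 10 goals loses almost nothing
```

The two ways of computing the probability agree, and the probabilities for 0 to 10 goals cover almost all of the distribution when λ is this small.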

We can also repeat this process for the opposing team. Once we have the probability that each team scores X goals, for X = 0, 1, 2, 3, …, we can calculate the probability of a win, draw and loss for both teams. If Bournemouth also accumulate 1.5 expected goals against Liverpool in the same game, the probability that the score ends 1-1 is 0.33 × 0.33 ≈ 11%. Note that this method assumes that the number of goals one team scores is independent of the number the other team scores; I’ll talk a little more about that at the end.
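For a single game, the whole calculation can be sketched with a matrix of scoreline probabilities. This is only an illustration of the idea under the independence assumption mentioned above; the names xg_lfc, xg_opp and score_probs are made up for the example.

```r
# Expected goals for each side (both 1.5 in the example above)
xg_lfc <- 1.5
xg_opp <- 1.5

goals <- 0:10

# Rows = Liverpool goals, columns = opposition goals.
# Multiplying the two sets of probabilities assumes the scores are independent.
score_probs <- outer(dpois(goals, xg_lfc), dpois(goals, xg_opp))
dimnames(score_probs) <- list(LFC = goals, Opp = goals)

p_draw <- sum(diag(score_probs))                     # 0-0, 1-1, 2-2, ...
p_win  <- sum(score_probs[lower.tri(score_probs)])   # Liverpool score more
p_loss <- sum(score_probs[upper.tri(score_probs)])   # opposition score more

round(score_probs["1", "1"], 3)   # probability of a 1-1 draw, about 0.112
c(win = p_win, draw = p_draw, loss = p_loss)
```

With equal expected goals the win and loss probabilities are identical, and the three outcomes sum to just under 1 because scorelines above 10 goals are left out.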

### R implementation

This is how we would go about implementing this method in R:

First, load in some expected goals values for a team. I’ve taken data from fbref on all Liverpool games in the Premier League this season. The code uses the dplyr, tidyr and ggplot2 packages (gather() comes from tidyr), so you will have to install them to run it. https://fbref.com/en/squads/822bd0ba/Liverpool

library(dplyr)

# tidyr provides gather(), which is used for the plot

library(tidyr)

library(ggplot2)

LFCMatches <- read.csv("LFCMatches.txt")

The function dpois(X, λ) performs the calculation explained above, giving the probability of scoring X goals given an expected value of λ. Try running the example above with X = 1 and λ = 1.5, i.e. dpois(1, 1.5), to produce ~33%. The following code takes the expected goals for and against Liverpool and calculates these probabilities for X = 0 to 10.

# Calculate probability of scoring 0 to 10 goals in each game (rows = goals, columns = games)

ProbFor <- as.data.frame(sapply(LFCMatches$xG, function(lambda) dpois(0:10, lambda)))

# Calculate probability of conceding 0 to 10 goals in each game (rows = goals, columns = games)

ProbAgainst <- as.data.frame(sapply(LFCMatches$xGA, function(lambda) dpois(0:10, lambda)))

Now we have two dataframes with the probabilities of scoring and conceding each number of goals (rows) in each of the 29 gameweeks (columns). We can calculate the probability of each draw scoreline simply by multiplying the corresponding entries:

# Calculate probability of drawing at each score in each match

DrawProbs <- ProbFor*ProbAgainst

Calculate the total probability of a draw in each game by summing the probabilities of 0-0, 1-1, 2-2, and so on:

#Calculate total probability of a draw in each match

DrawProb <- vector(mode = "numeric", length = 29)

for (i in 1:29){

DrawProb[i] <- sum(DrawProbs[,i])

}

To calculate the total probability of a win for Liverpool, we multiply every combination across the two dataframes where Liverpool’s score is greater than the opposition’s and add up the results. E.g. the probability that Liverpool score 2 multiplied by the probability that Bournemouth score 0 is the probability that Liverpool win 2-0. Doing this for every winning scoreline:

#Calculate win probability of each match

WinProb <- vector(mode = "numeric", length = 29)

for (j in 1:29){

for (i in 1:10){

WinProb[j] = WinProb[j] + sum(ProbFor[(i+1):11,j]*ProbAgainst[i,j])

}

}

Apply the same method to calculate the probability of a loss:

# Calculate loss probability of each match

LossProb <- vector(mode = "numeric", length = 29)

for (j in 1:29){

for (i in 1:10){

LossProb[j] = LossProb[j] + sum(ProbAgainst[(i+1):11,j]*ProbFor[i,j])

}

}
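The loops above can also be written as a small helper that builds the scoreline matrix for one match and reads the three outcomes off it. This is only a sketch of an equivalent approach, not a replacement for the code above: match_probs is a hypothetical helper name, and the xG values passed to it are made-up numbers, not taken from the fbref data.

```r
# Hypothetical helper: win/draw/loss probabilities for one match,
# truncating each team's score at max_goals (10, as in the loops above)
match_probs <- function(xg_for, xg_against, max_goals = 10) {
  goals <- 0:max_goals
  # Rows = goals scored, columns = goals conceded
  m <- outer(dpois(goals, xg_for), dpois(goals, xg_against))
  c(WinProb  = sum(m[lower.tri(m)]),   # scored more than conceded
    DrawProb = sum(diag(m)),           # scored the same as conceded
    LossProb = sum(m[upper.tri(m)]))   # conceded more than scored
}

# Made-up xG values for three matches, purely for illustration
xg  <- c(2.1, 1.3, 0.8)
xga <- c(0.7, 1.3, 1.9)

# One row per match, one column per outcome
probs <- as.data.frame(t(mapply(match_probs, xg, xga)))
probs

# Each row should sum to just under 1; the shortfall is the truncation at 10 goals
rowSums(probs)
```

Working with one matrix per match keeps the win, draw and loss sums next to each other, which makes it harder to mix up rows and columns.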

Now that we have the win, draw and loss probabilities for each game, we can gather these into one dataframe.

#Gather Win/Draw/Loss probabilities

MatchProbs <- as.data.frame(cbind(WinProb,DrawProb,LossProb))

MatchProbs["Gameweek"] <- 1:29

You should do a sense check on this dataframe by summing the win, draw and loss probabilities for each game to make sure that they sum to ~1 (slightly less than 1 is expected, since scorelines above 10 goals are ignored). We can then plot a quick summary of Liverpool’s season based on the results with ggplot2:

#Plot win probability by gameweek

MatchProbs %>%

gather(key, value, c(WinProb, DrawProb, LossProb)) %>%

ggplot(aes(Gameweek, value)) +

geom_col(aes(fill = factor(key, levels = c("LossProb", "DrawProb", "WinProb"))), width = 0.5) +

coord_flip() +

xlab("Gameweek") +

ylab("Win Probability of LFC by gameweek") +

ggtitle("LFC Win probability based on xG") +

theme_minimal() +

theme(plot.title = element_text(size = 14, hjust = 0.5, face = "bold")) +

scale_fill_manual(values = c("red", "grey", "darkgreen"), guide = guide_legend(reverse = TRUE), labels = c("Loss", "Draw", "Win")) +

theme(legend.title = element_blank()) +

theme(axis.title.x = element_text(size = 12)) +

theme(axis.title.y = element_text(size = 8)) +

theme(axis.text.y = element_text(size = 7)) +

scale_y_continuous(minor_breaks = seq(0, 1, 0.1), breaks = seq(0, 1, 0.1), limits = c(0, 1), expand = c(0, 0)) +

scale_x_continuous(breaks = seq(1, 29, 1))

Liverpool have only lost one game this season, but you could take a good guess at which one it was from this plot! Week 18 was a brilliant away performance against Leicester.

### Limitations

This type of method forms a basis for result prediction models. We can attempt to encode our beliefs into the rate parameter λ in a variety of ways to make predictions about future games. However, many of these models do not use the standard Poisson distribution, because it has some drawbacks in its application to football:

- The Poisson distribution tends to underestimate the probability of 0 goals in a match, as well as the probability of draws.
- The method assumes independence. The Poisson distribution assumes that events occur independently of one another, and multiplying the two teams’ probabilities additionally assumes that the number of goals scored by one team does not make any number of goals for the other team more or less likely. There are many games and teams that don’t seem to follow this pattern.
- The Poisson distribution also implies that the variance of X is equal to the rate parameter λ. In practice, it seems that the variance is often greater than λ (overdispersion).
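The equal mean and variance property in the last point is easy to check. The sketch below simulates Poisson goal counts (the sample size, seed and variable names are arbitrary choices for illustration) and computes the variance-to-mean ratio. Applied to real goal counts, a ratio well above 1 would be a rough sign of overdispersion.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

lambda <- 1.5
simulated_goals <- rpois(10000, lambda)  # simulated counts, not real match data

# For a Poisson variable the mean and variance both equal lambda,
# so this ratio should be close to 1
dispersion_ratio <- var(simulated_goals) / mean(simulated_goals)
dispersion_ratio
```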

These are important considerations if we are to draw conclusions from the results or build a model for prediction purposes. Many people have shown these assumptions to be somewhat unrealistic for football. A bivariate Poisson distribution with several adjustments is one common approach used to account for these problems, but I won’t attempt to go into those details in this post.

Despite these limitations, the Poisson distribution is a good starting point for building up an understanding of probability distributions and how they can be applied to football. The results will often provide a relatively good indication of match probabilities, but they shouldn’t be taken too literally given the limitations of single-game expected goals analysis. More complex models can then be introduced to address the limitations if required. I hope the R code and method can be of some use!