I’ve finally got around to creating my own expected goals model in R. There are plenty of good explanations out there for anyone who is not familiar with the concept, so in this post I’ll focus on comparing a few different methods that could be used to build one. The methods that I’ve used are:

- Logistic regression
- Random forests
- Bagging
- Boosting
- Support vector machines

Each method has pros and cons for a given problem. First, I compared the models in terms of classification, i.e. how well each one predicts the outcome of a shot. I then looked at the probabilities output by each model. The probability a model assigns to a shot being a goal is that shot’s expected goals value.

I’ve used the Messi data from Statsbomb to create these models. It is a great resource for anyone looking to experiment with football analytics. They also have a Statsbomb R package which makes it easy to pull data from different competitions and seasons.

## Method

### Pulling the data

The first task was to pull some shot data. The Messi data set contains data on Barcelona games in La Liga. I pulled all shots from the 2010/11 season to the 2018/19 season, which gave ~7500 shots after a little data cleaning. Most of these shots involve Barcelona players, with a high proportion coming from great players like Messi. This would probably cause generalisation problems if I were to use this model to predict shots by other teams. However, the methodology would be the same if a model was trained using different data.

### Selecting predictors

Next, I had to select the predictor variables for the model. There are a variety of methods to perform variable selection. I was more focused on getting a functional model before optimising the selection of predictor variables, which meant the procedure was a little ad hoc. Firstly, I used forward stepwise regression. This starts from a model with no predictors and, at each step, adds the variable that improves the model most, stopping when no remaining variable gives an improvement. The quantity being minimised is the AIC, which measures goodness of fit while penalising the number of parameters. I will look to improve this method in the future due to some of the limitations of stepwise regression, but it acted as a starting point to give me some idea of which variables had high importance. I also removed some predictors which were highly correlated. The process was iterative: I repeated variable selection several times after gauging how including different variables affected the results.
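As a rough illustration of the forward stepwise idea (in Python, and not the code used for this post), the sketch below greedily adds whichever variable lowers the AIC most and stops when nothing helps. The `toy_score` function and its log-likelihood gains are entirely made up; in practice `score` would fit a real model and return its AIC.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood

def forward_stepwise(candidates, score):
    """Greedy forward selection.

    `score(variables)` must return the AIC of a model fitted with that
    tuple of variables; the fitting itself is left to the caller.
    """
    selected = []
    best = score(tuple(selected))            # intercept-only model
    remaining = list(candidates)
    while remaining:
        # Score every one-variable extension of the current model.
        trials = {v: score(tuple(selected + [v])) for v in remaining}
        winner = min(trials, key=trials.get)
        if trials[winner] >= best:           # no addition lowers the AIC
            break
        selected.append(winner)
        remaining.remove(winner)
        best = trials[winner]
    return selected, best

# Hypothetical log-likelihood gains per variable, for illustration only.
GAIN = {"distance": 120.0, "angle": 40.0, "body_part": 3.0, "minute": 0.2}
BASE_LOGLIK = -2500.0

def toy_score(variables):
    loglik = BASE_LOGLIK + sum(GAIN[v] for v in variables)
    return aic(loglik, n_params=len(variables) + 1)  # +1 for the intercept

selected, final_aic = forward_stepwise(GAIN, toy_score)
print(selected, final_aic)
```

With these invented numbers, `minute` is left out because its tiny gain in fit does not cover the AIC penalty for an extra parameter.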

### Training the model

I took the resulting data set and divided it into training, validation and test sets with a 60:20:20 split. I fit the models on the training data and used the validation set to judge the performance of each method. This was done by 1) comparing the actual versus predicted classification of shots and 2) comparing how well the sum of expected goals values matched the sum of actual goals. I used this validation phase to perform model selection.
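A 60:20:20 split can be sketched as follows. This is a minimal Python illustration rather than the code used for the post; the function name and the list of integers standing in for shot records are assumptions.

```python
import random

def split_60_20_20(rows, seed):
    """Shuffle the rows and cut them into train/validation/test parts."""
    rng = random.Random(seed)      # seeded so a given split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 6 // 10          # integer arithmetic avoids float rounding
    n_val = n * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # whatever is left over
    return train, val, test

shots = list(range(7500))          # stand-in for ~7500 shot records
train, val, test = split_60_20_20(shots, seed=1)
print(len(train), len(val), len(test))  # 4500 1500 1500
```

Every shot lands in exactly one of the three sets, so the test set stays unseen until the final evaluation.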

I used the model selected during the validation phase to perform predictions on the test set. I repeated this procedure 100 times to account for random sampling.
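The repeated procedure can be sketched like this: on each iteration the data is reshuffled, every candidate is fitted on the training part, the one with the best validation accuracy is kept, and its test accuracy is recorded. This is an illustrative Python sketch, not the code behind the post. The synthetic shots and the two toy "models" (`fit_base_rate`, `fit_distance_bins`) are invented stand-ins for the real methods.

```python
import math
import random
from collections import Counter

def make_toy_shots(n, rng):
    """Synthetic shots: goal probability falls with distance (made-up curve)."""
    shots = []
    for _ in range(n):
        distance = rng.uniform(1.0, 35.0)
        p_goal = 1.0 / (1.0 + math.exp(0.25 * distance - 1.5))
        shots.append({"distance": distance, "goal": rng.random() < p_goal})
    return shots

def goal_rate(rows):
    return sum(r["goal"] for r in rows) / len(rows) if rows else 0.0

def fit_base_rate(train):
    """Toy model 1: every shot gets the overall training goal rate."""
    rate = goal_rate(train)
    return lambda shot: rate

def fit_distance_bins(train, cut=8.0):
    """Toy model 2: separate goal rates for near and far shots."""
    near = goal_rate([s for s in train if s["distance"] < cut])
    far = goal_rate([s for s in train if s["distance"] >= cut])
    return lambda shot: near if shot["distance"] < cut else far

def accuracy(model, shots):
    """Share of shots classified correctly with a 0.5 cut-off."""
    return sum((model(s) >= 0.5) == s["goal"] for s in shots) / len(shots)

def run_once(shots, fitters, rng):
    shuffled = shots[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    a, b = n * 6 // 10, n * 8 // 10            # 60:20:20 boundaries
    train, val, test = shuffled[:a], shuffled[a:b], shuffled[b:]
    models = {name: fit(train) for name, fit in fitters.items()}
    # Model selection uses the validation set only.
    best = max(models, key=lambda name: accuracy(models[name], val))
    return best, accuracy(models[best], test)

rng = random.Random(0)
shots = make_toy_shots(1000, rng)
fitters = {"base_rate": fit_base_rate, "distance_bins": fit_distance_bins}

wins = Counter()
test_accuracy = []
for _ in range(100):
    best, acc = run_once(shots, fitters, rng)
    wins[best] += 1
    test_accuracy.append(acc)

print(dict(wins), sum(test_accuracy) / len(test_accuracy))
```

Counting how often each candidate wins, and looking at the spread of test accuracy, gives a picture that does not depend on one lucky or unlucky split.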

## Results

### Comparing classification accuracy

The methods all performed very similarly in terms of how well they predicted the outcome of shots. Each method correctly predicted the outcome of 87-88% of shots on average.

The following graph shows the results of the 100 random samples on the validation data for each method. The solid lines show the mean classification accuracy across samples. The dashed lines show the classification accuracy for each method at each sample. There is a lot of overlap because of how similar the performance was across methods.

Both the mean and the standard deviation of classification accuracy were very similar across methods. Overall, random forests were chosen most often in the 100 samples as the “best” method. Random forests were chosen 37 times, bagging was chosen 33 times, logistic regression was chosen 24 times, boosting was chosen 4 times and support vector machines were chosen twice. The classification accuracy on the test set was recorded for the model which performed best each time.

This graph shows a boxplot of the classification accuracy on the test set for the best method chosen on each iteration.

This also showed that there wasn’t a large difference between methods for predicting the outcome of shots on unseen data. The main reason was the probability cut-off of 0.5. Across all methods, very few shots are given an expected goals value over 0.5, so almost every shot is classified as a miss and the methods rarely disagree. I experimented with finding better cut-offs using the ROC curve for each method, but the performance didn’t improve much. The methods were quite good at predicting misses, but not as good at predicting goals. I decided that the expected goals probabilities were a better way to test the models.
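A small example shows why the 0.5 cut-off hides differences between models. The two lists of expected goals values below are hypothetical, not output from the models in this post: the probabilities differ shot by shot, yet both models produce exactly the same classifications.

```python
def classify(xg_values, cutoff=0.5):
    """Label a shot as a goal when its xG reaches the cut-off."""
    return [xg >= cutoff for xg in xg_values]

def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical xG values from two models for the same ten shots.
model_a = [0.03, 0.08, 0.12, 0.05, 0.31, 0.44, 0.07, 0.62, 0.10, 0.02]
model_b = [0.05, 0.04, 0.20, 0.09, 0.25, 0.38, 0.11, 0.71, 0.06, 0.03]
goals = [False, False, False, False, True, False, False, True, False, False]

pred_a = classify(model_a)
pred_b = classify(model_b)

print(pred_a == pred_b)            # True: identical classifications
print(accuracy(pred_a, goals))     # 0.9
print(accuracy(pred_b, goals))     # 0.9
```

Both toy models call only one shot a goal and miss the goal scored from a 0.31 (or 0.25) chance, which mirrors the pattern of predicting misses well and goals poorly.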

### Comparing expected goals probabilities

Logistic regression, random forests and support vector machines performed very similarly in terms of expected goals. Bagging and boosting performed poorly, possibly because of the class imbalance in the data set (far more misses than goals), so I concentrated on the other three methods. The following graph shows the difference between expected and actual goals in the validation phase for 100 samples of ~1450 shots.

Logistic regression performed best in terms of expected goals. The average difference between actual and expected goals was 10.9 for logistic regression, 11.7 for support vector machines and 12.4 for random forests. The standard deviation of the difference between actual and expected goals was 8.4 for logistic regression, 9.2 for support vector machines and 9.1 for random forests.
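The comparison above boils down to summing the expected goals values in a sample, comparing that sum with the goals actually scored, and summarising the gap across samples. Below is a minimal Python sketch with made-up numbers. Taking the absolute difference is an assumption here, and `stdev` is the sample standard deviation.

```python
from statistics import mean, stdev

def xg_difference(xg_values, goals):
    """Absolute gap between summed xG and goals scored in one sample."""
    return abs(sum(xg_values) - sum(goals))

# Three hypothetical samples of (xG values, outcomes with 1 = goal).
samples = [
    ([0.10, 0.40, 0.05, 0.70], [0, 1, 0, 1]),   # xG sums to 1.25, 2 goals
    ([0.20, 0.15, 0.60, 0.30], [0, 0, 1, 0]),   # xG sums to 1.25, 1 goal
    ([0.05, 0.08, 0.50, 0.12], [0, 0, 0, 0]),   # xG sums to 0.75, 0 goals
]

diffs = [xg_difference(xg, goals) for xg, goals in samples]
print(diffs)                 # roughly [0.75, 0.25, 0.75]
print(mean(diffs), stdev(diffs))
```

Unlike classification accuracy, this measure uses the probabilities directly, so two models that agree on every classification can still be told apart.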

Logistic regression was closest to actual goals in 41 samples, random forests were closest in 34 samples and support vector machines were closest in 25 samples. The following graph shows the accuracy of the best method on the test set over the 100 samples. The average difference between logistic regression expected goals and actual goals was 3.4 on the test set.

### Comparison to Statsbomb expected goals

The logistic regression expected goals performed best, so I compared the model to the Statsbomb expected goals. I split the original data into a training and test set (70:30). I fit the model to the training data and compared the predictions on the test set to the Statsbomb expected goals. The following graph shows the difference between expected and actual goals for 100 samples of ~2200 shots.

My logistic regression model was closer to actual goals in 73 of the 100 samples. The average difference between actual and expected goals was 14.81 for the logistic regression model and 23 for the Statsbomb model. This test was a little unfair to Statsbomb because my model was trained on data which mostly contained shots from players like Messi. Their model consistently undershot actual goals for this data, but may be more generalisable to shots taken by other teams. I would need a broader data set to judge which model was actually “better”.

### Variable importance

Lastly, I had a look at the importance the random forest was giving to each variable included in the model. This graph ranks the importance of each variable from top to bottom. It also shows how much predictive power would be lost by dropping each variable from the model. The distance from goal predictor explains a huge amount of the predictive capability of expected goals models.
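One common way to estimate how much predictive power depends on a variable is permutation importance: shuffle that variable's values across shots and measure how far accuracy falls. This may or may not be the exact measure plotted above, since random forest packages offer several. The Python sketch below uses invented data and a toy model that only looks at distance.

```python
import random

def accuracy(model, rows):
    """Share of rows classified correctly with a 0.5 cut-off."""
    return sum((model(r) >= 0.5) == r["goal"] for r in rows) / len(rows)

def permutation_importance(model, rows, variables, rng):
    """Drop in accuracy when each variable's values are shuffled across rows."""
    baseline = accuracy(model, rows)
    drops = {}
    for var in variables:
        values = [r[var] for r in rows]
        rng.shuffle(values)
        # Copy each row, replacing only the permuted variable.
        permuted = [dict(r, **{var: v}) for r, v in zip(rows, values)]
        drops[var] = baseline - accuracy(model, permuted)
    return drops

rng = random.Random(42)

# Toy data: a shot is a goal exactly when it is taken within 8 metres.
rows = []
for _ in range(200):
    distance = rng.uniform(1.0, 35.0)
    rows.append({"distance": distance,
                 "minute": rng.randint(1, 90),
                 "goal": distance < 8.0})

# Toy model that ignores everything except distance.
def toy_model(shot):
    return 0.8 if shot["distance"] < 8.0 else 0.1

drops = permutation_importance(toy_model, rows, ["distance", "minute"], rng)
print(drops)
```

Shuffling `minute` changes nothing because the toy model never reads it, while shuffling `distance` destroys most of the model's accuracy.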

## Conclusion

There wasn’t much difference between methods for predicting the outcomes of shots. It may be more appropriate to judge the models based on expected goals instead. Logistic regression performed best for expected goals, but random forests and support vector machines performed similarly.

Thanks for reading and feel free to send on feedback.