I’ve developed a Shiny app in R which aims to model the similarity of player styles in Europe’s top 5 leagues. I’ve only looked at midfielders for now, but I may extend this to other positions. I’d recommend reading this piece before using the app to better understand the method and its limitations. The app is available at https://eoinobrien.shinyapps.io/factoranalysis/
There has been some great public work in this area recently, and principal components analysis (PCA) has become a common method for reducing a large number of variables to a smaller number of playing styles. The difference between PCA and factor analysis can be quite subtle, but I’ll build on some previous work to make the case for factor analysis here and explain why I think it should be the more natural choice.
I think of PCA first and foremost as a dimension reduction technique. We have a large number of correlated variables, and the aim is to map them to a lower-dimensional space with a set of linear combinations such that the resulting principal components are orthogonal. We are then left with a set of principal components that explain a given amount of variance from the original data set.
The aim of factor analysis is different. Factor analysis assumes that there is an unobservable construct (i.e. playing style) causing the correlations between the variables; PCA makes no such assumption. It also reduces the data set to a lower-dimensional space, but it seeks to attribute the “common” variance in the original data set to a set of factors and to separate this from the unique variance in each variable. When the unique variance is small, PCA and factor analysis will give similar results, but that doesn’t seem to be the case in my own analysis of playing styles.
Because of this, factor analysis can give us an idea of how much variance is actually attributable to playing styles, as distinct from the unique variance in our statistics. PCA would simply attempt to account for all of the variance in the statistics with a set of principal components. It may explain >95% of the variance in the original data set, but that doesn’t necessarily mean that >95% of the variance is accounted for by playing style.
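
To make the distinction concrete, here is the standard common factor model written out. The notation is my own shorthand rather than anything taken from the app, and it assumes standardised statistics and uncorrelated (orthogonal) factors:

```latex
% Assumed notation: x is the vector of standardised statistics for one player,
% f the vector of latent playing-style factors, \Lambda the loading matrix,
% and \varepsilon the part of each statistic that is unique to it.
\[
  x = \Lambda f + \varepsilon
\]
% With orthogonal, unit-variance factors, the variance of statistic i
% splits into a common part (communality) and a unique part (uniqueness):
\[
  \operatorname{Var}(x_i)
    = \underbrace{\sum_{j} \lambda_{ij}^{2}}_{\text{communality } h_i^{2}}
    + \underbrace{\psi_i}_{\text{uniqueness}}
    = 1
\]
```

PCA has no equivalent of the uniqueness term: every bit of variance has to be absorbed by some component.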
I’ve started out by analysing midfield playing styles in Europe’s top 5 leagues. The data comes from StatsBomb via FBref. Factor analysis identified 5 distinct playing styles in midfielders from 26 different statistics. They are shown below with their associated loadings for each factor. Selecting statistics that are representative of style as opposed to strength is as important as the method itself, so I’ve transformed many of the statistics from FBref into variables that I feel better capture playing style.
My interpretation of factors 1 to 3 looks something like: attacking threats, direct progressive midfielders, and all-round midfielders (?). The remaining two factors aren’t as clear to interpret, but look to be mainly related to low short passing under pressure.
Overall, the factors accounted for 62% of the variance contained in the original data set, split 25%, 12%, 10%, 9% and 6%. Factor one explained the largest chunk of variance by a significant margin and was related mostly to attacking statistics. This suggests to me that attack-minded styles for midfielders are more straightforward to identify from statistics than other roles.
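
For anyone wondering where a per-factor split like this comes from, here is a minimal sketch of the usual calculation: sum the squared loadings in each factor’s column and divide by the number of variables. It assumes standardised variables and orthogonal factors, and the loadings are made up purely for illustration; they are not the loadings from the app.

```python
# Hypothetical loadings: rows are statistics, columns are factors.
# These numbers are invented for illustration only.
loadings = [
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.6],
    [0.2, 0.5],
]


def variance_explained(loadings):
    """Proportion of total variance attributed to each factor.

    Assumes standardised variables (variance 1 each) and orthogonal
    factors, so the total variance equals the number of variables.
    """
    n_vars = len(loadings)
    n_factors = len(loadings[0])
    return [
        sum(row[j] ** 2 for row in loadings) / n_vars
        for j in range(n_factors)
    ]


proportions = variance_explained(loadings)
total = sum(proportions)

for j, p in enumerate(proportions, start=1):
    print(f"Factor {j}: {p:.1%}")
print(f"Total: {total:.1%}")
```

Whatever is left over after adding up the factors is variance that the playing styles don’t account for.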
This highlights the significant limitations of any quantitative analysis that tries to identify midfield playing styles from these types of statistics. Many people consider 50-60% of variance explained to be a minimum for factor analysis to be meaningful, so the significance of these methods for some positions may be questionable.
However, the results did seem somewhat reasonable on the eye test, and the method can probably be extended to attackers to produce more reliable results. I’ll probably continue to tweak the method because some of the results did seem a little off. As with any sort of modelling, it’s good to know what you don’t know. The factor analysis here gives an idea of how much variance is actually attributable to playing styles, while PCA may give a false sense of security when you see that >95% of the variance from the original data set has been captured.
The uniqueness (shown below) is the proportion of variance that is unique to each variable and not explained by playing styles.
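
As a rough sketch of how uniqueness relates to the loadings: for standardised variables and orthogonal factors, it is one minus the sum of a variable’s squared loadings (its communality). The statistic names and loadings below are hypothetical and only there to show the arithmetic.

```python
# Hypothetical statistics and loadings on two factors (illustration only,
# not values from the app).
loadings = {
    "stat_a": [0.8, 0.1],  # well explained by the factors
    "stat_b": [0.2, 0.1],  # mostly unique variance
}


def uniqueness(loadings):
    """Uniqueness per variable: 1 minus the sum of squared loadings.

    Assumes standardised variables and orthogonal factors.
    """
    return {
        name: 1.0 - sum(value ** 2 for value in row)
        for name, row in loadings.items()
    }


unique = uniqueness(loadings)
for name, u in unique.items():
    print(f"{name}: uniqueness {u:.2f}")
```

A value close to 1 means the factors say very little about that statistic.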
Many of the defence-related metrics (pressing, interceptions and tackles) weren’t well explained by the factors. Passing styles and the areas of the pitch where a player has possession were closely related to the latent factors.
The Shiny app calculates the factor scores for each player and finds the most similar players based on the smallest Euclidean distance. The output is shown for Pierre-Emile Højbjerg, whom Southampton will be looking to replace after his move to Spurs. Philip Billing from relegated Bournemouth is proposed as a similar match on playing style. This type of analysis could be useful in recruitment to narrow down a very large group of players to a smaller pool with similar styles. I wouldn’t fixate on any one name in the results, such as Billing here, but would instead consider the smaller group of players that the model highlights as potential matches.
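
The similarity search itself is simple enough to sketch in a few lines. This is not the app’s code (which is written in R); it is a minimal illustration in Python of ranking players by Euclidean distance between factor scores, using placeholder players and made-up scores.

```python
import math

# Hypothetical factor scores (one value per factor) for placeholder players.
factor_scores = {
    "Player A": [1.2, -0.3, 0.5],
    "Player B": [1.0, -0.1, 0.4],
    "Player C": [-0.8, 1.5, 0.0],
    "Player D": [0.9, -0.6, 1.1],
}


def most_similar(target, scores, n=2):
    """Return the n players closest to `target` by Euclidean distance."""
    target_vec = scores[target]
    distances = [
        (name, math.dist(target_vec, vec))
        for name, vec in scores.items()
        if name != target  # skip the player being compared
    ]
    distances.sort(key=lambda pair: pair[1])  # smallest distance first
    return distances[:n]


matches = most_similar("Player A", factor_scores)
for name, distance in matches:
    print(f"{name}: {distance:.3f}")
```

Returning a short ranked list rather than a single name fits the way I’d suggest using the output: as a pool of candidates, not one answer.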
The app is relatively basic for now, but I may develop it further and add other positions and features if there’s enough interest. I’m interested to hear people’s thoughts on the method, so feel free to drop me an email if you can see any improvements.