The summer transfer window is just around the corner and every football fan has a wish list of players that they’d love their club to sign. There is much debate around transfer fees and the valuation of players. The typical fan comes up with a ‘finger in the air’ type estimate of value, but they are in some way subconsciously factoring in numerous variables, assigning each one a weight and then coming up with a figure. The analytics community tend to lean more heavily on the numbers to inform their estimation. This led me to an interesting question. How well could we model the value of a player with statistics? If we place a lot of faith in these statistics to form judgments, shouldn’t they offer a decent indication of the value of a player? Of course, not everything that affects the value of a player is directly measurable and recorded in event data. There are many other features to consider: contract length, commercial considerations, injury history, etc. Nevertheless, I thought it was an interesting exercise to see how close we could get from widely used statistics.

The first step was to collect data from the top 5 European leagues from fbref going back as far as possible (17/18): https://fbref.com/en/comps/Big5/Big-5-European-Leagues-Stats . I scraped the web for the transfer fees of any player in the sample who moved club in this period and began building a random-forest-based model. The aim was to link a player’s transfer fee to their statistics in the year prior to the move. In the future, I’m also looking at weighting in data from further back in a player’s history.
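To illustrate the linking step, here is a minimal sketch with made-up data (the player names, column names and fee figures are all hypothetical, not my actual dataset): each transfer fee is joined to the player’s statistics from the season before the move.

```python
import pandas as pd

# Hypothetical sample: one row per player-season of fbref-style statistics
stats = pd.DataFrame({
    "player": ["A", "B", "C"],
    "season": ["18/19", "18/19", "19/20"],
    "npxg_per90": [0.55, 0.30, 0.42],
    "xa_per90": [0.21, 0.35, 0.15],
})

# Hypothetical transfer fees (millions of pounds) with the season of the move
fees = pd.DataFrame({
    "player": ["A", "C"],
    "transfer_season": ["19/20", "20/21"],
    "fee_m": [45.0, 25.0],
})

# Link each fee to the player's statistics in the year prior to the move
fees["prior_season"] = fees["transfer_season"].map({"19/20": "18/19", "20/21": "19/20"})
training = fees.merge(
    stats, left_on=["player", "prior_season"], right_on=["player", "season"]
)
```

The resulting `training` table (one row per transfer, fee plus prior-season statistics) is the kind of structure the model learns from.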

The importance of each statistic varies greatly by position. For example, expected goals is very important for forwards, but not so much for defenders. We tend to have more data on attackers that correlates with player quality, so it seemed likely that the model would work best for these players. This led me to initially focus on attackers. Splitting the modelling process by position also offered some really interesting insight that I’ll go into later.

After gathering a large range of statistics, I wanted to add more context. Within each position, there are many different roles. Roberto Firmino plays a different role to Pierre-Emerick Aubameyang, and the attributes that help them to excel in those roles are also different. I wanted the model to identify the attributes contributing to player value for different roles. I attempted to build this information into the model by using hierarchical clustering to identify player types. Broadly, I came up with two groups of attackers: creative types involved in build-up play versus goalscorers and more traditional strikers. I clustered on a large range of statistics, but the relationship it picked up was clear to see when plotting expected goals vs expected goals assisted by cluster:
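A minimal sketch of the clustering idea, using two of the statistics mentioned above and invented per-90 numbers for six imaginary attackers (the real clustering used a much larger range of statistics):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Hypothetical per-90 numbers for six attackers:
# columns are [non-penalty xG per 90, xA per 90]
X = np.array([
    [0.70, 0.10],  # penalty-box goalscorer profile
    [0.65, 0.12],
    [0.60, 0.08],
    [0.25, 0.40],  # creative / build-up attacker profile
    [0.20, 0.45],
    [0.30, 0.38],
])

# Standardise so neither statistic dominates the distance metric,
# then cut the hierarchy into two player types
labels = AgglomerativeClustering(n_clusters=2).fit_predict(
    StandardScaler().fit_transform(X)
)
```

The cluster labels can then be fed into the valuation model as an extra categorical predictor alongside the raw statistics.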

I also wanted to add more context at a team level. It seemed reasonable that a player’s team could be having a significant impact on the statistics of the player. Gabriel Jesus puts up incredible goalscoring numbers at Manchester City, but he’d probably struggle to replicate the same numbers at Aston Villa. I wanted the model to have some indication that the player’s team could be relevant for the valuation. I performed hierarchical clustering again on some team-related statistics to get a broad indicator of team styles and strengths. The relationships picked up here weren’t quite as clear, so it might take some tweaking, but three groups of teams emerged. Right now, the partition seems to be based on team strength more than anything else:

Finally, it was time to fit the model itself. I spent a lot of time playing around with predictor variables, tuning parameters and assessing results. So, after all that effort… did the model look useful? Short answer: yes and no.
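For a flavour of what the fitting and tuning step looks like, here is a generic sketch on synthetic stand-in data (the features, target and parameter grid are illustrative assumptions, not my actual setup):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Stand-in training set: ~200 transfers, five per-90 statistics,
# with the fee (in £m) driven mostly by the first two features
X = rng.random((200, 5))
y = 40 * X[:, 0] + 15 * X[:, 1] + rng.normal(0, 2, 200)

# Tune a couple of the forest's parameters by cross-validation
search = GridSearchCV(
    RandomForestRegressor(n_estimators=200, random_state=0),
    param_grid={"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
model = search.best_estimator_
```

Scoring on mean absolute error keeps the tuning objective in the same units as the fee, which makes the cross-validated error directly interpretable in millions of pounds.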

The results looked okay for middle-of-the-range players. The model had very few observations to learn from for expensive players, so it tended to be very hit and miss for great players. The sample size was ~200 because of data limitations, so it certainly isn’t at the point of being trustworthy for a large range of players *yet*. Here are some of the predicted valuations of players that transferred in previous years (figures in millions of pounds):

So, like I said: it looks okay for some groups of players, but still quite hit and miss for others. Raúl Jiménez valued at £32 million last summer? Looks relatively reasonable. Joelinton, Iwobi and Pépé all seem to throw up reasonable valuations as well. Eden Hazard valued at £28 million at the time of his transfer to Real Madrid? Hmmm… definitely more work to do. Of course, this isn’t the way to properly assess a model, but I’ll spare you the details of the more formal testing procedure. The model was giving relatively decent results, and I’m hoping it can get a better grasp on great players with a bigger sample. Stay with me for a second, because the model showed promise in a few other ways too.
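For the curious, the shape of a more formal check is simple to sketch: cross-validated out-of-sample error rather than eyeballing individual valuations. The data below is synthetic stand-in data, not my transfer sample:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

# Stand-in data in place of the real ~200-transfer sample
X = rng.random((200, 5))
y = 40 * X[:, 0] + 15 * X[:, 1] + rng.normal(0, 2, 200)

# Mean absolute error on held-out folds, in the same units as the fee (£m)
mae = -cross_val_score(
    RandomForestRegressor(n_estimators=200, random_state=0),
    X, y, cv=5, scoring="neg_mean_absolute_error",
).mean()
```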

One great feature of random forest based models is that you can easily extract the importance of predictor variables. So what statistics did the model think were most important in predicting the value of an attacker? This graph ranks the predictors by importance from top to bottom:

This is where I gained encouragement. The model thought that non-penalty expected goals per 90 was the most important statistic in predicting the value of an attacker, by a significant margin. This is exactly what I would have expected to be the most important statistic in advance. Some of the other important predictors were: xA per 90, progressive distance carried per 90, touches in the penalty area per 90 and progressive distance passed per 90. So why is all of that important? The features that the model thought were most important in predicting player value were generally the statistics that analysts find most useful to inform player judgments. The sample is too small to say anything definitive from this, but after just a few hundred observations, it was picking up many of the trends that I would have expected. I have seen little to no public work on trying to gauge the relative importance of different statistics by position or role. It was good to see that the model very quickly picked up on the traits that we give the most weight to in the analysis of forwards.
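Extracting this ranking is a one-liner once the forest is fitted. A self-contained sketch, using synthetic data constructed so that a non-penalty xG column drives the target most strongly (mimicking the pattern described above; the feature names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
features = ["npxg_per90", "xa_per90", "prog_carry_dist_per90",
            "box_touches_per90", "prog_pass_dist_per90"]

# Synthetic data where the first column (npxG) carries most of the signal
X = rng.random((300, len(features)))
y = 5 * X[:, 0] + 1 * X[:, 1] + rng.normal(0, 0.3, 300)

forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

# Rank predictors by impurity-based importance, highest first
ranking = pd.Series(
    forest.feature_importances_, index=features
).sort_values(ascending=False)
```

One caveat worth noting: impurity-based importances can be inflated for high-cardinality or correlated features, so a permutation-based check is a sensible complement before reading too much into the ordering.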

These results have the potential to be applied in a real-world setting. If a team is looking to recruit a player to fill a particular role, this framework could help in identifying which statistics should be given the most weight. The sample size needs to grow a lot more, but we can identify the statistics that are most associated with valuable players in a role and then prioritize those variables in the recruitment process. My results suggest this is already working to an extent for attackers from a few hundred transfers, so I think it has promise for other positions and roles too. Stay tuned for a follow-up where I’ll look at this for midfielders and defenders. I’ve never been sure how much faith to put in each statistic for these positions because they tend to be very dependent on systems. However, this could be a way to help if we can gain a large enough sample.

I always go back to the famous George Box quote when assessing a model that I’ve built: “All models are wrong, but some are useful.” I’m excited to see if this can grow into something useful as my sample grows and the process becomes more refined. At the very least, I think there may be potential to flag big mistakes or big bargains for particular types of players. It could also help to streamline the recruitment process by giving a better indication of the importance of different statistics for different roles. There are plenty of limitations and areas to improve at the moment, but the method has shown promise.

The whole process ended up giving a lot of insight that I didn’t anticipate. I think it’s safe to say that I haven’t cracked the transfer window just yet, but I may be onto something useful.