Soccer analysts have borrowed and created many ways of measuring the game, but they have yet to come up with the sort of unified theory that relates these metrics to each other. Points and positions can indeed be complicated to decompose, but there is a fairly straightforward way to break down goal difference. Doing so reveals which metrics fit together… and which don’t.
Goal difference – the difference between goals scored and conceded – is highly correlated with positions at the end of the season, and arguably a bit less idiosyncratic. It can also be broken down into four simple components. One of these components is mostly a measure of luck. The other three can be decomposed further into the influence of individual players, and two of these three regress to the mean. What is left is arguably the most useful predictor of final goal difference.
To start, let’s define a couple of terms. N will be the number of shots a team takes during a season, indexed by i, and M will be the number of shots it faces, indexed by j. Each shot has some probability of resulting in a goal, which we will call ExpG or expected goals.
There are many ways of calculating expected goals. I prefer to base them on the average likelihood of scoring at a given distance from goal, but others have used zones of the field, types of shots, and the player’s visual angle on the goal frame. Regardless of how we compute ExpG, the difference between the actual outcome of a shot (1 for a goal, 0 otherwise) and its ExpG will be the error term e, so that each shot’s outcome equals its ExpG plus e.
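As a rough illustration of the distance-based approach, the sketch below estimates ExpG as the share of past shots scored within each distance band. The function names, the 5-unit band width, and the shot list are all made up for illustration; a real model would be fit on far more shots and might smooth across bands.

```python
from collections import defaultdict

def fit_expg_by_distance(shots, bin_width=5.0):
    """Estimate ExpG per distance band from (distance, is_goal) pairs.

    Returns a dict mapping band index -> observed scoring rate in that band.
    """
    goals = defaultdict(int)
    attempts = defaultdict(int)
    for distance, is_goal in shots:
        band = int(distance // bin_width)
        attempts[band] += 1
        goals[band] += int(is_goal)
    return {band: goals[band] / attempts[band] for band in attempts}

def expg(distance, table, bin_width=5.0):
    """Look up ExpG for a shot; raises KeyError if the band has no history."""
    return table[int(distance // bin_width)]

# Hypothetical history: (distance from goal, 1 if scored else 0).
history = [
    (4, 1), (3, 0),                               # band 0: 1 goal in 2 shots
    (8, 0), (9, 1), (7, 0), (6, 0),               # band 1: 1 goal in 4 shots
    (12, 0), (14, 0), (11, 1), (13, 0), (10, 0),  # band 2: 1 goal in 5 shots
]
table = fit_expg_by_distance(history)

# Error term for one new shot from distance 8.5 that was scored.
e = 1 - expg(8.5, table)
```

With this toy table, a goal from distance 8.5 carries an ExpG of 0.25 and an error term of 0.75; a miss from the same spot would carry an error of -0.25.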
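Of course, some goals are own goals, which do not come from a team’s own shots. We will denote own goals in a team’s favor as OGF and own goals against it as OGA. Using this notation, we can state the number of goals for, or GF, this way:

$$\mathrm{GF} = \mathrm{OGF} + \sum_{i=1}^{N} \left(\mathrm{ExpG}_i + e_i\right) \tag{1}$$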
Similarly, the number of goals against, or GA, is this:

$$\mathrm{GA} = \mathrm{OGA} + \sum_{j=1}^{M} \left(\mathrm{ExpG}_j + e_j\right) \tag{2}$$
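Combining the two terms, goal difference, or GD, can be stated like this:

$$\mathrm{GD} = \mathrm{OGF} - \mathrm{OGA} + \sum_{i=1}^{N} \left(\mathrm{ExpG}_i + e_i\right) - \sum_{j=1}^{M} \left(\mathrm{ExpG}_j + e_j\right) \tag{3}$$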
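We can simplify this a bit by calling own goal difference (OGF minus OGA) OGD and expected goal difference (the sum of ExpG over shots taken minus the sum over shots faced) ExpGD:

$$\mathrm{GD} = \mathrm{OGD} + \mathrm{ExpGD} + \sum_{i=1}^{N} e_i - \sum_{j=1}^{M} e_j \tag{4}$$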
And we can simplify further by replacing the individual error terms for each shot with an average error term. In other words, if a team’s shots went in 2 percentage points more often than their ExpG implied, then its average error for shooting, which we’ll call eSh, would be 0.02. Similarly, if the shots a team faced went in 2 percentage points more often than expected, its average error for conceding, or eCo, would be 0.02. Here is how the notation looks in our equation:

$$\mathrm{GD} = \mathrm{OGD} + \mathrm{ExpGD} + N \cdot e_{Sh} - M \cdot e_{Co} \tag{5}$$
Now, let’s say eCo is -0.03 – in other words, the team conceded 0.03 fewer goals per shot than expected, on average. This means that its probability of stopping a shot (where stopping is interpreted broadly as anything not leading to a goal) was 3 percentage points higher than expected. So we can replace the negative eCo term with a positive eSt term, for the average stopping error, where eSt = -eCo:

$$\mathrm{GD} = \mathrm{OGD} + \mathrm{ExpGD} + N \cdot e_{Sh} + M \cdot e_{St} \tag{6}$$
In my current thinking, this is the main equation that matters in soccer. To interpret the equation, it helps to consider each term in turn. OGD is pretty clearly a product of luck, or at least factors that today’s data fail to explain. A team may be able to affect N and M through its tactics, but there is plenty of evidence that eSh and eSt regress to the mean – which is zero – within a season and between seasons. ExpGD, on the other hand, is fairly predictive and consistent.
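To make the four terms concrete, here is a minimal sketch that computes them from shot-level data and checks that they add back up to goal difference. It assumes each shot is recorded as an (ExpG, scored) pair; the function name, the dictionary keys, and the sample season are hypothetical.

```python
def decompose_goal_difference(shots_for, shots_against, ogf=0, oga=0):
    """Split goal difference into OGD, ExpGD, N*eSh and M*eSt.

    shots_for / shots_against: lists of (expg, is_goal) pairs, each with at
    least one shot. ogf / oga: own goals in the team's favor / against it.
    """
    n, m = len(shots_for), len(shots_against)
    expgd = sum(x for x, _ in shots_for) - sum(x for x, _ in shots_against)
    e_sh = sum(g - x for x, g in shots_for) / n       # average shooting error
    e_co = sum(g - x for x, g in shots_against) / m   # average conceding error
    e_st = -e_co                                      # average stopping error
    ogd = ogf - oga
    return {
        "OGD": ogd, "ExpGD": expgd, "N": n, "M": m,
        "eSh": e_sh, "eSt": e_st,
        "GD": ogd + expgd + n * e_sh + m * e_st,
    }

# Made-up mini-season: 4 shots taken, 3 faced, one own goal in the team's favor.
shots_for = [(0.5, 1), (0.25, 0), (0.25, 1), (0.125, 0)]
shots_against = [(0.5, 0), (0.25, 1), (0.25, 0)]
terms = decompose_goal_difference(shots_for, shots_against, ogf=1, oga=0)

# Goal difference counted directly from the scoreboard, for comparison.
actual_gd = (sum(g for _, g in shots_for) + 1) - (sum(g for _, g in shots_against) + 0)
```

In this toy case the scoreboard shows 3 goals for and 1 against. The decomposition assigns 1 goal to OGD, 0.125 to ExpGD, 0.875 to shooting error, and nothing to stopping error, which sums to the same goal difference of 2.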
Breaking down equation (6) allows me to put a cash value on players. I only need to consider their contributions to the four terms above – exactly how many of them is a matter of judgment – and then combine the results with the value of goals scored and conceded by a given team. This method gives the same player a different value (within a confidence interval) for every team in the league, which I believe to be entirely appropriate.
Equation (6) can also be restated in terms of other popular metrics, but doing so shows how messy the interpretation of those metrics can be. For an example, take PDO and TSR. The first metric is essentially a team’s shooting percentage added to its stopping percentage. If we assign the league averages for shooting and stopping percentage to the terms LSh and LSt, we can state a broad version of PDO as eSh + LSh + eSt + LSt. Total shots ratio, or TSR, is a team’s own shots divided by the sum of its shots and shots by the opposition, or N divided by N + M. We can incorporate both of these metrics into the equation above, noting that the sum of LSh and LSt is constrained to be 1. One way to write the result is:

$$\mathrm{GD} = \mathrm{OGD} + \mathrm{ExpGD} + \frac{N}{\mathrm{TSR}} \left(\mathrm{PDO} - 1\right) - N \cdot e_{St} - M \cdot e_{Sh} \tag{7}$$
Bizarrely, GD seems to fall when TSR rises. Moreover, the interpretation of the last two terms is a bit difficult; why should higher shooting and stopping errors lead to lower GD? But this isn’t really the case, since N and M are components of TSR and eSh and eSt are components of PDO. The problem is that TSR and PDO don’t enter the equation cleanly on their own. You can try to insert them any way you like, but you’ll always end up with some of those other terms kicking around.
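A quick numerical check shows how the leftover terms arise. The sketch below assumes one particular rearrangement, N·eSh + M·eSt = (N / TSR)(PDO - 1) - N·eSt - M·eSh, which follows from the definitions of PDO and TSR above with LSh + LSt = 1; other algebraically equivalent forms are possible. The function names and input values are invented for illustration.

```python
def luck_terms_direct(n, m, e_sh, e_st):
    """The N*eSh + M*eSt part of goal difference, computed directly."""
    return n * e_sh + m * e_st

def luck_terms_via_pdo_tsr(n, m, e_sh, e_st, l_sh):
    """The same quantity routed through PDO and TSR (assumed rearrangement)."""
    l_st = 1.0 - l_sh                     # league shooting + stopping pct = 1
    pdo = (e_sh + l_sh) + (e_st + l_st)   # team shooting pct + stopping pct
    tsr = n / (n + m)                     # total shots ratio
    # (N / TSR) * (PDO - 1) equals (N + M) * (eSh + eSt), which overcounts by
    # the cross terms N*eSt and M*eSh, so those have to be subtracted.
    return (n / tsr) * (pdo - 1.0) - n * e_st - m * e_sh

# Hypothetical team: 400 shots taken, 500 faced, league shooting pct of 10%.
direct = luck_terms_direct(400, 500, 0.01, -0.02)
via = luck_terms_via_pdo_tsr(400, 500, 0.01, -0.02, l_sh=0.10)
```

Both routes give about -6 goals for these inputs. The PDO and TSR term alone contributes about -9, and the correction terms add back roughly 3, which is why neither metric can be read off the equation on its own.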
I dislike both TSR and PDO because neither can be part of a simple decomposition like the one in equation (6). They may be correlated with goal difference and have other interesting properties, but they are not basic parts of any equation I know that explains GD completely. I see them as approximations that work in some circumstances, rather than covering the entire universe – a bit like Newtonian physics versus Einsteinian physics.
All the terms in equation (6) can be measured, and all of them – perhaps excepting OGD – can be decomposed to show the influence of individual players. If one term is luck and two others regress to the mean, then ExpGD should be the most reliable explainer of goal difference. And indeed, it is more predictive of goal difference as a season progresses than TSR, shots on target difference, and other popular metrics. Until a better version of equation (6) comes along, why use anything else?