Soccer analysts have borrowed and created many ways of measuring the game, but they have yet to come up with the sort of unified theory that relates these metrics to each other. Points and positions can indeed be complicated to decompose, but there is a fairly straightforward way to break down goal difference. Doing so reveals which metrics fit together… and which don’t.

Goal difference – the difference between goals scored and conceded – is highly correlated with position at the end of the season, and arguably a bit less idiosyncratic than points. It can also be broken down into four simple components. One of these components is mostly a measure of luck. The other three can be decomposed further into the influence of individual players, and two of these three regress to the mean. What is left is arguably the most useful predictor of final goal difference.

To start, let’s define a couple of terms. *N* will be the number of shots a team takes during a season, indexed by *i*, and *M* will be the number of shots it faces, indexed by *j*. Each shot has some probability of resulting in a goal, which we will call *ExpG* or expected goals.

There are many ways of calculating expected goals. I prefer to base them on the average likelihood of scoring at a given distance from goal, but others have used zones of the field, types of shots, and the player’s visual angle on the goal frame. Regardless of how we compute *ExpG*, the difference between *ExpG* and the actual number of goals scored (0 or 1 per shot) will be the error term *e*.
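As a concrete illustration of a distance-based approach, here is a toy model in that spirit. The logistic form and its coefficients are invented placeholders for this sketch, not a fitted model:

```python
import math

def exp_g(distance_m: float) -> float:
    """Hypothetical distance-only expected-goals model: the chance of
    scoring decays logistically with distance from goal. The coefficients
    are illustrative placeholders, not estimates from real shot data."""
    return 1.0 / (1.0 + math.exp(0.18 * (distance_m - 11.0)))

def shot_error(goal: int, distance_m: float) -> float:
    """Per-shot error e: actual outcome (0 or 1) minus ExpG."""
    return goal - exp_g(distance_m)

print(exp_g(11.0))           # 0.5 at penalty distance, by construction
print(shot_error(1, 11.0))   # a goal from 11 m leaves a positive error
print(shot_error(0, 30.0))   # a long-range miss leaves a small negative one
```

Summing `shot_error` over a team's shots, and over its opponents' shots, produces the error terms used in the equations below.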

Of course, some goals are own goals, which do not come from a team's own shots. We will denote own goals for as *OGF* and own goals against as *OGA*. Using this notation, we can state the number of goals for, or *GF*, this way:

(1)  GF = OGF + Σ_{i=1}^{N} (ExpG_i + e_i)

Similarly, the number of goals against, or *GA*, is this:

(2)  GA = OGA + Σ_{j=1}^{M} (ExpG_j + e_j)

Combining the two terms, goal difference, or *GD*, can be stated like this:

(3)  GD = GF − GA = (OGF − OGA) + Σ_{i=1}^{N} (ExpG_i + e_i) − Σ_{j=1}^{M} (ExpG_j + e_j)

We can simplify this a bit by calling own goal difference *OGD* and expected goal difference *ExpGD*:

(4)  GD = OGD + ExpGD + Σ_{i=1}^{N} e_i − Σ_{j=1}^{M} e_j

And we can simplify further by replacing the individual error terms for each shot with an average error term. In other words, if a team converted its shots at a rate two percentage points higher than expected, then its average error for shooting, which we'll call *eSh*, would be 0.02. Similarly, if a team's opponents converted their shots at a rate two percentage points higher than expected, its average error for conceding, or *eCo*, would be 0.02. Here is how the notation looks in our equation:

(5)  GD = OGD + ExpGD + N·eSh − M·eCo

Now, let’s say *eCo* is -0.03 – in other words, the team conceded goals at a rate three percentage points lower than expected, on average. This means that its probability of stopping a shot (where stopping is interpreted broadly as anything that keeps a shot from becoming a goal) was three percentage points higher than expected. So we can replace the negative *eCo* term with a positive *eSt* term, for the average stopping error:

(6)  GD = OGD + ExpGD + N·eSh + M·eSt

In my current thinking, this is the main equation that matters in soccer. To interpret the equation, it helps to consider each term in turn. *OGD* is pretty clearly a product of luck, or at least factors that today’s data fail to explain. A team may be able to affect *N* and *M* through its tactics, but there is plenty of evidence that *eSh* and *eSt* regress to the mean – which is zero – within a season and between seasons. *ExpGD*, on the other hand, is fairly predictive and consistent.
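The decomposition can be sanity-checked on toy data: build the four terms from lists of shots and confirm they add back up to goal difference. All of the numbers below are invented for illustration.

```python
# Toy check of equation (6): GD = OGD + ExpGD + N*eSh + M*eSt.
# Each shot is a pair (ExpG, goal), where goal is 0 or 1.
shots_for = [(0.76, 1), (0.10, 0), (0.05, 0), (0.32, 1), (0.08, 0)]
shots_against = [(0.50, 1), (0.12, 0), (0.04, 0), (0.22, 0)]
OGF, OGA = 1, 0  # own goals for and against

N, M = len(shots_for), len(shots_against)
GF = OGF + sum(g for _, g in shots_for)
GA = OGA + sum(g for _, g in shots_against)

OGD = OGF - OGA
ExpGD = sum(x for x, _ in shots_for) - sum(x for x, _ in shots_against)
eSh = sum(g - x for x, g in shots_for) / N       # average shooting error
eCo = sum(g - x for x, g in shots_against) / M   # average conceding error
eSt = -eCo                                       # average stopping error

lhs = GF - GA                                    # goal difference directly
rhs = OGD + ExpGD + N * eSh + M * eSt            # the four components
print(lhs, rhs)  # the two sides agree (up to floating-point noise)
```

Because the components are defined by construction, the identity holds for any shot lists you substitute.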

Breaking down equation (6) allows me to put a cash value on players. I only need to consider their contributions to the four terms above – exactly how many of the terms each player influences is a matter of judgment – and then combine the results with the value of goals scored and conceded by a given team. This method gives the same player a different value (within a confidence interval) for every team in the league, which I believe to be entirely appropriate.

Equation (6) can also be restated in terms of other popular metrics, but doing so shows how messy the interpretation of those metrics can be. For an example, take *PDO* and *TSR*. The first metric is essentially a team’s shooting percentage added to its stopping percentage. If we assign the league averages for shooting and stopping percentage to the terms *LSh* and *LSt*, we can state a broad version of *PDO* as *eSh + LSh + eSt + LSt*. Total shots ratio, or *TSR*, is a team’s own shots divided by the sum of its shots and shots by the opposition, or *N* divided by *N + M*. We can incorporate both of these metrics into the equation above, noting that the sum of *LSh* and *LSt* is constrained to be 1:

(7)  GD = OGD + ExpGD + (N/TSR)·(PDO − 1) − M·eSh − N·eSt

Bizarrely, *GD* seems to fall when *TSR* rises. Moreover, the interpretation of the last two terms is a bit difficult; why should higher shooting and stopping errors lead to lower *GD*? But this isn’t really the case, since *N* and *M* are components of *TSR* and *eSh* and *eSt* are components of *PDO*. The problem is that *TSR* and *PDO* don’t enter the equation cleanly on their own. You can try to insert them any way you like, but you’ll always end up with some of those other terms kicking around.
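One way to see the algebra: since *PDO* − 1 = *eSh* + *eSt* (because *LSh* + *LSt* = 1) and *N* + *M* = *N*/*TSR*, the shot-error terms can be rewritten as *N·eSh* + *M·eSt* = (*N*/*TSR*)(*PDO* − 1) − *M·eSh* − *N·eSt*, which is what drags the leftover terms into the restatement. A quick numerical check, with invented season totals:

```python
# Check that the TSR/PDO restatement matches the direct form:
# N*eSh + M*eSt == (N/TSR)*(PDO - 1) - M*eSh - N*eSt.
# All inputs below are invented for illustration.
N, M = 520, 480            # shots taken and faced over a season
eSh, eSt = 0.012, -0.008   # average shooting and stopping errors

TSR = N / (N + M)
PDO = 1 + eSh + eSt        # broad PDO: eSh + LSh + eSt + LSt, with LSh + LSt = 1

direct = N * eSh + M * eSt
restated = (N / TSR) * (PDO - 1) - M * eSh - N * eSt
print(direct, restated)    # identical up to floating-point noise
```

The two expressions always agree, which is the point: the apparently negative *TSR*, *eSh*, and *eSt* effects cancel against the terms hidden inside *TSR* and *PDO*.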

I dislike both *TSR* and *PDO* because neither can be part of a simple decomposition like the one in equation (6). They may be correlated with goal difference and have other interesting properties, but they are not basic parts of any equation I know that explains *GD* completely. I see them as approximations that work in some circumstances, rather than covering the entire universe – a bit like Newtonian physics versus Einsteinian physics.

All the terms in equation (6) can be measured, and all of them – perhaps excepting *OGD* – can be decomposed to show the influence of individual players. If one term is luck and two others regress to the mean, then *ExpGD* should be the most reliable explainer of goal difference. And indeed, it is more predictive of goal difference as a season progresses than *TSR*, shots on target difference, and other popular metrics. Until a better version of equation (6) comes along, why use anything else?