There’s a shift taking place in the soccer analytics landscape, and it’s probably overdue. For the past year or so, expected goals models based on shots have been very much in vogue. But recently, several analysts and commentators – notably Max Odenheimer, Richard Whittall, and myself – have pointed out that every situation on the field has some chance of turning into a goal. The big questions are whether these situations can be measured and, if so, whether the resulting models are superior to existing alternatives.
The answer to the first question depends in large part on the quality of data. Anyone who’s watched a fair amount of soccer can name a few situations that are likely to lead to goals: a two-on-one fast break, a one-on-one with the goalkeeper, a centerback left alone for a free header, etc. Finding these situations in game data is more difficult. Tracking data identify the positions of all players and the ball, but they’re not always available for all the teams in a league, or indeed in the same format across leagues. But we can still identify some situations using on-the-ball data like those collected by Opta.
I chose a very simple definition of a situation for this analysis: passages of play in a danger zone where scoring was likely. A focus on danger zone entries is nothing new; a Dick Bate presentation emphasizing their importance that has been making the rounds recently is several years old. Through a modest search, I identified a danger zone in the final third of the field where, once the ball entered in possession of the attacking team, there was a non-negligible chance of a goal – any kind of goal, including penalties and own goals – being scored before the ball left the zone. Then I measured the probabilities of a goal depending on where the ball entered the danger zone, as summarized on this graph for the English Premier League in 2012-13:
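The estimation step can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the entry records, coordinate system, and bin size are all assumptions, since the on-the-ball data format isn't shown here. The idea is simply to bin entries by where the ball crossed into the danger zone and compute the empirical goal probability in each bin.

```python
from collections import defaultdict

# Hypothetical entry records: (x, y) location where the ball entered the
# danger zone (in metres, x toward the goal line, y across the field),
# plus whether a goal followed before the ball left the zone.
entries = [
    (12.0, 4.0, True),
    (18.0, -10.0, False),
    (9.0, 2.0, True),
    (25.0, 15.0, False),
    (11.0, -3.0, False),
]

def goal_probability_by_bin(entries, bin_size=10.0):
    """Empirical P(goal | entry location), binned into bin_size-metre squares."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [goals, total entries]
    for x, y, goal in entries:
        key = (int(x // bin_size), int(y // bin_size))
        counts[key][1] += 1
        if goal:
            counts[key][0] += 1
    return {k: goals / total for k, (goals, total) in counts.items()}

probs = goal_probability_by_bin(entries)
```

With real data one would use far more entries per bin (or a smoother model) before trusting the probabilities; with a handful of events per bin the estimates are mostly noise.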
My next step was to see how well danger zone entries correlated with results. I weighted each entry by the probability of a goal occurring and summed across all of each team’s games, essentially giving the expected goals from danger zone entries. (Note that since other situations can also lead to goals, I wouldn’t expect these expected goals alone to have a slope of one versus actual goals.) The next graph shows how final positions in the table were related to the expected goals from danger zone entries:
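The aggregation itself is just a weighted sum. A minimal sketch, with illustrative team names and probabilities rather than real 2012-13 figures: each entry carries the goal probability for its entry location (from a lookup like the one described above), and summing those per team gives expected goals from danger zone entries.

```python
def expected_goals_by_team(entry_log):
    """entry_log: (team, prob) pairs, one per danger-zone entry, where
    prob is the goal probability for that entry's location.
    Returns total expected goals from entries per team."""
    totals = {}
    for team, prob in entry_log:
        totals[team] = totals.get(team, 0.0) + prob
    return totals

# Illustrative log, not real data.
entry_log = [
    ("Team A", 0.12), ("Team A", 0.05), ("Team B", 0.08),
    ("Team B", 0.20), ("Team A", 0.09),
]
xg = expected_goals_by_team(entry_log)
```

Regressing these team totals against actual goals or final league position is then a standard scatter-plot exercise; as noted above, the slope against actual goals shouldn't be one, since goals also come from situations outside the danger zone.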
The graph shows a pretty strong correlation, comparable to what a shots-based expected goal model would yield, suggesting that danger zone entries might be a useful metric for assessing teams. Of course, I can also break it down for individual players, once again weighting each danger zone entry by the expected goals likely to result. Here I decided to split the credit between passers and receivers of the ball in the danger zone, while giving a single player full credit for individual efforts such as interceptions and control of loose balls. The following table lists the top players by total expected goals per 90 minutes (among those who played at least 1,900 minutes total), with a breakdown into the three types of actions:
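The credit-splitting scheme can be sketched like this. The even 50/50 split between passer and receiver is an assumption for illustration (the exact split used here isn't specified), and the event structure and player names are hypothetical.

```python
from collections import defaultdict

def credit_players(events, minutes, min_minutes=1900):
    """events: (kind, prob, passer, actor) tuples, kind in
    {'pass', 'individual'}; prob is the expected goals for that entry.
    For passes, credit is split between passer and receiver (actor);
    for individual efforts (interceptions, loose balls), the actor
    gets full credit. Returns expected goals per 90 minutes for
    players over the minutes threshold."""
    totals = defaultdict(float)
    for kind, prob, passer, actor in events:
        if kind == "pass":
            totals[passer] += prob / 2   # half credit to the passer
            totals[actor] += prob / 2    # half to the receiver
        else:
            totals[actor] += prob        # full individual credit
    return {
        p: total * 90.0 / minutes[p]
        for p, total in totals.items()
        if minutes.get(p, 0) >= min_minutes
    }
```

The minutes filter mirrors the 1,900-minute cutoff used for the table, keeping small-sample players out of the per-90 ranking.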
It’s an interesting list; Luis Suárez may be showing signs of the big season to come, and Andy Carroll may be illustrating why he’s one of Sam Allardyce’s favorite players. And if you see anyone who scores 0.07 or better in each of the three categories, well, sign him up right away.
So what are the takeaways from this exercise? For teams, the graph showing danger zone entries and final positions suggests that this method suffers from some of the same drawbacks as shot-based expected goals models; it doesn’t know why Manchester United won the title, and it’s not particularly good at distinguishing between teams in small bunches. For players, this is mostly a rating of offensive prowess; to bring defense into the picture, we’d have to give defending players demerits for permitting danger zone entries to happen. One side note: Including instances where the ball entered the danger zone but the attacking team failed to keep possession – Max Odenheimer’s “phantoms” – doesn’t make much difference to the correlation with results, and as the basis for a player rating it could also create some bad incentives.
In the near future, I’ll be using tracking data to look at different kinds of situations like the ones I mentioned above. But if this simple analysis can already yield some interesting – and perhaps even useful – results, then expected goals from situations may indeed be a fruitful direction for research.