Umetrics Suite Blog

Can multivariate data analysis predict the winner of the World Cup?

June 13, 2018

Multivariate data analysis (MVDA) is a statistical technique that can be used to analyze data with more than one variable in order to look for deviations and understand the relationships between the different data points. In practice, this can mean taking data from a number of different sources and turning it into meaningful information from which you can draw some conclusions.



Multivariate data analysis can provide insights into how the betting odds from various online game sites might be used to predict the outcome of the World Cup.

Typically MVDA is used to analyze processes like those used in manufacturing or laboratory experiments, examine research data, or even crunch numbers on the stock exchange. It can even be used to “predict” outcomes given a specific set of variables and understanding of past behavior. (More on that here).

In theory (as well as in practice) this can be applied to predicting the outcome of events such as sporting games, like the 2018 World Cup in football (soccer). So we thought we’d have a go at it. Can we use MVDA to predict the likely winner of the 2018 World Cup? Perhaps. But we can surely use data to explain how MVDA works and maybe give you a tip or two for how to select betting odds.

Curious? Let’s take a look at some of the variables involved in determining which team is most likely to win the 2018 World Cup. To do this, we’ll consider the betting odds of 22 online gaming sites (X-variables), along with the FIFA-ranking (Y-variable). Using these, we’ll look at the 32 international teams (observations) that have qualified for the finals. All together, the data looks like this:


(Data compiled November 16, 2017. See note below).
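To make the layout concrete, here is a minimal sketch of how such a table might be held in code: one row per team (observation), one column per betting site (X-variable), and the FIFA-ranking kept separately as the Y-variable. The team labels, site names and odds are made-up placeholders, not the values used in this post, and scaling each column to unit variance is an assumption about preprocessing that is common in MVDA rather than something stated here.

```python
from statistics import mean, stdev

# Hypothetical decimal odds: rows are teams (observations),
# columns are betting sites (X-variables). Not real data.
sites = ["site_a", "site_b", "site_c"]
odds = {
    "team_1": [5.5, 5.0, 6.0],
    "team_2": [9.0, 10.0, 8.5],
    "team_3": [34.0, 41.0, 29.0],
    "team_4": [501.0, 1001.0, 751.0],
}
# Hypothetical Y-variable: one ranking position per team.
fifa_rank = {"team_1": 1, "team_2": 6, "team_3": 20, "team_4": 55}


def autoscale(table):
    """Center each column to mean 0 and scale it to unit variance."""
    teams = list(table)
    columns = list(zip(*(table[t] for t in teams)))
    centers = [mean(col) for col in columns]
    spreads = [stdev(col) for col in columns]
    return {
        t: [(x - c) / s for x, c, s in zip(table[t], centers, spreads)]
        for t in teams
    }


scaled = autoscale(odds)
print("team", sites)
for team, row in scaled.items():
    print(team, [round(v, 2) for v in row])
```

After scaling, every site contributes on the same footing, so a site that simply quotes larger numbers does not dominate the analysis.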

Examining the data

Some of the questions we can look at are:

▪ Who is a likely winner to bet on?

▪ Which teams should we not bet on?

▪ Which betting firms can be “trusted” (in the sense they provide similar odds profiles)?

▪ Is there a betting firm that should be avoided?

▪ Is there any relationship between betting odds and FIFA-ranking?

▪ What about variability with time?

Favorites to win

Looking at the data we can see a correlation among the betting sites for teams that are favored to win versus those which are not likely to win. According to this data, Panama and Saudi Arabia are not very likely to win.


(This chart shows a central cluster of favorites along with a few outliers and deviations).

So which team is favored among all the sites as a possible winner? If we zoom in we can see the favorites are in a cluster. Those given the best chances of winning are at the top left of the plot (Germany, Brazil, France, Spain, Argentina).


(The central cluster of data shows the teams most favored to win).
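The post does not say which model produced these plots. A common choice in MVDA is principal component analysis (PCA), so the sketch below assumes PCA and computes only the first component, using plain power iteration. The odds are invented for illustration, and the log transform is also an assumption, made so that the long shots do not swamp everything else.

```python
import math

# Hypothetical decimal odds from three sites for five teams (not real data).
odds = {
    "favorite_1": [5.0, 5.5, 5.2],
    "favorite_2": [6.0, 6.5, 6.1],
    "mid_team": [40.0, 34.0, 41.0],
    "long_shot_1": [500.0, 1000.0, 750.0],
    "long_shot_2": [1000.0, 1500.0, 1000.0],
}

teams = list(odds)
# Assumption: work on log10 odds, then mean-center each column.
X = [[math.log10(v) for v in odds[t]] for t in teams]
n, k = len(X), len(X[0])
col_means = [sum(row[j] for row in X) / n for j in range(k)]
Xc = [[row[j] - col_means[j] for j in range(k)] for row in X]


def project(data, p):
    """Scores t = X p: one number per row."""
    return [sum(x * w for x, w in zip(row, p)) for row in data]


def first_component(data, iterations=200):
    """Loading and scores of the first principal component (power iteration)."""
    cols = len(data[0])
    p = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        t = project(data, p)
        # p <- X't, then rescale to unit length
        p = [sum(row[j] * ti for row, ti in zip(data, t)) for j in range(cols)]
        norm = math.sqrt(sum(w * w for w in p))
        p = [w / norm for w in p]
    return p, project(data, p)


loading, scores = first_component(Xc)
for team, score in sorted(zip(teams, scores), key=lambda pair: pair[1]):
    print(f"{team:12s} {score:+.2f}")
```

Teams with similar odds profiles end up with similar scores, which is what produces a tight cluster of favorites and a few points far away from it.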

Look at the deviations

As the previous chart shows, there are several outliers, or deviations, in the data. Two of these deviations are Panama and Saudi Arabia. Let’s take a closer look at why this is. A contribution plot can reveal the cause of the deviation. Here, we can see there are large odds from all the betting firms for these two teams, which explains why they lie outside the central area for potential winners.




In practical terms, that means the possible payout for betting on either of these two teams is high. If you’re someone who likes to take risks, then this is where you’ll want to place your bets for the highest possible payout. You’re not very likely to win, however, according to the statistical analysis.
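The trade-off between payout and chance of winning is simple arithmetic. The sketch below assumes the odds are quoted in decimal format (the post does not say which format the sites use) and ignores the bookmaker’s margin. The two odds values are made up.

```python
def payout(stake, decimal_odds):
    """Total amount returned on a winning bet, stake included."""
    return stake * decimal_odds


def implied_probability(decimal_odds):
    """Probability implied by the odds, ignoring the bookmaker's margin."""
    return 1.0 / decimal_odds


# Hypothetical decimal odds, not taken from the data set in this post.
for label, odds in [("favorite", 5.5), ("long shot", 1001.0)]:
    print(
        f"{label}: a stake of 10 returns {payout(10, odds):.0f}, "
        f"implied chance {implied_probability(odds):.1%}"
    )
```

The larger the odds, the bigger the return and the smaller the implied chance of ever collecting it.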

Other deviations

The data also shows deviations for Tunisia, Costa Rica and Iran.


A contribution plot can reveal the cause of the deviation. Again, we can see that we have large betting odds from a couple of the betting firms (but not all of them).
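As a rough illustration of the idea, the sketch below compares one team’s log odds at each site with the average team at that site. This is a simplified stand-in: contribution plots in MVDA software typically also scale the variables and weight them by the model, which is left out here. The site names, team labels and odds are invented.

```python
import math
from statistics import mean

sites = ["site_a", "site_b", "site_c"]
# Hypothetical decimal odds (not real data); site_c is far more
# generous towards team_x than the other two sites.
odds = {
    "team_1": [6.0, 6.5, 6.0],
    "team_2": [9.0, 8.0, 9.0],
    "team_3": [26.0, 29.0, 26.0],
    "team_4": [51.0, 41.0, 51.0],
    "team_x": [101.0, 81.0, 501.0],
}


def contributions(table, team):
    """Deviation of one team's log odds from the average team, per site."""
    result = {}
    for j, site in enumerate(sites):
        column = [math.log10(row[j]) for row in table.values()]
        result[site] = math.log10(table[team][j]) - mean(column)
    return result


for site, value in contributions(odds, "team_x").items():
    print(f"{site}: {value:+.2f}")
```

A bar that towers over the others points to the site responsible for the deviation, which is the pattern described here: a couple of firms, but not all of them.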


If we take a closer look at the firm with the highest odds (888sport), it shows a much higher deviation for several teams: Saudi Arabia, Panama, Tunisia, Iran, and Costa Rica. Betting on any of these teams with this firm would generate a higher return than with the others if these teams were to win.


Rankings and odds

The next question is, of course, whether the betting firms all have the same rankings. Yes, to a large extent, they do. But some have their own opinions about favorites (and these may provide better odds for betting).


For example, there are minor differences in which teams the firms dislike: several firms (Skybet and Ladbrokes, for example) downplay Panama and Saudi Arabia, while 888sport downplays Tunisia, Costa Rica and Iran. The plot shows these firms straying from the pack in terms of their high and low points on the chart.

Relation to FIFA ranking

Another question we might ask is if there is any relation between the odds and FIFA-ranking?


In fact, there is a fairly strong relationship between the odds and the FIFA-ranking: along the horizontal axis, 61.6% of the variability in the odds is correlated with the FIFA-ranking. The chance of the odds variability not being related to the FIFA-ranking is 34.5%. (The vertical axis represents 28.4%).
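A percentage like this can be read as a share of variability explained. The simplest analogue is the squared Pearson correlation between a one-number summary of each team’s odds and its ranking, sketched below. This is a swapped-in simplification and not the multivariate model behind the plot, and the numbers are invented.

```python
import math

# Hypothetical data (not from this post): mean log10 odds per team
# and the matching ranking position.
mean_log_odds = [0.7, 0.8, 1.1, 1.5, 1.9, 2.4, 2.9]
fifa_rank = [1, 3, 9, 8, 25, 30, 60]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson(mean_log_odds, fifa_rank)
print(f"r = {r:.3f}, share of variance explained r^2 = {r * r:.1%}")
```

The same function can be run site by site to see which firms track the ranking most closely.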

So which betting firms follow the FIFA ranking most closely?

That would be BetStars, sportingbet, SunBETS, William Hill and so on in decreasing order as shown here:

image 10

And those which deviate from the FIFA-ranking (which may mean better odds for you if you want to bet on one of the least-favored teams) are BoyleSports, BETRIGHT, Ladbrokes, etc., in decreasing order as seen here:

image 11

Another question we might ask: is there any geographical spread of favorites?

If we look at the data, we can see the odds setters predict the winner will be from Europe or South America.

image 12


Actual versus predicted FIFA-ranking

The largest deviation overall is for the host nation (Russia). It’s not unusual for the host nation to be given a better chance of winning than it would otherwise have. There is a known phenomenon in sports here: the support of the home audience can have a profound effect on the way the team plays and can lead to victories that otherwise might not happen. This is true here as well. The betting firms give Russia a higher chance of winning than its FIFA-ranking would suggest. It shows up as an outlier on the plot below.
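One way to picture an actual-versus-predicted comparison is to fit a straight line from a summary of the odds to the ranking and look at what is left over. The sketch below uses ordinary least squares as a simple stand-in for the model behind the plot. The team labels and numbers are hypothetical, with a “host” entry whose ranking is much worse than its odds would suggest.

```python
# Hypothetical data (not from this post).
teams = ["team_1", "team_2", "team_3", "team_4", "team_5", "host"]
mean_log_odds = [0.7, 1.0, 1.5, 2.0, 2.5, 1.6]
fifa_rank = [2, 8, 18, 28, 38, 65]


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


slope, intercept = fit_line(mean_log_odds, fifa_rank)
# Residual = actual ranking minus the ranking predicted from the odds.
residuals = {
    team: actual - (slope * x + intercept)
    for team, x, actual in zip(teams, mean_log_odds, fifa_rank)
}
outlier = max(residuals, key=lambda team: abs(residuals[team]))
print(outlier, round(residuals[outlier], 1))
```

The team with the largest residual is the one whose odds and ranking disagree the most, which is how an outlier like the host nation stands out.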


Who’s going to win?

In summary, if you’re looking to bet on a team, the safe choices, across all the betting firms are:

  • Germany
  • France
  • Brazil
  • Argentina
  • Spain

These five are the favorites, and the data shows strong betting odds in their favor. Depending on how much of a risk taker you are, betting on one of these teams, with a betting firm that gives you good odds for them, will be in your favor.

If you like a specific team, for example, Russia, you might go with a firm that gives especially high odds for that team compared to the others, for a higher payout. Some teams in the mid-layer (Russia, Croatia, Colombia, Uruguay) could be good for betting with a better payout in that case.


And then there are the “bargains”: other teams (Sweden, Poland, Denmark, Mexico, Switzerland) that have “up-weighted” odds on some betting sites, but generally carry higher odds for winning. So betting on these could have a good payout if you identify the betting firms that deviate from the others with higher odds.

So, I’m sure you want to know. Who do I think will win?

That would be Germany. I can quote Gary Lineker here:

“Football is a simple game. Twenty-two men chase a ball for 90 minutes and at the end, the Germans always win.” (Gary Lineker) 




Some conclusions we can draw from this data analysis:

  • Overall, strong correlations exist among all of the betting firms.
  • Favorite teams have similar odds profiles across all betting firms.
  • The relationship with FIFA-ranking is fairly strong (61.6% of the odds variability is correlated with FIFA-ranking).
  • The favorite teams are European and South American teams, where some up-weighting occurs for the host nation.

So, with that, enjoy the game! And may the odds be always in your favor.


Want to know more about Multivariate Data Analytics?

You can find a beginner’s guide blog article here.






(Note: The data on betting odds for this blog post were compiled November 16, 2017, i.e. before the final draw into the eight groups comprising four teams each. Data may have shifted since then. Bear in mind that the objective of this blog post is to shed light on how advanced data analytics can be used to highlight interesting information and major trends in betting data, not actually select outcomes for this game. In addition, this post is not meant to endorse, recommend or in any way support or oppose any particular online betting site or condone or encourage the act of betting.)


Topics: Multivariate Data Analysis, Data Analytics

Lennart Eriksson

Written by Lennart Eriksson

Sr Lecturer and Principal Data Scientist at Sartorius Stedim Data Analytics