120 Years of Olympic History

Introduction

The Olympics, first held in 1896, have been a global event for over 120 years. The games have featured the best athletes from around the world, competing in a variety of sports. Although the Olympics are seen as the ultimate test of fair competition at the highest level, not all countries are equal in terms of their success.

Some might argue that the success of a country in the Olympics is related to the wealth of that country. Others might claim that larger countries with more people have a stark advantage. Still others might argue that being the host country confers significant benefits.

In reality, the true relationship is likely a combination of all three (and many other factors). In this report, we explore these three factors and their relationship to Olympic success using data from past events and GapMinder socioeconomic data.

Loading the Data

See the included .Rmd file for the full code.

First, we loaded the Kaggle data. It includes athlete_events.csv, which contains a record for every athlete in every event in the Olympic Games from 1896 to 2016.
- Each record includes the medal the athlete won in that event (if any).
The Kaggle data also includes noc_regions.csv, which stores the National Olympic Committee (NOC) regions represented in the Olympics.

Then, we loaded the GapMinder data, which includes socioeconomic data for countries collected every 5 years from 1952 to 2007.
To facilitate merging, we also loaded the country codes data, which includes ISO 3166-1 country codes and lets us join the GapMinder data to the Olympic data.
Finally, we loaded host_cities.csv, which contains the host city and corresponding country for each Olympic Games.

Cleaning the Data

We repaired the noc_regions data by adding SGP, which has been the preferred NOC code for Singapore over SIN since 2016.
There were also a few inconsistencies between the region names in noc_regions and those in country_codes. We listed the corrections manually and then used that list to update the noc_regions data. Examples include Palestine being referred to as West Bank and Gaza, and misspellings such as Boliva instead of Bolivia.
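The correction step can be sketched as a named lookup vector applied to the region column. This is a minimal illustration in base R, not the report's actual code: the three-row noc_regions stand-in is hypothetical, and the direction of the Palestine correction (renaming it to match country_codes) is an assumption.

```r
# Toy stand-in for noc_regions; the real file has many more rows.
noc_regions <- data.frame(
  NOC    = c("BOL", "PLE", "FRA"),
  region = c("Boliva", "Palestine", "France"),
  stringsAsFactors = FALSE
)

# Add the SGP code for Singapore alongside the existing entries.
noc_regions <- rbind(
  noc_regions,
  data.frame(NOC = "SGP", region = "Singapore", stringsAsFactors = FALSE)
)

# Manual corrections: names are the values found in noc_regions,
# elements are the spellings assumed to be used in country_codes.
corrections <- c(
  "Boliva"    = "Bolivia",
  "Palestine" = "West Bank and Gaza"
)

# Replace only the regions that appear in the corrections list.
needs_fix <- noc_regions$region %in% names(corrections)
noc_regions$region[needs_fix] <- unname(corrections[noc_regions$region[needs_fix]])

print(noc_regions)
```

Keeping the corrections in one named vector makes it easy to review them and to add new ones as further mismatches turn up.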

Merging the Data

First, we obtained the number of medals won by each country in each year/season combination from athlete_events.
Then, we extracted the city and year/season combination for each unique Olympic Games.
After that, we calculated the number of athletes each country sent by counting the distinct names in athlete_events, and joined this with the two previous dataframes to get a comprehensive dataframe of medals won by each country in each year/season combination.
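A minimal base R sketch of the aggregation described above, using a hypothetical six-row stand-in for athlete_events (the column names Name, NOC, Games, and Medal are assumed to match the Kaggle file). Each row counts one athlete's medal, so a team medal would be counted once per team member; whether the report deduplicates those is not stated.

```r
# Hypothetical miniature of athlete_events: one row per athlete per event.
athlete_events <- data.frame(
  Name  = c("Ann", "Ann", "Bob", "Cleo", "Dan", "Dan"),
  NOC   = c("USA", "USA", "USA", "FRA", "FRA", "FRA"),
  Games = rep("1952 Summer", 6),
  Medal = c("Gold", NA, "Silver", "Bronze", NA, "Gold"),
  stringsAsFactors = FALSE
)

# Indicator columns; %in% treats NA (no medal) as FALSE.
athlete_events$gold   <- as.integer(athlete_events$Medal %in% "Gold")
athlete_events$silver <- as.integer(athlete_events$Medal %in% "Silver")
athlete_events$bronze <- as.integer(athlete_events$Medal %in% "Bronze")

# Medal counts per country per Games.
medals <- aggregate(cbind(gold, silver, bronze) ~ NOC + Games,
                    data = athlete_events, FUN = sum)

# Number of distinct athlete names per country per Games.
athletes <- aggregate(Name ~ NOC + Games, data = athlete_events,
                      FUN = function(x) length(unique(x)))
names(athletes)[names(athletes) == "Name"] <- "num_athletes"

# Join the two summaries on their shared keys.
medals_by_country <- merge(medals, athletes, by = c("NOC", "Games"))
print(medals_by_country)
```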
This, in itself, isn’t enough to answer the questions we have posed. We need to merge this data with the GapMinder data to include the GDP and Population. But first, we need to add the country codes from country_codes to the GapMinder data.
The dataframe with medals won we created earlier uses the IOC country codes, while this GapMinder data uses ISO 3166-1 codes. To combine these two dataframes, we need to convert the IOC codes to ISO codes which can be done using the countrycode package.
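As an illustration of the code conversion, the sketch below maps a few IOC codes to ISO 3166-1 alpha-3 codes with countrycode::countrycode(). The four example codes are chosen here for illustration, and the fallback lookup table is hypothetical; it exists only so the snippet still runs where the package is not installed.

```r
ioc_codes <- c("GER", "NED", "SUI", "USA")

if (requireNamespace("countrycode", quietly = TRUE)) {
  # origin "ioc" and destination "iso3c" are code schemes shipped with countrycode.
  iso_codes <- countrycode::countrycode(ioc_codes,
                                        origin = "ioc",
                                        destination = "iso3c")
} else {
  # Fallback for illustration only: a hand-written table for these four codes.
  lookup <- c(GER = "DEU", NED = "NLD", SUI = "CHE", USA = "USA")
  iso_codes <- unname(lookup[ioc_codes])
}

print(data.frame(ioc = ioc_codes, iso3c = iso_codes))
```

Codes that countrycode cannot match come back as NA with a warning, which is a convenient way to find NOCs that need manual handling.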
Since the GapMinder data only goes from 1952 to 2007, we will only be able to compare GDP and population data in this timeframe.
Additionally, the GapMinder data is collected every 5 years, so to match it with the Olympics data, we interpolate between GapMinder years by averaging the GapMinder values that fall within 3 years before or after the Olympic year in question.
- This preserves the GDP and population data for years that match the Olympic data and averages them for years that fall between GapMinder years.
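One way to read this interpolation rule is as a windowed mean: for each Olympic year, average every GapMinder observation within 3 years of it. The sketch below assumes the window is inclusive at both ends; the GDP values and the window_mean helper are hypothetical.

```r
# GapMinder observation years and hypothetical GDP values for one country.
gapminder_years <- seq(1952, 2007, by = 5)
gdp <- seq(1000, by = 100, length.out = length(gapminder_years))

# Mean of all observations within `half_width` years of `year`
# (inclusive window; this boundary choice is an assumption).
window_mean <- function(year, obs_years, values, half_width = 3) {
  in_window <- abs(obs_years - year) <= half_width
  if (!any(in_window)) return(NA_real_)
  mean(values[in_window])
}

olympic_years <- c(1952, 1956, 1960, 1964)
mean_gdp <- vapply(olympic_years, window_mean, numeric(1),
                   obs_years = gapminder_years, values = gdp)

print(data.frame(Year = olympic_years, mean_gdp = mean_gdp))
```

Under this reading, 1952 keeps its own GapMinder value, 1956 takes the 1957 value (the only observation in its window), and 1960 and 1964 average the two GapMinder years on either side.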
Finally, we merge the GapMinder data with the medals won data to get a comprehensive dataframe that includes GDP and population data for each country along with the number of bronze, silver, and gold medals won by that country in each year/season combination. This is stored in the df_combined dataframe, and one randomly chosen country for each Games is printed in the table below.

## # A tibble: 29 × 13
## # Groups:   Games [29]
##    Games    NOC    gold silver bronze Season  Year num_athletes host_city region
##    <chr>    <chr> <int>  <int>  <int> <chr>  <int>        <int> <chr>     <chr> 
##  1 1952 Su… DEN       2      1      3 Summer  1952          129 Helsinki  Denma…
##  2 1952 Wi… GER       3      2      2 Winter  1952           53 Oslo      Germa…
##  3 1956 Su… USA      32     25     17 Summer  1956          305 Melbourne Unite…
##  4 1956 Wi… CAN       0      1      2 Winter  1956           35 Cortina … Canada
##  5 1960 Su… RSA       0      1      2 Summer  1960           55 Roma      South…
##  6 1960 Wi… NED       0      1      1 Winter  1960            7 Squaw Va… Nethe…
##  7 1964 Su… ROU       2      4      6 Summer  1964          138 Tokyo     Roman…
##  8 1964 Wi… AUT       4      5      3 Winter  1964           83 Innsbruck Austr…
##  9 1968 Su… BRA       0      1      2 Summer  1968           76 Mexico C… Brazil
## 10 1968 Wi… FRA       4      3      2 Winter  1968           75 Grenoble  France
## 11 1972 Su… ETH       0      0      2 Summer  1972           31 Munich    Ethio…
## 12 1972 Wi… ESP       1      0      0 Winter  1972            3 Sapporo   Spain 
## 13 1976 Su… PUR       0      0      1 Summer  1976           80 Montreal  Puert…
## 14 1976 Wi… FRA       0      0      1 Winter  1976           35 Innsbruck France
## 15 1980 Su… IND       1      0      0 Summer  1980           71 Moskva    India 
## 16 1980 Wi… ITA       0      2      0 Winter  1980           46 Lake Pla… Italy 
## 17 1984 Su… SUI       0      4      4 Summer  1984          129 Los Ange… Switz…
## 18 1984 Wi… SUI       2      2      1 Winter  1984           42 Sarajevo  Switz…
## 19 1988 Su… AUT       1      0      0 Summer  1988           73 Seoul     Austr…
## 20 1988 Wi… SUI       5      5      5 Winter  1988           70 Calgary   Switz…
## 21 1992 Su… QAT       0      0      1 Summer  1992           28 Barcelona Qatar 
## 22 1992 Wi… NZL       0      1      0 Winter  1992            6 Albertvi… New Z…
## 23 1994 Wi… UKR       1      0      1 Winter  1994           37 Lilleham… Ukrai…
## 24 1996 Su… JPN       3      6      5 Summer  1996          306 Atlanta   Japan 
## 25 1998 Wi… KOR       3      1      2 Winter  1998           37 Nagano    Korea…
## 26 2000 Su… CZE       2      3      3 Summer  2000          119 Sydney    Czech…
## 27 2002 Wi… RUS       5      4      4 Winter  2002          151 Salt Lak… Russia
## 28 2004 Su… CMR       1      0      0 Summer  2004           17 Athina    Camer…
## 29 2006 Wi… SUI       5      4      5 Winter  2006          125 Torino    Switz…
## # ℹ 3 more variables: iso3c <chr>, host_country <chr>, mean_gdp <dbl>

Is Wealth Correlated with Olympic Success?

Is there a relationship between the wealth of a country and (i) the number of athletes it sends to the Olympics and (ii) the number of medals that it wins? To address this question, we generated two plots using the combined dataframe that contains GDP data and medals won data.

The first plot evaluates the relationship between GDP on the x-axis and the number of athletes that a country sends on the y-axis, using a logarithmic scale. The plot separates Summer from Winter data, with each dot representing a country in a particular year of the Olympics data set. We have included a trend line that shows a positive correlation between a country’s GDP and the number of athletes it sends to the Olympics. Countries with a higher GDP per capita tend to send more athletes, as we can see from the data points concentrated on the right side of the figure. This is possibly due to wealthier countries having access to more resources, including advanced training facilities.

In the second plot, GDP is compared with total medals won, and the correlation again appears to be positive. Countries with a higher GDP per capita appear to win more medals than countries with a lower GDP per capita. Both plots show that the Summer Olympics tend to have more athletes participating and more medals awarded than the Winter Olympics. These data suggest that a country’s GDP is strongly associated with its Olympic success, with higher-GDP countries tending to win more medals and send more competing athletes.

Given the positive correlation between a country’s GDP and its Olympic success and presence, we became interested in whether any cases didn’t follow this trend. To find these exceptions, we established quartiles and then created an exceptional column that flags a data point as either wealthy-few medals (purple) or poor-many medals (orange). The first plot shows the number of athletes sent, with each point marked as normal (gray), poor, or wealthy. There appears to be a respectable presence of athletes from countries with a GDP lower than 3,000, suggesting that they are still able to compete even without much financial support. Interestingly, some higher-GDP countries send relatively few athletes, suggesting that a strong economy doesn’t always mean a large delegation.
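The quartile-based flag could look something like the base R sketch below. The exact thresholds the report used are not stated, so this assumes "wealthy" and "many medals" mean the top quartile and "poor" and "few medals" mean the bottom quartile; the eight rows of data are hypothetical.

```r
# Hypothetical country-level rows with GDP and medal totals.
df <- data.frame(
  NOC          = c("AAA", "BBB", "CCC", "DDD", "EEE", "FFF", "GGG", "HHH"),
  mean_gdp     = c(500, 800, 2000, 5000, 9000, 15000, 25000, 40000),
  total_medals = c(30, 1, 3, 5, 8, 12, 20, 0)
)

# First and third quartiles of each variable.
gdp_q   <- quantile(df$mean_gdp, c(0.25, 0.75), names = FALSE)
medal_q <- quantile(df$total_medals, c(0.25, 0.75), names = FALSE)

# Default label, then overwrite the two kinds of exception.
df$exceptional <- "normal"
wealthy_few <- df$mean_gdp >= gdp_q[2] & df$total_medals <= medal_q[1]
poor_many   <- df$mean_gdp <= gdp_q[1] & df$total_medals >= medal_q[2]
df$exceptional[wealthy_few] <- "wealthy-few medals"
df$exceptional[poor_many]   <- "poor-many medals"

print(df)
```

In this toy data, AAA (lowest GDP, most medals) is flagged as poor-many medals and HHH (highest GDP, no medals) as wealthy-few medals; every other row stays normal.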

In the second plot, we look at total medals won in comparison with GDP. As in the first plot, we see exceptional cases: poor countries winning many medals and wealthy countries winning few. Collectively, these exceptional cases suggest that factors other than economic wealth or access to training facilities have played a role in low-GDP nations having a strong presence in the Olympics.

Are Host Countries More Likely to Win?

Another natural question is whether countries perform better in the years that they host the Olympics. The evidence very weakly hints that there might be a minuscule home-turf advantage, but we found no statistically significant effect of that kind in this dataset.

## # A tibble: 8 × 4
##   was_hosting medal_type boost_average boost_standard_dev
##   <lgl>       <chr>              <dbl>              <dbl>
## 1 TRUE        Gold            0.12                  0.398
## 2 FALSE       Gold           -0.000301              0.328
## 3 TRUE        Silver          0.08                  0.369
## 4 FALSE       Silver         -0.00196               0.354
## 5 TRUE        Bronze          0.07                  0.391
## 6 FALSE       Bronze         -0.00317               0.363
## 7 TRUE        All             0.27                  0.846
## 8 FALSE       All            -0.00543               0.632

In the above table, we computed, for every Olympic Games that a country competed in, how many more medals it won in those Games than in the previous or next Games of the same season that it competed in (we’ll refer to this as the ‘boost’), and then averaged that quantity. If there were a host-country effect, we would expect the average boost for Games that a country hosted to be markedly higher than the average boost for Games that it did not host.
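The ‘boost’ can be sketched as a country's medal count minus the mean of its counts at the neighbouring same-season Games. This is one plausible reading of the definition rather than the report's code: the boost helper, the medal series, and the hosting flags are hypothetical, and raw counts are used here even though the units behind the table above are not stated.

```r
# Boost for one country's same-season medal counts, ordered by year:
# each value minus the mean of the previous and next Games
# (only one neighbour is available at either end of the series).
boost <- function(medals) {
  n <- length(medals)
  prev_games <- c(NA, medals[-n])
  next_games <- c(medals[-1], NA)
  neighbours <- rowMeans(cbind(prev_games, next_games), na.rm = TRUE)
  medals - neighbours
}

# Hypothetical series: the country hosts the third Games.
medals      <- c(10, 12, 20, 11)
was_hosting <- c(FALSE, FALSE, TRUE, FALSE)

b <- boost(medals)
print(b)

# Average and standard deviation of the boost by hosting condition.
boost_average      <- tapply(b, was_hosting, mean)
boost_standard_dev <- tapply(b, was_hosting, sd)
print(boost_average)
```

With a single hosted Games the standard deviation for the hosting condition is NA; the real dataset pools many countries and Games per condition.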

However, for every type of medal (and for total medals won), the difference in average ‘boost’ between the host and non-host conditions was far less than a single standard deviation of the ‘boost’ variable, whether that standard deviation is taken over the performances where the country was the host or over those where it was not. The two distributions therefore overlap almost entirely. Strictly speaking, a formal significance test compares the difference in means to its standard error, which also depends on the number of observations in each condition. As a rough yardstick, though, the gap here is nowhere near the two-standard-deviation separation we looked for.
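For a formal comparison of the two conditions, a two-sample Welch t-test is one standard option; it divides the difference in means by its standard error rather than by a raw standard deviation. The sketch below uses small hypothetical boost vectors purely to show the mechanics, so its output says nothing about the real dataset.

```r
# Hypothetical boost values for hosted and non-hosted Games.
boost_host  <- c(0.5, -0.2, 0.3, 0.1, -0.4, 0.6)
boost_other <- c(0.1, -0.3, 0.2, 0.0, -0.1, 0.3, -0.2, 0.1)

# t.test() performs Welch's unequal-variance test by default.
result <- t.test(boost_host, boost_other)

# The same statistic by hand: difference in means over its standard error.
se <- sqrt(var(boost_host) / length(boost_host) +
           var(boost_other) / length(boost_other))
t_manual <- (mean(boost_host) - mean(boost_other)) / se

print(result$statistic)
print(result$p.value)
```

Because the standard error shrinks as the number of observations grows, the same difference in means can be significant in a large sample and not in a small one.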

The above graph makes clear how much the two conditions overlap. The middle bar of each crossbar box is the average boost for a given medal type in a given condition (i.e. when a country competed in an Olympic Games that it was hosting vs. one that it wasn’t hosting). The top and bottom of each box mark the range of two standard deviations around that average. These ranges dwarf the difference in average boost between the two conditions.

Is Population Correlated with Olympic Success?

Earlier, we looked at the effect of GDP on Olympic success. However, we could also just look at the population. Does population size of a country affect its performance at the Summer and Winter Olympics? Is Medals Per Capita a fair way to represent it?

In this analysis, we focus on the 1952, 1972, 1992, and 2002 Summer and Winter Olympics (only Winter Games were held in 2002).

## # A tibble: 7 × 13
## # Groups:   Games [7]
##   Games        gold silver bronze Season  Year country          pop total_medals
##   <chr>       <int>  <int>  <int> <chr>  <int> <chr>          <int>        <int>
## 1 1952 Summer     0      0      1 Summer  1952 Venezuela    5439568            1
## 2 1952 Winter     3      4      2 Winter  1952 Finland      4090500            9
## 3 1972 Summer     7      5      9 Summer  1972 Poland      33039545           21
## 4 1972 Winter     2      5      5 Winter  1972 Norway       3933004           12
## 5 1992 Summer    14      6     11 Summer  1992 Cuba        10723260           31
## 6 1992 Winter     2      1      1 Winter  1992 Korea, Rep. 43805450            4
## 7 2002 Winter     3      2      6 Winter  2002 Switzerland  7361757           11
## # ℹ 4 more variables: medals_PerCapita <dbl>, gold_medals_PerCapita <dbl>,
## #   silver_medals_PerCapita <dbl>, bronze_medals_PerCapita <dbl>

Above, a sampling of this data is shown.

Above is a scatter plot comparing the population sizes of the countries that competed in the 1952, 1972, 1992, and 2002 Summer and Winter Olympics with the total number of medals that those countries won. We added a LOESS line of best fit to capture any relationship between the two variables. In the Winter Olympics, we do not see a strong correlation, while one starts to appear in the Summer Olympics: as a country’s population increases, the average number of medals it takes home increases, with a peculiar dip in total medals around a population of 50 million. Note also that fewer countries with small populations compete in the Winter Olympics, as can be seen on the x-axis of this graph. While this plot is useful, it tells us nothing about the quality of the medals earned. Here, 6 bronze medals are weighted the same as 6 gold medals. Our next graph explores this concern.

These graphs show the number of medals of each type by population. Once again, we see roughly the same distribution as before, with larger populations on average winning more medals of every type. All of the graphs presented so far show raw medal counts versus population. The pattern makes sense, because a country with a larger population has a larger pool of potential Olympians to begin with. There are still plenty of large countries with low medal counts, though, as we can see in the lower right corners of these graphs. A larger country is, on average, more likely to earn more medals of any type at the Summer Olympics. This is less true for our sample of Winter Olympics. While this is fine for a first glance, it is common practice with data sets like this to normalize the data and look for trends again. As a group, we decided to use medals per capita as a way to ‘normalize’ this data.

Above is a scatter plot much like the first one we showed for this question, except that the y-axis is now medals per capita: the total number of medals earned by a nation divided by its population. We were hoping to capture a ‘fairer’ way of representing the Olympic medal data for smaller countries. This analysis suffers from multiple shortcomings. For starters, the Olympics limits how many participants can compete from each country. Without these limits, this analysis would be more interesting and accurate. Because of the limit, a large population cannot translate into a proportionally large team, so dividing by population ‘weighs’ medals in an unfair way. For countries with very small populations that place on the podium, the “value” of each medal is going to be greater than for those with larger populations.
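To make the per capita calculation concrete, the sketch below applies it to the seven sample rows printed in the table earlier in this section (the country, pop, and total_medals values are copied from that table). The medals_per_million column is a hypothetical convenience added here for readability, and the rows come from different Games, so the ranking is illustrative only.

```r
# Sample rows copied from the table printed earlier in this section.
sample_rows <- data.frame(
  country      = c("Venezuela", "Finland", "Poland", "Norway",
                   "Cuba", "Korea, Rep.", "Switzerland"),
  pop          = c(5439568, 4090500, 33039545, 3933004,
                   10723260, 43805450, 7361757),
  total_medals = c(1, 9, 21, 12, 31, 4, 11),
  stringsAsFactors = FALSE
)

# Medals per capita: total medals divided by population.
sample_rows$medals_PerCapita <- sample_rows$total_medals / sample_rows$pop

# Hypothetical helper column: the same figure per million inhabitants.
sample_rows$medals_per_million <- sample_rows$medals_PerCapita * 1e6

# Rank the rows from highest to lowest medals per capita.
ranked <- sample_rows[order(-sample_rows$medals_PerCapita),
                      c("country", "total_medals", "medals_per_million")]
print(ranked)
```

Norway ranks first with 12 medals while Poland, with 21 medals, sits near the bottom, which shows how strongly the statistic favors small populations.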

We can also look at medal types on a per capita basis to finish our population analysis. Once again, we can see the unfairness of this type of analysis. Interestingly, we can see a diverging effect between the Summer and Winter Olympics. In the Summer Olympics, the smaller nations tend to bring home more bronzes relative to their population than golds or silvers. In the 2002 Winter Olympics, there is no such effect. As mentioned previously, these per capita figures are weighted heavily in favor of the lower-population nations.

The relationship between the population of a country and its performance in the Olympics is more complicated than meets the eye. There are multiple mechanisms built into the Olympics that try to account for the inherent differences between competing countries. Despite this, when we compare the total number of medals earned to a country’s population, we can see a distinct correlation (at least in the Summer Olympics). On average, the larger the population of a country, the more total medals it earns. No specific type of medal is overrepresented in these groups. Medals per capita is a poor way to represent country performance at the Olympics. It is an unfairly weighted statistic that overrepresents lower-population nations, regardless of whether the analysis uses total medals or medal types.

Conclusion

In this report, we have analyzed the relationship between GDP, population, host status, and Olympic success, and broadly found that wealthier and more populous countries have more Olympic success and win more medals. However, we also found that there are exceptions to this rule and that there is no statistically significant home-turf advantage in the Olympics.