Introduction
The Olympics, first held in 1896, have been a global event for over 120 years. The games have featured the best athletes from around the world, competing in a variety of sports. Although the Olympics are seen as the ultimate test of fair competition at the highest level, not all countries are equal in terms of their success.
Some might argue that a country's success in the Olympics is related to its wealth. Others might claim that larger countries with more people have a stark advantage. Still others might argue that being the host country confers significant benefits.
In reality, the true relationship is likely a combination of all three (and many other factors). In this report, we explore these three factors and their relationship to Olympic success using data from past events and GapMinder socioeconomic data.
Loading the Data
See the included .Rmd file for the full code.
- First, we loaded the Kaggle data, which includes athlete_events.csv (a record for every athlete in every event in the Olympic Games from 1896 to 2016).
  - This includes the medal they won in a particular event (if any).
- The Kaggle data also includes noc_regions.csv, which stores the National Olympic Committee (NOC) regions represented in the Olympics.
Then, we loaded the GapMinder data, which includes socioeconomic data for countries collected every 5 years from 1952 to 2007.
To facilitate the merge, we also loaded the country codes data, which includes ISO 3166-1 country codes and so helps us merge the GapMinder data with the Olympic data.
Finally, we loaded host_cities.csv, which contains the host city and corresponding country for each Olympic Games.
Cleaning the Data
We repaired the noc_regions data by adding SGP as the preferred NOC for Singapore over SIN since 2016.
There were a few inconsistencies between the region names in noc_regions and those in country_codes. We corrected them by specifying the corrections manually in a list and then using that list to update the noc_regions data. Examples include Palestine being referred to as West Bank and Gaza, and misspellings such as Boliva instead of Bolivia, to name just a few.
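As an illustration of the correction-list idea, here is a minimal Python sketch (the report's own code is in R, in the .Rmd). The names NAME_CORRECTIONS and correct_regions are hypothetical, the sample rows are made up, and only the Boliva/Bolivia pair comes from the text above.

```python
# Hypothetical correction list: wrong or inconsistent name -> preferred name.
# Only the "Boliva" entry is taken from the report; a real list would be longer.
NAME_CORRECTIONS = {"Boliva": "Bolivia"}


def correct_regions(noc_regions, corrections):
    """Return a copy of the NOC -> region mapping with known bad names replaced."""
    return {noc: corrections.get(region, region) for noc, region in noc_regions.items()}


# Made-up sample of the NOC -> region mapping.
noc_regions = {"BOL": "Boliva", "FRA": "France"}
cleaned = correct_regions(noc_regions, NAME_CORRECTIONS)
```

Keeping the corrections in one explicit mapping makes every manual fix visible and easy to review.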
Merging the Data
First, we obtained the number of medals won by each country in each year/season combination from athlete_events.
Then, we extracted the city and year/season combination for each unique Olympic games.
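The two steps so far can be sketched in Python as follows (again, the actual analysis is in R). The rows are invented for illustration, and representing "no medal" as None is an assumption about how missing values are encoded.

```python
from collections import Counter

# Invented athlete-level rows in the spirit of athlete_events.csv.
rows = [
    {"Games": "1952 Summer", "City": "Helsinki", "NOC": "DEN", "Medal": "Gold"},
    {"Games": "1952 Summer", "City": "Helsinki", "NOC": "DEN", "Medal": None},
    {"Games": "1952 Summer", "City": "Helsinki", "NOC": "USA", "Medal": "Silver"},
    {"Games": "1952 Winter", "City": "Oslo", "NOC": "GER", "Medal": "Gold"},
    {"Games": "1952 Winter", "City": "Oslo", "NOC": "GER", "Medal": "Gold"},
]

# Step 1: count medals per (Games, NOC, medal type), skipping rows without a medal.
medal_counts = Counter(
    (r["Games"], r["NOC"], r["Medal"]) for r in rows if r["Medal"] is not None
)

# Step 2: one host city per Games (assumes each Games has a single city;
# if a Games listed several cities, this keeps only the last one seen).
host_cities = {r["Games"]: r["City"] for r in rows}
```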
After that, we calculated the total number of athletes by counting the distinct names in athlete_events, and then joined this with the two previous dataframes to get a comprehensive dataframe of medals won by each country in each year/season combination.
This, in itself, isn’t enough to answer the questions we have posed. We need to merge this data with the GapMinder data to include GDP and population. But first, we need to add the country codes from country_codes to the GapMinder data.
The medals dataframe we created earlier uses IOC country codes, while the GapMinder data uses ISO 3166-1 codes. To combine the two dataframes, we need to convert the IOC codes to ISO codes, which can be done using the countrycode package.
Since the GapMinder data only runs from 1952 to 2007, we can only compare GDP and population data in this timeframe.
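To show why the code conversion is needed, here is a small Python sketch with a handful of countries whose IOC and ISO 3166-1 alpha-3 codes differ. The report itself relies on the countrycode package in R. The lookup table and the to_iso3 helper are illustrative only, and falling back to the input code is a simplification that would not be safe for a complete conversion.

```python
# A few countries whose IOC code differs from the ISO 3166-1 alpha-3 code.
IOC_TO_ISO3 = {
    "GER": "DEU",  # Germany
    "SUI": "CHE",  # Switzerland
    "NED": "NLD",  # Netherlands
    "DEN": "DNK",  # Denmark
    "RSA": "ZAF",  # South Africa
}


def to_iso3(ioc_code):
    """Map an IOC code to ISO alpha-3; unknown codes pass through unchanged (simplification)."""
    return IOC_TO_ISO3.get(ioc_code, ioc_code)


converted = [to_iso3(code) for code in ["SUI", "USA", "GER"]]
```

Joining on the raw codes would silently drop countries such as Switzerland or Germany, which is why the conversion happens before the merge.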
Additionally, the GapMinder data is collected every 5 years, so to match it with the Olympics data, we interpolate between years by averaging the GapMinder values from the 3 years before and after the year in question.
- This preserves the GDP and population data for GapMinder years that coincide with an Olympic year and averages the data for Olympic years that fall between GapMinder years.
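One reading of this windowed average is sketched below in Python, assuming the window is inclusive (all GapMinder observations within 3 years of the target year). The function name window_mean and the GDP figures are hypothetical; the report's R code may differ in detail.

```python
def window_mean(observations, year, half_width=3):
    """Average the values observed within `half_width` years of `year` (inclusive).

    `observations` maps a survey year to a value; returns None if no
    observation falls inside the window.
    """
    nearby = [value for obs_year, value in observations.items()
              if abs(obs_year - year) <= half_width]
    return sum(nearby) / len(nearby) if nearby else None


# Hypothetical GDP figures on GapMinder's 5-year grid.
gdp = {1952: 100.0, 1957: 120.0, 1962: 150.0}

same_year = window_mean(gdp, 1952)  # only 1952 lies in 1949..1955
between = window_mean(gdp, 1960)    # 1957 and 1962 both lie in 1957..1963
one_sided = window_mean(gdp, 1956)  # only 1957 lies in 1953..1959
```

Under this reading, an Olympic year that coincides with a GapMinder year keeps its original value, while other years take the average of whichever observations fall in the window (sometimes only one).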
Finally, we merge the GapMinder data with the medals data to get a comprehensive dataframe that includes GDP and population data for each country along with the number of bronze, silver, and gold medals won by that country in each year/season combination. This is stored in the df_combined dataframe, and one random country for each Games is printed in the table below.
## # A tibble: 29 × 13
## # Groups: Games [29]
## Games NOC gold silver bronze Season Year num_athletes host_city region
## <chr> <chr> <int> <int> <int> <chr> <int> <int> <chr> <chr>
## 1 1952 Su… DEN 2 1 3 Summer 1952 129 Helsinki Denma…
## 2 1952 Wi… GER 3 2 2 Winter 1952 53 Oslo Germa…
## 3 1956 Su… USA 32 25 17 Summer 1956 305 Melbourne Unite…
## 4 1956 Wi… CAN 0 1 2 Winter 1956 35 Cortina … Canada
## 5 1960 Su… RSA 0 1 2 Summer 1960 55 Roma South…
## 6 1960 Wi… NED 0 1 1 Winter 1960 7 Squaw Va… Nethe…
## 7 1964 Su… ROU 2 4 6 Summer 1964 138 Tokyo Roman…
## 8 1964 Wi… AUT 4 5 3 Winter 1964 83 Innsbruck Austr…
## 9 1968 Su… BRA 0 1 2 Summer 1968 76 Mexico C… Brazil
## 10 1968 Wi… FRA 4 3 2 Winter 1968 75 Grenoble France
## 11 1972 Su… ETH 0 0 2 Summer 1972 31 Munich Ethio…
## 12 1972 Wi… ESP 1 0 0 Winter 1972 3 Sapporo Spain
## 13 1976 Su… PUR 0 0 1 Summer 1976 80 Montreal Puert…
## 14 1976 Wi… FRA 0 0 1 Winter 1976 35 Innsbruck France
## 15 1980 Su… IND 1 0 0 Summer 1980 71 Moskva India
## 16 1980 Wi… ITA 0 2 0 Winter 1980 46 Lake Pla… Italy
## 17 1984 Su… SUI 0 4 4 Summer 1984 129 Los Ange… Switz…
## 18 1984 Wi… SUI 2 2 1 Winter 1984 42 Sarajevo Switz…
## 19 1988 Su… AUT 1 0 0 Summer 1988 73 Seoul Austr…
## 20 1988 Wi… SUI 5 5 5 Winter 1988 70 Calgary Switz…
## 21 1992 Su… QAT 0 0 1 Summer 1992 28 Barcelona Qatar
## 22 1992 Wi… NZL 0 1 0 Winter 1992 6 Albertvi… New Z…
## 23 1994 Wi… UKR 1 0 1 Winter 1994 37 Lilleham… Ukrai…
## 24 1996 Su… JPN 3 6 5 Summer 1996 306 Atlanta Japan
## 25 1998 Wi… KOR 3 1 2 Winter 1998 37 Nagano Korea…
## 26 2000 Su… CZE 2 3 3 Summer 2000 119 Sydney Czech…
## 27 2002 Wi… RUS 5 4 4 Winter 2002 151 Salt Lak… Russia
## 28 2004 Su… CMR 1 0 0 Summer 2004 17 Athina Camer…
## 29 2006 Wi… SUI 5 4 5 Winter 2006 125 Torino Switz…
## # ℹ 3 more variables: iso3c <chr>, host_country <chr>, mean_gdp <dbl>
Are Host Countries More Likely to Win?
Another natural question is whether countries perform better in the years when they host the Olympics. The evidence very weakly hints that there might be a minuscule home-turf advantage, but there is no remotely statistically significant effect of that kind in this dataset.
## # A tibble: 8 × 4
## was_hosting medal_type boost_average boost_standard_dev
## <lgl> <chr> <dbl> <dbl>
## 1 TRUE Gold 0.12 0.398
## 2 FALSE Gold -0.000301 0.328
## 3 TRUE Silver 0.08 0.369
## 4 FALSE Silver -0.00196 0.354
## 5 TRUE Bronze 0.07 0.391
## 6 FALSE Bronze -0.00317 0.363
## 7 TRUE All 0.27 0.846
## 8 FALSE All -0.00543 0.632
In the above table, we computed, for every Olympic Games in which a country competed, how many more medals it won at those Games than at the previous or next Games of the same season in which it competed (we refer to this as the ‘boost’). If there were a host-country effect, we would expect the average boost for Games that a country hosted to be markedly higher than the average boost for Games that it did not host.
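A minimal Python sketch of one way to compute such a boost is shown below. It assumes the boost is the raw medal count minus the mean of the neighbouring same-season Games. The report does not spell out its exact scaling, so the numbers here are not comparable to the table, and the function name and data are hypothetical.

```python
from statistics import mean


def boosts(medal_counts):
    """For each Games, medal count minus the mean of the previous/next Games that exist.

    `medal_counts` is one country's totals, ordered by same-season Games.
    Returns None where a country has no neighbouring Games to compare with.
    """
    result = []
    for i, current in enumerate(medal_counts):
        neighbours = medal_counts[max(i - 1, 0):i] + medal_counts[i + 1:i + 2]
        result.append(current - mean(neighbours) if neighbours else None)
    return result


# Hypothetical country: three consecutive Summer Games, hosting the middle one.
counts = [10, 16, 12]
hosted = [False, True, False]

country_boosts = boosts(counts)
host_avg = mean([b for b, h in zip(country_boosts, hosted) if h and b is not None])
other_avg = mean([b for b, h in zip(country_boosts, hosted) if not h and b is not None])
```

In the full analysis the same averages would be taken over every country and Games, split by the hosting flag.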
However, for every type of medal (and for total medals won), the difference in average ‘boost’ between the host and non-host conditions was far less than a single standard deviation of the ‘boost’ variable, whether that standard deviation is taken over the performances where the country was the host or over those where it was not. As a rough yardstick for p<0.05 statistical significance (let alone more stringent standards like p<0.01), we would want the average boosts in the two conditions to be at least two standard deviations apart. A formal test would instead compare the gap with its standard error, which also depends on the number of Games in each condition.
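The arithmetic behind this comparison can be checked directly from the table. The Python sketch below copies the averages and standard deviations from the table and expresses each host/non-host gap in units of both standard deviations; the variable names are our own.

```python
# (boost_average, boost_standard_dev) copied from the table above.
table = {
    "Gold":   {"host": (0.12, 0.398), "other": (-0.000301, 0.328)},
    "Silver": {"host": (0.08, 0.369), "other": (-0.00196, 0.354)},
    "Bronze": {"host": (0.07, 0.391), "other": (-0.00317, 0.363)},
    "All":    {"host": (0.27, 0.846), "other": (-0.00543, 0.632)},
}

gap_in_sds = {}
for medal, cond in table.items():
    gap = cond["host"][0] - cond["other"][0]
    # Gap measured in host-condition SDs and in non-host-condition SDs.
    gap_in_sds[medal] = (gap / cond["host"][1], gap / cond["other"][1])

for medal, (vs_host_sd, vs_other_sd) in gap_in_sds.items():
    print(f"{medal}: {vs_host_sd:.2f} host SDs, {vs_other_sd:.2f} non-host SDs")
```

Every ratio comes out below one half, which is the sense in which the gap is far less than a single standard deviation.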
The above graph makes the lack of statistical significance quite clear. The middle bar of each crossbar box is the average boost for a given medal count in a given condition (i.e. when a country competed in an Olympic Games that it was also hosting vs. when it competed in a Games that it wasn’t hosting). The top and bottom of each box show the two-standard-deviation range around that average. These ranges, treated here as approximate 95% intervals, dwarf the difference in average boost between the two conditions.
Conclusion
In this report, we have analyzed the relationship between GDP, population, and Olympic success and broadly found that wealthier and more populous countries win more medals. However, we also found that there are exceptions to this rule and that there is no statistically significant home-turf advantage in the Olympics.