A couple of months ago, I built a visualization project that involved merging three data sets.
- The College of William and Mary maintains an Aid database that shows international aid from and to both countries and agencies.
- I added Gapminder's information about population so that I could look at per capita donations
- And I used a Country/Continent listing to consolidate data by continent
I also used the last data set as part of an inner join that was designed to eliminate all non-governmental donors from my final data.
I was curious about inter-governmental transfers, because countries spend a lot of time in the media bragging about how much they are doing for international aid, and I wanted to see what was true. (For some preliminary answers, which may come with more questions, check the original post. It contains an interactive Tableau visualization for drilling down by country, aid type, and decade.)
At the beginning of mapping country names across all three files, I had 665,317 lines in my final merged file that I was using to add population data to the original Aid data. I was very pleased with my outcomes, and I gathered my family in the living room to show them on the big television. My husband said, "Oh, didn't you have any data for the Democratic Republic of Congo?" And I said, "Oh. I think I need to take another look."
It turned out that the three datasets used different abbreviations for the longer country names, so the inner joins were eliminating entire countries from the final data.
This process was a bit time consuming; it probably took me six hours to track down the various countries across all the data sets, partly due to the size of the data and how long it took to do the loads and merges on my 2014 MacBook Air. It was definitely worth the effort though, as I will show below.
I opened up three different views of the data, because I needed to be able to see what was in the original set (Excel), what my mapping looked like (Python), and how they were matching in the final visualization (Tableau). Poor old computer.
To get back a view of all the original data in Tableau, I changed my "inner" join to a left join. I made a sheet that was just a table showing me all the original recipients and donors, whether or not they had been included in the final merged file.
This was a point where visual inspection was absolutely necessary: In one of the original files, the apostrophe in Cote d'Ivoire was encoded as a backtick... presumably because somebody had been using single quotes in their code and didn't want to escape them? Somebody, somewhere, at some point in the past made that decision anyway. This discovery recovered ~ 5000 lines of missing data for the final visualization.
I'm afraid that it only occurred to me to start tracking the specific outcomes near the end of the day, so I only have the alphabetical tail of the data to show you...
Venezuela: Missing 4000 rows totaling $48B
Yemen: 7600 rows, $29B
Yugoslavia: 185, $19B (Much of this came from the World Bank, and so it won't show up in this particular visualization, since it is about the relationships between governments.)
End of Process
In the end, however, I had 792,541 lines in my final data set, compared with the 665,317 I started with. This means that my mapping process resulted in 127,224 additional rows, or 19% more data than I originally was working with.
Two notes - requests for insights
I still am unclear on which abbreviations map correctly to "Korea" in the two larger datasets... The two countries are the "Republic of Korea" and the "Democratic People's Republic of Korea," but one of my datasets has an abbreviation for the "Democratic Republic of Korea." I wasn't sure what to do with that, even after several searches. In the end, I treated all entries labelled "Korea" in any way as Korea, Republic of... This added 16,000 lines to the output file, although Tableau's map calls it South Korea, so it still doesn't show up on the map.
Also, depsite my best efforts, I cannot get Poland to appear in the final data set. My assumption is that there is a unicode problem, probably involving a Cyrillic set of characters that my applications can translate to my fonts, but Python is treating more strictly. I tried using Regular Expressions. I tried matching only on the beginning of the country name. I tried using different encoding. I spent several hours trying to get Poland to show up, but it is always eliminated in the Python mapping step. If anybody has additional suggestions, I would be grateful.