What is/are the corresponding identifier variables?
Are the identifier variables in common?
Or do they have to be added/transformed to match?
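One quick way to answer these questions is to compare the column names of the two data frames before merging. The sketch below uses two small made-up data frames (`df_a`, `df_b`, and all of their values are hypothetical stand-ins, not the course data).

```r
# Two toy data frames standing in for the data sets to be merged (hypothetical values)
df_a <- data.frame(country_id = c(1, 2), year = c(2000, 2000), polyarchy = c(0.9, 0.6))
df_b <- data.frame(iso3c = c("SWE", "MEX"), year = c(2000, 2000), flfp = c(58, 38))

# Columns the two data frames share: candidate identifier variables
common_vars <- intersect(names(df_a), names(df_b))
common_vars

# Columns that appear only in the first data frame
only_in_a <- setdiff(names(df_a), names(df_b))
only_in_a
```

Here only `year` is shared, so a common country identifier would have to be added to one of the data frames before joining.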
Merging WB and V-Dem Data
These are both time-series, country-level data
Need to merge by country-year
Year is easy
But there are many different country codes
Can use countrycode package to assign country codes
countrycode example
# Load countrycode
library(countrycode)

# Create new iso3c variable
democracy <- democracy |>
  mutate(iso3c = countrycode(
    sourcevar = vdem_ctry_id, # what we are converting
    origin = "vdem",          # we are converting from vdem
    destination = "wb"        # and converting to the WB iso3c code
  )) |>
  relocate(iso3c, .after = vdem_ctry_id) # move iso3c

# View the data
glimpse(democracy)
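After converting codes, it can be worth checking which entries failed to match, since countrycode() returns NA for values it cannot convert. A minimal sketch, assuming a toy vector of country names (`ctry_names` is hypothetical, and "Atlantis" is deliberately not a real country):

```r
library(countrycode)

# Toy vector of country names; "Atlantis" is deliberately unmatched
ctry_names <- c("Sweden", "Spain", "Atlantis")

# Convert names to ISO3C codes; warn = FALSE silences the unmatched-value warning
codes <- countrycode(ctry_names,
                     origin = "country.name",
                     destination = "iso3c",
                     warn = FALSE)

# Values that could not be matched come back as NA
unmatched <- ctry_names[is.na(codes)]
unmatched
```

Rows with an NA code will not find a partner in the merge, so it helps to look at them before joining.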
Try it Yourself
Using your democracy data frame from the last lesson
Use mutate() and countrycode() to add iso3c country codes
Use relocate() to move your iso3c code to the “front” of your data frame (optional)
Types of joins in dplyr
Mutating versus filtering joins
Four types of mutating joins
inner_join()
full_join()
left_join()
right_join()
For the most part we will use left_join()
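The difference between these joins is easiest to see on a small example. The sketch below uses two made-up tibbles (`dem` and `wb` and their values are hypothetical) that share two countries and each have one country the other lacks.

```r
library(dplyr)

# Toy data: NZL appears only in dem, DEU appears only in wb (made-up values)
dem <- tibble(iso3c = c("SWE", "ESP", "NZL"), year = 2000, polyarchy = c(0.91, 0.87, 0.89))
wb  <- tibble(iso3c = c("SWE", "ESP", "DEU"), year = 2000, flfp = c(58, 40, 49))

keys <- c("iso3c", "year")

# Mutating joins: add columns from wb to dem
inner <- inner_join(dem, wb, by = keys) # only rows found in both
full  <- full_join(dem, wb, by = keys)  # all rows from either
left  <- left_join(dem, wb, by = keys)  # all rows of dem; flfp is NA for NZL
right <- right_join(dem, wb, by = keys) # all rows of wb; polyarchy is NA for DEU

# Filtering joins: keep or drop rows of dem, without adding columns
matched   <- semi_join(dem, wb, by = keys) # rows of dem with a match in wb
unmatched <- anti_join(dem, wb, by = keys) # rows of dem with no match in wb

c(inner = nrow(inner), full = nrow(full), left = nrow(left), right = nrow(right))
```

left_join() keeps every row of the first data frame, which is why it is a natural default when one data set is treated as the "base" and the other supplies extra variables.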
left_join() example
# Load readr
library(readr)

# Perform left join using common iso3c variable and year
dem_women <- left_join(democracy, women_emp, by = c("iso3c", "year")) |>
  rename(country = country.x) |> # rename country.x
  select(!country.y)             # crop country.y

# Save as .csv for future use
write_csv(dem_women, "data/dem_women.csv")

# View the data
glimpse(dem_women)
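The country.x and country.y columns appear because both data frames contain a `country` column that is not part of the join key, so dplyr adds suffixes to keep them apart. A small sketch with made-up tibbles (`dem` and `wb` are hypothetical) shows where the suffixes come from and how the cleanup step works:

```r
library(dplyr)

# Both toy tibbles carry a country name column in addition to the join keys
dem <- tibble(country = c("Sweden", "Spain"), iso3c = c("SWE", "ESP"),
              year = 2000, polyarchy = c(0.91, 0.87))
wb  <- tibble(country = c("Sweden", "Spain"), iso3c = c("SWE", "ESP"),
              year = 2000, flfp = c(58, 40))

# The non-key column shared by both inputs gets .x / .y suffixes
joined <- left_join(dem, wb, by = c("iso3c", "year"))
names(joined)

# Keep one country name and drop the duplicate
cleaned <- joined |>
  rename(country = country.x) |>
  select(!country.y)
names(cleaned)
```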
Try it Yourself
Take your V-Dem data frame and your World Bank data frame
Use left_join() to merge on country code and year
Along the way, use rename() and select() to ensure you have just one country name
Group, Summarize and Arrange
group_by(), summarize(), arrange()
A very common sequence in data science:
Take an average or some other statistic for a group
Rank from high to low values of summary value
Example
# group_by(), summarize() and arrange()
dem_summary <- dem_women |> # save result as new object
  group_by(region) |> # group dem_women data by region
  summarize( # summarize following vars (by region)
    polyarchy = mean(polyarchy, na.rm = TRUE), # calculate mean, remove NAs
    gdp_pc = mean(gdp_pc, na.rm = TRUE),
    flfp = mean(flfp, na.rm = TRUE),
    women_rep = mean(women_rep, na.rm = TRUE)
  ) |>
  arrange(desc(polyarchy)) # arrange in descending order by polyarchy score

# Save as .csv for future use
write_csv(dem_summary, "data/dem_summary.csv")

# View the data
glimpse(dem_summary)
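The same pattern works with summary functions other than mean(). The sketch below is only an illustration on made-up data (`toy`, `score`, and the region labels are hypothetical), using median(), max(), sd() and IQR() inside summarize():

```r
library(dplyr)

# Toy data: two groups with three observations each (made-up values)
toy <- tibble(
  region = c("A", "A", "A", "B", "B", "B"),
  score  = c(1, 2, 3, 10, 20, 60)
)

toy_summary <- toy |>
  group_by(region) |>
  summarize(
    score_median = median(score, na.rm = TRUE), # middle value
    score_max    = max(score, na.rm = TRUE),    # largest value
    score_sd     = sd(score, na.rm = TRUE),     # standard deviation
    score_iqr    = IQR(score, na.rm = TRUE)     # interquartile range
  ) |>
  arrange(desc(score_median)) # highest median first

toy_summary
```

Dropping desc() from the arrange() call sorts in ascending order instead.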
Try it Yourself
Try running a group_by(), summarize() and arrange() in your Quarto document
Try changing the parameters to answer these questions:
Try summarizing the data with a different function for one or more of the variables.
What is the median value of polyarchy for The West?
What is the max value of gdp_pc for Eastern Europe?
What is the standard deviation of flfp for Africa?
What is the interquartile range of women_rep for the Middle East?
Now try grouping by country instead of region.
What is the median value of polyarchy for Sweden?
What is the max value of gdp_pc for New Zealand?
What is the standard deviation of flfp for Spain?
What is the interquartile range of women_rep for Germany?
Sort countries in descending order based on the mean value of gdp_pc (instead of the median value of polyarchy). Which country ranks first based on this sorting?
Now try sorting countries in ascending order based on the median value of women_rep (hint: delete “desc” from the arrange() call). Which country ranks at the “top” of the list?