Search code examples
rduplicatesdata-manipulationanti-join

How to remove unmatched data from two data frames, to create a new data frame in R


I am creating a graph that correlates the life expectancy age and the state pension age for each country. I have used web scraping packages to scrape 2 datasets from 2 Wikipedia pages.

One of the datasets contains the column "Country" and the other dataset contains the column "Country and regions". This is a problem because both datasets need to merge but are unbalanced due to the regions in the "Country and regions" column.

To solve this, I need to remove the regions in "Country and regions", before merging the datasets, so it is balanced. I need to find the unmatched data from "Country and regions" with "Country", remove it, and create one data frame with the 2 datasets.

library(xml2)
library(rvest)
library(stringr)

urlLifeExpectancy <- "https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy"

extractedLifeData = urlLifeExpectancy %>%
  read_html() %>%
  html_node(xpath = '//*[@id="mw-content-text"]/div/table[1]') %>%
  html_table(fill = TRUE)

urlPensionAge <- "https://en.wikipedia.org/wiki/Retirement_age#Retirement_age_by_country"

extractedPensionData = urlPensionAge %>%
  read_html() %>%
  html_node(xpath = '//*[@id="mw-content-text"]/div/table[3]') %>%
  html_table(fill = TRUE)

Solution

  • We can use merge by selecting the columns which we need from both the datasets

    merge(extractedLifeData[c(1, 5, 7)], extractedPensionData[1:3], 
           by.y = "Country", by.x = "Country and regions")
    

    Or use inner_join from dplyr

    library(dplyr)
    
    extractedLifeData %>% select(1, 5, 7) %>%
         inner_join(extractedPensionData %>% select(1:3), 
                    by = c("Country and regions" = "Country"))