I am creating a graph that correlates the life expectancy age and the state pension age for each country. I have used web scraping packages to scrape 2 datasets from 2 Wikipedia pages.
One of the datasets contains the column "Country" and the other dataset contains the column "Country and regions". This is a problem because both datasets need to merge but are unbalanced due to the regions in the "Country and regions" column.
To solve this, I need to remove the regions in "Country and regions", before merging the datasets, so it is balanced. I need to find the unmatched data from "Country and regions" with "Country", remove it, and create one data frame with the 2 datasets.
library(xml2)
library(rvest)
library(stringr)
urlLifeExpectancy <- "https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy"
extractedLifeData = urlLifeExpectancy %>%
read_html() %>%
html_node(xpath = '//*[@id="mw-content-text"]/div/table[1]') %>%
html_table(fill = TRUE)
urlPensionAge <- "https://en.wikipedia.org/wiki/Retirement_age#Retirement_age_by_country"
extractedPensionData = urlPensionAge %>%
read_html() %>%
html_node(xpath = '//*[@id="mw-content-text"]/div/table[3]') %>%
html_table(fill = TRUE)
We can use merge
by selecting the columns which we need from both the datasets
merge(extractedLifeData[c(1, 5, 7)], extractedPensionData[1:3],
by.y = "Country", by.x = "Country and regions")
Or use inner_join
from dplyr
library(dplyr)
extractedLifeData %>% select(1, 5, 7) %>%
inner_join(extractedPensionData %>% select(1:3),
by = c("Country and regions" = "Country"))