I am having problems working with a data frame in R. I downloaded the World Happiness Report table from Wikipedia: https://de.wikipedia.org/wiki/World_Happiness_Report However, every country name appears twice in the resulting table, e.g. "China China". How do I get rid of the duplicate?
This is my code:
library(tidyverse)
library(rvest)
library(stringr)
link <- "https://de.wikipedia.org/wiki/World_Happiness_Report"
df <- read_html(link) %>%
  html_element("table.wikitable") %>%
  html_table()
This is what I tried so far:
df$Land <- lapply(df$Land, function(x) unique(str_split(x, " ")[[1]]))
However, if I print df, I get <chr [1]> in the Land column. If I just print df$Land, I get all the country names just as they were at the beginning: [[1]] [1] "Finnland Finnland"
Almost the same happens when I try:
df$Land <- unique(str_split(df$Land, " +")[[1]])
Can someone please help me or show me a place where the question has already been answered? Thank you
If you look at the Wikipedia page, you can see these aren't duplicates but the names of the icons in that column, which are enclosed in <span> tags. These are then coerced to text by html_table(). You could remove the icons from the data before you apply html_table():
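library(rvest)  # provides read_html(), html_nodes(), html_table() and %>%
library(xml2)
link <- "https://de.wikipedia.org/wiki/World_Happiness_Report"
raw_data <- read_html(link)
# Select the <span> elements inside table cells (the icon labels)
spans <- raw_data %>%
  html_nodes(xpath = "//*/tr/td/span")
# xml_remove() modifies raw_data in place, so no reassignment is needed
xml2::xml_remove(spans)
df <- raw_data %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table') %>%
  html_table()
# html_table() on a node set returns a list; the first element is the table
df[[1]]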
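If you would rather clean the column after scraping, a string-level fix is also possible. The lapply() attempt in the question turns Land into a list column, because lapply() always returns a list, which is why the tibble prints <chr [1]>. The sketch below instead uses a vectorized regex with a backreference. The sample vector is made up for illustration. Matching the separator with \\s+ rather than a literal space is an assumption: the output in the question suggests the separator on the real page may not be an ordinary space (possibly a non-breaking one).

```r
library(stringr)

# Hypothetical sample mimicking the scraped column; the second entry uses a
# non-breaking space (U+00A0) as separator, which is an assumption about the page
land <- c("Finnland Finnland",
          "China\u00a0China",
          "Vereinigte Staaten Vereinigte Staaten",
          "Schweiz")

# (.+) captures the name, \\s+ matches the separator, and \\1 requires the
# captured name to repeat; entries that are not doubled are left unchanged
land_clean <- str_replace(land, "^(.+)\\s+\\1$", "\\1")
land_clean
```

Applied to the real data this would be df$Land <- str_replace(df$Land, "^(.+)\\s+\\1$", "\\1"). It keeps Land as a plain character vector, but removing the <span> nodes before html_table() addresses the cause rather than the symptom.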