
Splitting character string inside a List and deleting the second element in R


I am having problems working with a data frame in R. I downloaded the World Happiness Report table from Wikipedia: https://de.wikipedia.org/wiki/World_Happiness_Report However, every country name appears twice in the resulting table, e.g. "China China". How do I get rid of the duplicate?

This is my code:

library(tidyverse)
library(rvest)
library(stringr)

link <- "https://de.wikipedia.org/wiki/World_Happiness_Report"

df <- read_html(link) %>%
  html_element("table.wikitable") %>%
  html_table()

This is what I tried so far:

df$Land <- lapply(df$Land, function(x) unique(str_split(x, " ")[[1]]))

However, if I print "df", I get " <chr [1]> " in the "Land" column. If I print just df$Land, I get all the country names exactly as they were at the beginning: [[1]] [1] "Finnland Finnland"

Almost the same thing happens when I try:

df$Land <- unique(str_split(df$Land, " +")[[1]])

Can someone please help me or show me a place where the question has already been answered? Thank you


Solution

  • If you look at the Wikipedia page, you can see that these aren't duplicates. The extra text is the name of the icon in that column, which is enclosed in <span> tags, and html_table() coerces it to text along with the country name. You could remove the icons from the data before you apply html_table():

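    library(rvest)  # provides html_nodes(), html_table() and the %>% pipe
    library(xml2)
    link <- "https://de.wikipedia.org/wiki/World_Happiness_Report"
    raw_data <- read_html(link)
    # Select the <span> elements in table cells that hold the icon names
    spans <- raw_data %>%
      html_nodes(xpath = "//*/tr/td/span")
    # xml_remove() modifies raw_data in place
    xml2::xml_remove(spans)
    df <- raw_data %>%
      html_nodes(xpath = '//*[@id="mw-content-text"]/div/table') %>%
      html_table()
    # html_table() returns a list of tables here; take the first one
    df[[1]]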
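
If you would rather clean the column after scraping, a regular expression with a backreference is one possible alternative. This sketch is not part of the original answer: the `land` vector is made-up sample data, `dedupe` is a hypothetical helper name, and the pattern assumes the two copies are separated by a single ordinary space. If the page uses a different whitespace character, the pattern would need adjusting.

```r
# Hypothetical sample data standing in for the scraped "Land" column.
land <- c("Finnland Finnland",
          "Vereinigte Staaten Vereinigte Staaten",
          "Costa Rica")

# Collapse "X X" to "X". The group (.+) captures the first copy and \\1
# requires the exact same text again after one space. Strings that are
# not an exact repetition do not match and are returned unchanged.
dedupe <- function(x) sub("^(.+) \\1$", "\\1", x, perl = TRUE)

result <- dedupe(land)
print(result)
```

Applied to the scraped table, this would look like `df$Land <- dedupe(df$Land)`. Because sub() is vectorised and returns a character vector, "Land" stays an ordinary character column, whereas assigning the result of lapply() turns it into a list column (hence the `<chr [1]>` display).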