Tags: html, r, web-scraping, rvest

Parsing rvest output from an unstructured infobox


I am attempting to extract data from a Fandom wiki page using the rvest package in R. However, I am running into several issues because the infobox is not structured as an HTML table. Here is what I have tried so far:

library(tidyverse)
library(data.table)
library(rvest)
library(httr)

url <- c("https://starwars.fandom.com/wiki/Anakin_Skywalker")

#See here that the infobox information does not appear when checking for HTML tables in the page
df <- read_html(url) %>%
  html_table()

#So now just extract data using the CSS selector
df <- read_html(url) %>%
  html_element("aside") %>%
  html_text2()

The second attempt does extract the raw text, but it comes back as one block of text that is hard to reshape into a clean dataframe. So I then tried to extract each section of the infobox individually, which might be easier to clean and structure. However, when I attempt to do so using the XPath, I get an empty result:

df <- read_html(url) %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/aside/section[1]') %>%
  html_text2() 

So I suppose my question is primarily: does anyone know of a good way to automatically extract the infobox in a dataframe-friendly format? If not, could someone point me towards why my attempt to extract each panel individually is not working?


Solution

  • If you target the div.pi-data elements directly, you could do something like this:

    bind_rows(
      read_html(url) %>%
        rvest::html_nodes("div.pi-data") %>% 
        map(.f = ~tibble(
          label = html_elements(.x, ".pi-data-label") %>% html_text2(),
          text = html_elements(.x, ".pi-data-value") %>% html_text2() %>% strsplit(split = "\n")
        ) %>% unnest(text)
        )
    )
    

    Output:

    # A tibble: 29 x 2
       label      text                                                              
       <chr>      <chr>                                                             
     1 Homeworld  Tatooine[1]                                                       
     2 Born       41 BBY,[2] Tatooine[3]                                            
     3 Died       4 ABY,[4]DS-2 Death Star II Mobile Battle Station, Endor system[5]
     4 Species    Human[1]                                                          
     5 Gender     Male[1]                                                           
     6 Height     1.88 meters,[1] later 2.03 meters (6 ft, 8 in) in armor[6]        
     7 Mass       120 kilograms in armor[7]                                         
     8 Hair color Blond,[8] light[9] and dark[10]                                   
     9 Eye color  Blue,[11] later yellow (dark side)[12]                            
    10 Skin color Light,[11] later pale[5]                                          
    # ... with 19 more rows
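
If you would rather have one row per page with one column per label, the long label/text result can be pivoted. The following is only a minimal sketch: the inline HTML is a hypothetical stand-in for the infobox markup (not copied from the site), and the names page, infobox_long, infobox_wide, xp and n_xpath are made up for illustration. It also shows a class-based XPath, since a positional XPath copied from the browser can come back empty; one possible reason, not verified against this page, is that the browser-rendered DOM differs from the HTML that read_html() downloads.

```r
library(rvest)   # also re-exports %>%
library(purrr)
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical stand-in for the infobox markup (an assumption; the real page may differ).
page <- minimal_html('
  <aside class="portable-infobox">
    <section class="pi-group">
      <div class="pi-item pi-data">
        <h3 class="pi-data-label">Homeworld</h3>
        <div class="pi-data-value">Tatooine</div>
      </div>
      <div class="pi-item pi-data">
        <h3 class="pi-data-label">Species</h3>
        <div class="pi-data-value">Human</div>
      </div>
    </section>
  </aside>
')

# One row per infobox entry. Using html_element() (singular) keeps exactly one
# label and one value per entry, so the two columns always line up.
infobox_long <- page %>%
  html_elements("div.pi-data") %>%
  map(~ tibble(
    label = html_text2(html_element(.x, ".pi-data-label")),
    text  = html_text2(html_element(.x, ".pi-data-value"))
  )) %>%
  bind_rows()

# One column per label, one row per page.
infobox_wide <- pivot_wider(infobox_long, names_from = label, values_from = text)

# Class-based XPath equivalent of "div.pi-data". The concat/normalize-space idiom
# matches the class token exactly, so "pi-data-value" is not picked up.
xp <- '//aside//div[contains(concat(" ", normalize-space(@class), " "), " pi-data ")]'
n_xpath <- length(html_elements(page, xpath = xp))
```

Note that this sketch keeps each value as a single string. If the values are split on "\n" as in the answer above, a label can occur more than once, and pivot_wider() would then produce list-columns unless the pieces are collapsed first.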