r web-scraping rvest

Extract hierarchical information from headers (h2, h3) and tables with rvest


On this website https://www.quebec.ca/agriculture-environnement-et-ressources-naturelles/faune/gestion-faune-habitats-fauniques/especes-fauniques-menacees-vulnerables/liste, there are tables of species that I'd like to extract.

library(rvest)
sp.list = "https://www.quebec.ca/agriculture-environnement-et-ressources-naturelles/faune/gestion-faune-habitats-fauniques/especes-fauniques-menacees-vulnerables/liste"
# Get website
wp.list = read_html(sp.list)
# Extract name of sections
headers = wp.list %>%  html_elements("h3") %>% html_text2() %>%   .[1:24]
# Get tables
tab = read_html(sp.list) %>% html_table(header = TRUE)
# Name tables
names(tab) = headers
# Combine tables
tab.gr = dplyr::bind_rows(tab, .id = "group")

Which gives:

tab.gr
# A tibble: 180 × 3
   group      Espèce                     `Nom latin`             
   <chr>      <chr>                      <chr>                   
 1 Mollusques Anodonte du gaspareau      Utterbackiana implicata 
 2 Mollusques Obovarie olivâtre          Obovaria olivaria       
 3 Insectes   Bourdon à tache rousse     Bombus affinis          
 4 Insectes   Coccinelle à neuf points   Coccinella novemnotata

I was able to get the h2 section headers, but I'm not able to associate them with each h3 section:

get.section = wp.list %>%  html_nodes('.frame, .frame-default, .frame-type-textmedia, .frame-layout-0')
pas.dans.cette.page = !grepl(pattern = "Dans cette", x = get.section)
subset.listes = get.section[pas.dans.cette.page]
sections.tables = subset.listes[grep(pattern = "Liste des esp", x = subset.listes)]
sections.tables %>%  html_elements("h2") %>%   html_text2()
[1] "Liste des espèces menacées"                                                   
[2] "Liste des espèces vulnérables"                                                
[3] "Liste des espèces susceptibles d’être désignées comme menacées ou vulnérables"

How then could I get the header (e.g., "Liste des espèces menacées") and its groups (e.g., "Mollusques") with their tables?


Solution

  • rvest is built on top of xml2, so knowing some XPath and a few (somewhat unintuitive) xml2 tricks can be handy here. For example, we can build a vector of sections that aligns with our list of table elements by searching for the <h2> element that precedes each table, basically using the table elements as anchor points and traversing back up the HTML tree from each of them. Because the page structure changes for the last section, we need to adjust tactics a bit for those last tables, but the same strategy still applies.

    Another option would be to iterate through each section (i.e. process only the tables in that specific section), but because of that structural change it's a bit less suitable here.

    library(rvest)
    library(dplyr)
    library(purrr)
    
    url_ <- "https://www.quebec.ca/agriculture-environnement-et-ressources-naturelles/faune/gestion-faune-habitats-fauniques/especes-fauniques-menacees-vulnerables/liste"
    html <- read_html(url_)
    
    # table elements
    table_elements <- html_elements(html, "table")
    
    # for each table element, find an ancestor with h2, extract that
    # some will be missing (NA), but the resulting vector still
    # aligns with table element list
    sections <- 
      html_element(table_elements, xpath = "./ancestor::div[h2]/h2") |> 
      html_text(trim = TRUE)
    
    # for the last section page structure changes; for those tables, find
    # ancestor that contains a class with 'frame', from there find a 
    # preceding sibling that includes h2
    sections[is.na(sections)] <- 
      table_elements[is.na(sections)] |> 
      html_element(xpath = "./ancestor::div[contains(@class, 'frame')]/preceding-sibling::div[h2][1]/h2") |> 
      html_text(trim = TRUE)
    
    # captions are included in table elements
    groups_ <- html_element(table_elements, "caption") |> html_text()
    
    # iterate over all three objects in parallel (element-wise)
    pmap(list(table_elements, sections, groups_), 
         \(t_, s_, g_) html_table(t_) |> mutate(section = s_, group = g_, .before = 1)) |>
      list_rbind()
    

    Result:

    #> # A tibble: 180 × 4
    #>    section                    group      Espèce                     `Nom latin` 
    #>    <chr>                      <chr>      <chr>                      <chr>       
    #>  1 Liste des espèces menacées Mollusques Anodonte du gaspareau      Utterbackia…
    #>  2 Liste des espèces menacées Mollusques Obovarie olivâtre          Obovaria ol…
    #>  3 Liste des espèces menacées Insectes   Bourdon à tache rousse     Bombus affi…
    #>  4 Liste des espèces menacées Insectes   Coccinelle à neuf points   Coccinella …
    #>  5 Liste des espèces menacées Insectes   Cuivré des marais salés    Lycaena dos…
    #>  6 Liste des espèces menacées Insectes   Satyre fauve des Maritimes Coenonympha…
    #>  7 Liste des espèces menacées Poissons   Chabot de profondeur       Myoxocephal…
    #>  8 Liste des espèces menacées Poissons   Chevalier cuivré           Moxostoma h…
    #>  9 Liste des espèces menacées Poissons   Cisco de printemps         Coregonus a…
    #> 10 Liste des espèces menacées Poissons   Dard de sable              Ammocrypta …
    #> # ℹ 170 more rows
    

    Created on 2024-03-14 with reprex v2.1.0
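
To try the two XPath lookups without hitting the live site, here is a minimal offline sketch. The markup passed to `minimal_html()` is a made-up toy page, not the real quebec.ca HTML; it only assumes the same two layouts described above (an `<h2>` sharing a `<div>` with its tables, and an `<h2>` sitting in an earlier sibling "frame" `<div>`). The names `doc`, `tables`, `rows` and `result` are illustrative.

```r
library(rvest)

# Toy page (assumed structure, NOT the real site's markup):
# - Section A: <h2> and its tables share one <div>
# - Section B: <h2> is in one "frame" <div>, the table in a later sibling <div>
doc <- minimal_html('
  <div class="frame">
    <h2>Section A</h2>
    <table><caption>Group 1</caption><tr><th>x</th></tr><tr><td>a</td></tr></table>
    <table><caption>Group 2</caption><tr><th>x</th></tr><tr><td>b</td></tr></table>
  </div>
  <div class="frame"><h2>Section B</h2></div>
  <div class="frame">
    <table><caption>Group 3</caption><tr><th>x</th></tr><tr><td>c</td></tr></table>
  </div>
')

tables <- html_elements(doc, "table")

# Step 1: from each table, walk up to an ancestor <div> that has an <h2> child.
# Tables without such an ancestor yield NA, so the vector stays aligned.
sections <- html_element(tables, xpath = "./ancestor::div[h2]/h2") |>
  html_text(trim = TRUE)

# Step 2: for the NA cases, go up to the enclosing "frame" <div>, then take
# the nearest preceding sibling <div> that contains an <h2>.
missing <- is.na(sections)
sections[missing] <- tables[missing] |>
  html_element(xpath = "./ancestor::div[contains(@class, 'frame')]/preceding-sibling::div[h2][1]/h2") |>
  html_text(trim = TRUE)

# Captions give the group name for each table.
groups <- html_element(tables, "caption") |> html_text(trim = TRUE)

# Attach section and group to each parsed table, then stack the tables.
rows <- lapply(seq_along(tables), function(i) {
  cbind(section = sections[i], group = groups[i],
        as.data.frame(html_table(tables[[i]])))
})
result <- do.call(rbind, rows)
result
```

Because `preceding-sibling` is a reverse axis, the `[1]` predicate picks the closest earlier `<div>` with an `<h2>`, which is what keeps the third table from being attributed to the first section.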