On this website https://www.quebec.ca/agriculture-environnement-et-ressources-naturelles/faune/gestion-faune-habitats-fauniques/especes-fauniques-menacees-vulnerables/liste, there are tables of species that I'd like to extract.
library(rvest)
sp.list = "https://www.quebec.ca/agriculture-environnement-et-ressources-naturelles/faune/gestion-faune-habitats-fauniques/especes-fauniques-menacees-vulnerables/liste"
# Get website
wp.list = read_html(sp.list)
# Extract name of sections
headers = wp.list %>% html_elements("h3") %>% html_text2() %>% .[1:24]
# Get tables
tab = read_html(sp.list) %>% html_table(header = TRUE)
# Name tables
names(tab) = headers
# Combine tables
tab.gr = dplyr::bind_rows(tab, .id = "group")
Which gives:
tab.gr
# A tibble: 180 × 3
group Espèce `Nom latin`
<chr> <chr> <chr>
1 Mollusques Anodonte du gaspareau Utterbackiana implicata
2 Mollusques Obovarie olivâtre Obovaria olivaria
3 Insectes Bourdon à tache rousse Bombus affinis
4 Insectes Coccinelle à neuf points Coccinella novemnotata
I was able to get the section headers (h2), but I'm not able to associate them with each of the h3 sections:
get.section = wp.list %>% html_nodes('.frame, .frame-default, .frame-type-textmedia, .frame-layout-0')
pas.dans.cette.page = !grepl(pattern = "Dans cette", x = get.section)
subset.listes = get.section[pas.dans.cette.page]
sections.tables = subset.listes[grep(pattern = "Liste des esp", x = subset.listes)]
sections.tables %>% html_elements("h2") %>% html_text2()
[1] "Liste des espèces menacées"
[2] "Liste des espèces vulnérables"
[3] "Liste des espèces susceptibles d’être désignées comme menacées ou vulnérables"
How then could I get the header (e.g., "Liste des espèces menacées") and its groups (e.g., "Mollusques") with their tables?
rvest is built on top of xml2, so knowing some XPath and a few (somewhat unintuitive) xml2 tricks can be handy here. For example, we can build a vector of sections that matches our list of table elements by searching for the <h2> element that precedes each of those tables, basically using the table elements as anchor points and traversing back up the HTML tree from each of them. As the page structure changes for the last section, we need to adjust tactics a bit for those last tables, but the same strategy still applies.
Another option would be iterating through each section (i.e. processing only the tables in that specific section), but because of that structural change it's a bit less suitable here.
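For comparison, a section-by-section loop might look roughly like the sketch below. It runs on a hypothetical toy page built with minimal_html(), where every section is a div that holds both its h2 and its tables; the markup, section names, and cell values are made up for illustration, and the real page would need extra handling for its differently structured last section.

```r
library(rvest)
library(dplyr)
library(purrr)

# Hypothetical toy page (not the real site): each section div holds its own
# h2 and the tables that belong to it
toy <- minimal_html('
  <div><h2>Section A</h2>
    <table><caption>Group 1</caption>
      <tr><th>Name</th></tr><tr><td>a1</td></tr><tr><td>a2</td></tr></table>
    <table><caption>Group 2</caption>
      <tr><th>Name</th></tr><tr><td>a3</td></tr></table>
  </div>
  <div><h2>Section B</h2>
    <table><caption>Group 3</caption>
      <tr><th>Name</th></tr><tr><td>b1</td></tr></table>
  </div>
')

# every div that has an h2 child and contains at least one table
section_divs <- html_elements(toy, xpath = "//div[h2][.//table]")

by_section <- map(section_divs, \(div_) {
  s_ <- html_element(div_, "h2") |> html_text(trim = TRUE)
  tables_ <- html_elements(div_, "table")
  groups_ <- html_element(tables_, "caption") |> html_text(trim = TRUE)
  # only the tables of this one section are processed here
  map2(tables_, groups_,
       \(t_, g_) html_table(t_) |> mutate(section = s_, group = g_, .before = 1)) |>
    list_rbind()
}) |>
  list_rbind()

by_section
```

The code that follows uses the anchor-based approach instead, since it copes with the irregular last section with only one extra XPath expression.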
library(rvest)
library(dplyr)
library(purrr)
url_ <- "https://www.quebec.ca/agriculture-environnement-et-ressources-naturelles/faune/gestion-faune-habitats-fauniques/especes-fauniques-menacees-vulnerables/liste"
html <- read_html(url_)
# table elements
table_elements <- html_elements(html, "table")
# for each table element, find an ancestor with h2, extract that
# some will be missing (NA), but the resulting vector still
# aligns with table element list
sections <-
  html_element(table_elements, xpath = "./ancestor::div[h2]/h2") |>
  html_text(trim = TRUE)
# for the last section page structure changes; for those tables, find
# ancestor that contains a class with 'frame', from there find a
# preceding sibling that includes h2
sections[is.na(sections)] <-
  table_elements[is.na(sections)] |>
  html_element(xpath = "./ancestor::div[contains(@class, 'frame')]/preceding-sibling::div[h2][1]/h2") |>
  html_text(trim = TRUE)
# captions are included in table elements
groups_ <- html_element(table_elements, "caption") |> html_text()
# process all previous objects in parallel
pmap(list(table_elements, sections, groups_),
     \(t_, s_, g_) html_table(t_) |> mutate(section = s_, group = g_, .before = 1)) |>
  list_rbind()
Result:
#> # A tibble: 180 × 4
#> section group Espèce `Nom latin`
#> <chr> <chr> <chr> <chr>
#> 1 Liste des espèces menacées Mollusques Anodonte du gaspareau Utterbackia…
#> 2 Liste des espèces menacées Mollusques Obovarie olivâtre Obovaria ol…
#> 3 Liste des espèces menacées Insectes Bourdon à tache rousse Bombus affi…
#> 4 Liste des espèces menacées Insectes Coccinelle à neuf points Coccinella …
#> 5 Liste des espèces menacées Insectes Cuivré des marais salés Lycaena dos…
#> 6 Liste des espèces menacées Insectes Satyre fauve des Maritimes Coenonympha…
#> 7 Liste des espèces menacées Poissons Chabot de profondeur Myoxocephal…
#> 8 Liste des espèces menacées Poissons Chevalier cuivré Moxostoma h…
#> 9 Liste des espèces menacées Poissons Cisco de printemps Coregonus a…
#> 10 Liste des espèces menacées Poissons Dard de sable Ammocrypta …
#> # ℹ 170 more rows
Created on 2024-03-14 with reprex v2.1.0
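To experiment with the ancestor lookup and the NA fallback without hitting the live site, the same two XPath expressions can be tried on a small toy document. The markup below is hypothetical and only mimics the structural change mentioned earlier: in one section the h2 shares an ancestor div with its table, in the other the h2 sits in a preceding sibling div.

```r
library(rvest)

# Hypothetical toy page (not the real site)
toy <- minimal_html('
  <div class="frame"><h2>Section A</h2>
    <div><table><caption>Group 1</caption>
      <tr><th>Name</th></tr><tr><td>a1</td></tr></table></div>
  </div>
  <div class="frame"><h2>Section B</h2></div>
  <div class="frame">
    <table><caption>Group 2</caption>
      <tr><th>Name</th></tr><tr><td>b1</td></tr></table>
  </div>
')

tables <- html_elements(toy, "table")

# first pass: h2 that is a child of one of the table ancestors;
# tables without such an ancestor yield NA, so the vector stays aligned
sections <-
  html_element(tables, xpath = "./ancestor::div[h2]/h2") |>
  html_text(trim = TRUE)
missing_ <- is.na(sections)

# second pass, only for the NA entries: closest preceding sibling div with h2
sections[missing_] <-
  tables[missing_] |>
  html_element(xpath = "./ancestor::div[contains(@class, 'frame')]/preceding-sibling::div[h2][1]/h2") |>
  html_text(trim = TRUE)

data.frame(section = sections,
           group = html_element(tables, "caption") |> html_text(trim = TRUE))
```

Keeping `missing_` around makes it easy to check which tables needed the fallback.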