Search code examples
htmlrweb-scrapingrvestnested-table

How to scrape HTML table with nested column with Rvest?


I encounter a big problem in scrapping of HTML table with nested columns.

The table is from the immigration department of Hong Kong.

A screenshot is shown here:

enter image description here

I tried to do it with rvest, but the result is messy.

library(rvest)
library(tidyverse)
library(stringr)
library(dplyr)

url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"

url_data %>%
read_html()
css_selector <- "body > section:nth-child(7) > div > div > div > div > table"
immiTable <- url_data %>% 
read_html() %>% html_element(css = css_selector) %>% html_table()
immiTable

enter image description here

My goal is to extract the first row (i.e. Airport) and plot it to a pie chart, and produce a dataframe of the whole table and save it to excel.

I realize that teaching material for unnest table and scrapping nested table is rather scarce. Therefore I need your guidance. Thank you very much for your help.


Solution

  • An alternative would be to select the tbody rows, filtering out the hidden items by attribute, then add in the headers later.

    library(rvest)
    library(tidyverse)
    
    rows <- read_html("https://www.immd.gov.hk/eng/stat_20220901.html") %>% html_elements(".table-passengerTrafficStat tbody tr")
    prefixes <- c("arr", "dep")
    cols <- c("Hong Kong Residents", "Mainland Visitors", "Other Visitors", "Total")
    headers <- c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = T) %>% unlist() %>% unname())
    
    df <- map_dfr(
      rows,
      function(x) {
        x %>%
          html_elements("td[headers]") %>%
          set_names(headers) %>%
          html_text()
      }
    ) %>%
      mutate(across(c(-1), ~ str_replace(.x, ",", "") %>% as.integer()))
    

    Or somewhat condensed,

    library(rvest)
    library(tidyverse)
    
    rows <- read_html("https://www.immd.gov.hk/eng/stat_20220901.html") %>% html_elements(".table-passengerTrafficStat tbody tr")
    prefixes <- c("arr", "dep")
    cols <- c("Hong Kong Residents", "Mainland Visitors", "Other Visitors", "Total")
    headers <- c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = T) %>% unlist() %>% unname())
    
    df <- map_dfr(rows, ~ set_names(.x %>% html_elements("td[headers]") %>% html_text(), headers)) %>%
      mutate(across(c(-1), ~ str_replace(.x, ",", "") %>% as.integer()))