Tags: r, web-scraping, html-table, iteration, rvest

Trouble scraping HTML table data over a date range with rvest


Two weeks ago I asked how to scrape HTML tables with nested columns. Thanks to your help, I can now scrape the data for one particular day and filter out the irrelevant rows:

library(rvest)
library(dplyr)
library(tidyverse)

theDate <- Sys.Date() - 7
theDateInNumber <- gsub("\\-", "", Sys.Date() - 7)

url_data <- paste0("https://www.immd.gov.hk/eng/stat_", theDateInNumber, ".html")

rows <- read_html(url_data) %>% html_elements(".table-passengerTrafficStat tbody tr")
prefixes <- c("arr", "dep")
cols <- c("Hong Kong Residents", "Mainland Visitors", "Other Visitors", "Total")
headers <- c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = T) %>% unlist() %>% unname())

df <- map_dfr(
  rows,
  function(x) {
    x %>%
      html_elements("td[headers]") %>%
      set_names(headers) %>%
      html_text()
  }
) %>%
  filter(Control_Point %in% c("Airport")) %>% #select only airport data
  mutate(across(c(-1), ~ str_replace(.x, ",", "") %>% as.integer())) %>%
  mutate(date = theDate)

write.csv(df, "immigrationStatistics.csv")

view(df)


This time I am trying to scrape the same kind of data, airport traffic figures, for an arbitrary date range. My goal is to build a table of airport traffic and plot a line chart of how the figures change over that interval, but I am having trouble with the iteration.

My code is as follows:

library(rvest)
library(dplyr)
library(tidyverse)


start <- as.Date("01-09-22", format = "%d-%m-%y")
end   <- as.Date("30-09-22", format = "%d-%m-%y")


prefixes <- c("arr", "dep")
cols <-
  c("Hong Kong Residents",
    "Mainland Visitors",
    "Other Visitors",
    "Total")
headers <-
  c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = T) %>% unlist() %>% unname())


theDate <- start
while (theDate <= end)
{
  url_data <-
    print(paste0("https://www.immd.gov.hk/eng/stat_", format(theDate, "%Y%m%d"), ".html"
    ))
  
  rows <-
    read_html(url_data) %>% html_elements(".table-passengerTrafficStat tbody tr")
 
  df <- map_dfr(rows,
                function(x) {
                  x %>%
                    html_elements("td[headers]") %>%
                    set_names(headers) %>%
                    html_text()
                }) %>%
    filter(Control_Point %in% c("Airport")) %>% #select only airport data
    mutate(across(c(-1), ~ str_replace(.x, ",", "") %>% as.integer())) %>%
    mutate(date = theDate - 1) %>%
    write.csv(df, "immigrationStatistics.csv")
  
  theDate <- theDate + 1
}
view(df)

Where and why does the error occur, and how can I fix the loop? The console reports:

[1] "https://www.immd.gov.hk/eng/stat_20220901.html"
Error in file == "" : 
  comparison (1) is possible only for atomic and list types
> view(df)
Error in checkHT(n, dim(x)) : 
  invalid 'n' -  must contain at least one non-missing element, got none.

Thanks a million in advance.


Solution

  • I was unable to reproduce your error. However, I did change the code to collect the result of each iteration in a list and write the data to a file just once, after the loop. It looks like your original code would overwrite the data file on each iteration.

    library(rvest)
    library(dplyr)
    library(tidyr)    # crossing(), unite()
    library(purrr)
    library(stringr)
    
    start <- as.Date("01-09-22", format = "%d-%m-%y")
    end   <- as.Date("3-09-22", format = "%d-%m-%y")
    
    # Column names: the control point, then arr_* and dep_* for each category
    prefixes <- c("arr", "dep")
    cols <-
       c("Hong Kong Residents",
         "Mainland Visitors",
         "Other Visitors",
         "Total")
    headers <-
       c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = TRUE) %>% unlist() %>% unname())
    
    answer <- list()   # one data frame per day
    theDate <- start
    while (theDate <= end) {
       url_data <- paste0("https://www.immd.gov.hk/eng/stat_", format(theDate, "%Y%m%d"), ".html")
       print(url_data)   # show progress
       
       rows <-
          read_html(url_data) %>% html_elements(".table-passengerTrafficStat tbody tr")
       
       df <- map_dfr(rows,
                     function(x) {
                        x %>%
                           html_elements("td[headers]") %>%
                           set_names(headers) %>%
                           html_text()
                     })  %>%
          filter(Control_Point %in% c("Airport")) %>% # select only airport data
          mutate(across(c(-1), ~ str_replace_all(.x, ",", "") %>% as.integer())) %>% # strip every thousands separator
          mutate(date = theDate - 1)
       # Key by the date string; indexing with the Date itself would use its
       # numeric value and pad the list with thousands of NULL entries
       answer[[as.character(theDate)]] <- df
          
       theDate <- theDate + 1
       Sys.sleep(1)   # pause between requests
    }
    # Combine the daily results and write the file once
    write.csv(bind_rows(answer), "immigrationStatistics.csv")
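
The same collect-then-write idea can also be written over a date sequence instead of a manual counter. The following is only a sketch under a few assumptions: `stat_url()`, `scrape_range()` and `fake_scrape()` are hypothetical names introduced here, and `scrape_one` stands for whatever function wraps the `read_html()` pipeline for a single page. Wrapping it in `purrr::possibly()` means a day whose page fails to load is skipped rather than stopping the whole run. The demonstration uses a fake scraper so that it runs without network access.

```r
library(dplyr)
library(purrr)

# Build the page URL for one date, using the same pattern as the loop above.
stat_url <- function(d) {
  paste0("https://www.immd.gov.hk/eng/stat_", format(d, "%Y%m%d"), ".html")
}

# `scrape_one` is a hypothetical function: it takes a URL and returns a data
# frame for that day (for example, the read_html() pipeline wrapped in a function).
scrape_range <- function(start, end, scrape_one, pause = 1) {
  days <- seq(start, end, by = "day")
  safe_scrape <- possibly(scrape_one, otherwise = NULL)  # NULL instead of an error

  results <- map(seq_along(days), function(i) {
    d <- days[i]                       # indexing keeps the Date class
    out <- safe_scrape(stat_url(d))
    if (pause > 0) Sys.sleep(pause)    # pause between requests
    if (is.null(out)) return(NULL)     # skip days that failed
    mutate(out, date = d - 1)          # same date convention as the loop above
  })

  bind_rows(compact(results))          # drop failed days, stack the rest
}

# Offline demonstration with a fake scraper that fails for one day.
fake_scrape <- function(url) {
  if (grepl("20220902", url)) stop("page not available")
  data.frame(Control_Point = "Airport", arr_Total = 100L)
}

demo <- scrape_range(as.Date("2022-09-01"), as.Date("2022-09-03"),
                     fake_scrape, pause = 0)
print(demo)
```

With a real scraper in place of `fake_scrape`, the combined result could be passed to a single `write.csv()` call, as in the loop above.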
    

    The final change was to add a short pause (Sys.sleep(1)) between requests so that the scraping does not look like an attack on the server.
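
As for the message in the question, one plausible reading (not verified against the live site) is that it comes from the last step of the pipeline, `... %>% write.csv(df, "immigrationStatistics.csv")`. The pipe already supplies the data as the first argument, so `df` ends up in the `file` position. In a fresh session `df` has not been assigned yet, so the name resolves to the function `stats::df`, and comparing a function with `""` fails. The sketch below reproduces that situation offline; `day` and `out_file` are hypothetical names used only for illustration.

```r
library(magrittr)

# Hypothetical stand-in for one day's scraped result.
day <- data.frame(Control_Point = "Airport", arr_Total = 1234L)

# `lhs %>% f(a, b)` runs f(lhs, a, b). Below, `day` becomes the data to write
# and `df` lands in write.csv()'s second argument, `file`.
# Assumption: a fresh session, where `df` is not yet a data frame, so the
# name resolves to stats::df, a function.
err <- tryCatch(
  day %>% write.csv(df, "immigrationStatistics.csv"),
  error = function(e) conditionMessage(e)
)
print(is.function(df))
print(err)

# Ending the pipeline with the file name alone writes the piped data.
out_file <- tempfile(fileext = ".csv")  # temporary path, for illustration only
day %>% write.csv(out_file, row.names = FALSE)
print(read.csv(out_file))
```

If this reading is right, the later `view(df)` call fails for the same reason: the assignment to `df` never completed, so `df` is still the function rather than a table.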