Search code examples
rrvest

Downloaded tabular data contains unexpected results


I'm trying to download some tabular data from the website https://econservation.jrc.ec.europa.eu/recordeddonors?title_selective=All&field_project_year_value_selective=All using the R package rvest.

I want to download the data from the table at the bottom of the website, but to access all of it, you have to click through the table with the "next" button below the table.

The URL behind the "next" button leads to https://econservation.jrc.ec.europa.eu/recordeddonors?title_selective=All&field_project_year_value_selective=All&page=1.

So in order to download the first two partitions, I would use this code:

library(rvest)

link_base <- https://econservation.jrc.ec.europa.eu/recordeddonors?title_selective=All&field_project_year_value_selective=All
link_base_page <- https://econservation.jrc.ec.europa.eu/recordeddonors?title_selective=All&field_project_year_value_selective=All&page=
 
get_content <- function(page_url) {
  print(page_url)

  econversation_page <- read_html(page_url)
  tbl_content <- econversation_page %>%
    html_nodes(xpath='//*[@id="views-aggregator-datatable"]') %>%
    html_table()

  return(tbl_content [[1]])
}

urls <- c(
  link_base,
  paste0(link_base_page, 1)
)

tables <- lapply(urls, get_content)

However, tables[[1]] and tables[[2]] both contain the same data. The second link somehow does not lead to the second partition of the table.

Any ideas what I'm missing here?


Solution

  • Next button triggers an Ajax POST call, to mimic it we may use httr / httr2. Ajax response is a JSON with an embedded HTML table, so rvest still gets some use. Regarding dataset, all of those donors should also have a decent database with API interface or at least with an option to export.

    Example bellow collects first 5 pages / 50 records in total, out of 3105.

    library(httr2)
    library(purrr)
    library(rvest)
    library(dplyr)
    
    get_page <- function(page){
      # Ajax POST call to fetch a page of 10 results, 
      # Returns JSON response with html table
      html <- request("https://econservation.jrc.ec.europa.eu/views/ajax") %>% 
        req_url_query(`title_selective` = "All", `field_project_year_value_selective` = "All") %>% 
        req_body_form(
          `page` = page,
          `view_name` = "donors_page",
          `view_display_id` = "page_2",
          `view_args` = "",
          `view_path` = "recordeddonors",
          `view_base_path` = "testluca",
          `pager_element` = "0",
          `title_selective` = "All",
          `field_project_year_value_selective` = "All") %>% 
        req_perform() %>% 
        resp_body_json() %>% 
        pluck(3, "data") %>% 
        minimal_html() 
    
      # parse table, misses URLs 
      tab <- html %>% 
        html_element("table") %>% 
        html_table() %>% 
        select(-1)
      
      # collect URLs,
      urls <- html %>% 
        html_element("table tbody") %>% 
        html_elements("tr") %>% 
        {
          list(url_ext   = html_element(.,"td.views-field-field-project-link a") %>% html_attr("href"),
               url_int   = html_element(.,"td.views-field-title a") %>% html_attr("href"),
               url_donor = html_element(.,"td.views-field-title-2 a") %>% html_attr("href"))
        }
      bind_cols(tab, urls)
    }
    
    # test with 5 first pages (0 .. 4), returns tibble with 50 records
    donors <- map(0:4, get_page) %>% 
      list_rbind()
    

    Results:

    donors
    #> # A tibble: 50 × 9
    #>    PROJECT         ACTIO…¹ TIME …² FUNDI…³ DONOR PROGRAM url_ext url_int url_d…⁴
    #>    <chr>           <chr>   <chr>   <chr>   <chr> <chr>   <chr>   <chr>   <chr>  
    #>  1 Coastal Waters… 1.Land… From 1… 200,85… The … World … "http:… /proje… /donor…
    #>  2 Ports Developm… 2.Land… From 9… 188,19… The … World … "http:… /proje… /donor…
    #>  3 Environment Pr… 1.Land… From 1… 155,00… Glob… GEF Tr… "https… /proje… /donor…
    #>  4 First Programm… 2.Land… From 3… 130,00… The … World … "https… /proje… /donor…
    #>  5 Northern Borde… 2.Land… From 6… 128,28… The … World … "http:… /proje… /donor…
    #>  6 First Programm… 4.Educ… From 8… 120,60… The … World … "http:… /proje… /donor…
    #>  7 Peru First Pro… 1.Land… From 2… 108,90… The … World … "http:… /proje… /donor…
    #>  8 Philippine Rur… 2.Land… From 8… 101,40… The … World … "http:… /proje… /donor…
    #>  9 BR Marine Prot… 1.Land… From 9… 94,288… The … World … "http:… /proje… /donor…
    #> 10 Mato Grosso Na… 1.Land… From 6… 94,281… The … World … "https… /proje… /donor…
    #> # … with 40 more rows, and abbreviated variable names ¹​`ACTION TYPE`,
    #> #   ²​`TIME EXTENT`, ³​`FUNDING (US $)`, ⁴​url_donor
    glimpse(donors)
    #> Rows: 50
    #> Columns: 9
    #> $ PROJECT          <chr> "Coastal Watersheds Conservation in the Context of Cl…
    #> $ `ACTION TYPE`    <chr> "1.Land/water protection", "2.Land/water management, …
    #> $ `TIME EXTENT`    <chr> "From 11/2013 to 6/2019", "From 9/1994 to 12/2003", "…
    #> $ `FUNDING (US $)` <chr> "200,850,000", "188,190,000", "155,000,000", "130,000…
    #> $ DONOR            <chr> "The World Bank", "The World Bank", "Global Environme…
    #> $ PROGRAM          <chr> "World Bank", "World Bank", "GEF Trust Fund", "World …
    #> $ url_ext          <chr> "http://projects.worldbank.org/P131709?lang=en ", "ht…
    #> $ url_int          <chr> "/project/1000050", "/project/1000511", "/project/500…
    #> $ url_donor        <chr> "/donor/10", "/donor/10", "/donor/12", "/donor/10", "…
    

    Created on 2023-03-16 with reprex v2.0.2