Tags: r, web-scraping, href, rvest

Web scraping with R: can't see the downloadable links


I am trying to download some .xlsx files from this kind of webpage (EDIT: or this one). However, when I view the page source (right-click, then "View page source"), I can't see all of the page's content, only the header and the footer.

I tried using rvest to list the downloadable links, but the result is the same: it returns only the links from the header and the footer:

library(rvest)
# read_html() replaces html(), which has been removed from rvest
read_html("https://b2share.eudat.eu/records/8d47a255ba5749e3ac169527e22f0068") %>% 
     html_nodes("a")

Returns:

#{xml_nodeset (5)}
#[1] <a href="https://eudat.eu">Go to EUDAT website</a>
#[2] <a href="https://eudat.eu"><img src="/img/logo_eudat_cdi.svg" alt="EUDAT CDI logo" style="max-width: 200px"></a>
#[3] <a href="https://www.eudat.eu/eudat-cdi-aup">Acceptable Use Policy </a>
#[4] <a href="https://eudat.eu/privacy-policy-summary">Data Privacy Statement</a>
#[5] <a href="https://eudat.eu/what-eudat">About EUDAT</a>

Any idea how to access the content of the whole page?


Solution

  • The file list is not in the static HTML that rvest receives, so query the site's API instead. Pass the record ID to the API endpoint, which returns the parts needed to construct the file download links:

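    library(jsonlite)
    
    # Fetch the record's metadata as a nested list
    d <- jsonlite::read_json('https://b2share.eudat.eu/api/records/8d47a255ba5749e3ac169527e22f0068')
    
    # Join the base files URL with the key of the record's first file
    files <- paste(d$links$files, d$files[[1]]$key, sep = '/')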
    

    For reuse, you can rewrite this as a function that accepts the record's page URL as its argument:

    library(jsonlite)
    library(stringr)
    
    get_links <- function(link){
      # The record ID is the last path segment of the page URL
      record_id <- tail(str_split(link, '/')[[1]], 1)
      d <- jsonlite::read_json(paste0('https://b2share.eudat.eu/api/records/', record_id))
      # Download link for the record's first file
      links <- paste(d$links$files, d$files[[1]]$key, sep = '/')
      return(links)
    }
    
    get_links('https://b2share.eudat.eu/records/ce32a67a789b44a1a15965fd28a8cb17')
    get_links('https://b2share.eudat.eu/records/8d47a255ba5749e3ac169527e22f0068')
    

    If you already have the record ID, this simplifies to:

    library(jsonlite)
    
    get_links <- function(record_id){
      d <- jsonlite::read_json(paste0('https://b2share.eudat.eu/api/records/', record_id))
      links <- paste(d$links$files, d$files[[1]]$key , sep = '/')
      return(links)
    }
    
    get_links('ce32a67a789b44a1a15965fd28a8cb17')
    get_links('8d47a255ba5749e3ac169527e22f0068')
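
The functions above build a link for the first file only (`d$files[[1]]`). If a record holds several files, a minimal sketch along the following lines might collect one link per file and then download them. It assumes every entry in `d$files` has a `key` field, as the first one does. `build_file_links`, `download_files`, and the `example.org` record are hypothetical names and data used for illustration; they are not part of the B2SHARE API.

```r
# Sketch: build one download URL per file in an already-parsed record.
# `record` is assumed to have the same shape as the list returned by
# jsonlite::read_json() above: record$links$files (base files URL) and
# record$files (a list of entries, each assumed to carry a `key`).
build_file_links <- function(record) {
  keys <- vapply(record$files, function(f) f$key, character(1))
  if (length(keys) == 0) {
    return(character(0))  # record without files
  }
  paste(record$links$files, keys, sep = "/")
}

# Hypothetical helper: save each link under its file name in dest_dir.
download_files <- function(links, dest_dir = tempdir()) {
  dest <- file.path(dest_dir, basename(links))
  for (i in seq_along(links)) {
    # mode = "wb" keeps binary files such as .xlsx intact on Windows
    utils::download.file(links[i], dest[i], mode = "wb")
  }
  invisible(dest)
}

# Made-up record standing in for the parsed API response
record <- list(
  links = list(files = "https://example.org/api/files/bucket-id"),
  files = list(list(key = "a.xlsx"), list(key = "b.xlsx"))
)

links <- build_file_links(record)
print(links)
```

With a real record, you would pass the result of `jsonlite::read_json()` to `build_file_links()` and then call `download_files()` on the returned links. File names containing spaces or other special characters may need URL encoding first, which this sketch does not handle.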