I am trying to download some .xlsx files from this kind of webpage, or this one. However, when I display the source code (right click --> view source code), I can't see all the content of the actual webpage, just the header and the footer.
I tried using rvest to list the downloadable links, but the result is the same: it returns only the links from the header and the footer:
library(rvest)
# read_html() replaces the deprecated html()
read_html("https://b2share.eudat.eu/records/8d47a255ba5749e3ac169527e22f0068") %>%
  html_nodes("a")
Returns:
#{xml_nodeset (5)}
#[1] <a href="https://eudat.eu">Go to EUDAT website</a>
#[2] <a href="https://eudat.eu"><img src="/img/logo_eudat_cdi.svg" alt="EUDAT CDI logo" style="max-width: 200px"></a>
#[3] <a href="https://www.eudat.eu/eudat-cdi-aup">Acceptable Use Policy </a>
#[4] <a href="https://eudat.eu/privacy-policy-summary">Data Privacy Statement</a>
#[5] <a href="https://eudat.eu/what-eudat">About EUDAT</a>
Any idea how to access the content of the whole page?
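The file list is not part of the page's static HTML (it appears to be loaded dynamically), which is why rvest only sees the header and the footer. Instead, pass the record id to an API endpoint, which provides the parts needed to construct the file download links: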
library(jsonlite)
d <- jsonlite::read_json('https://b2share.eudat.eu/api/records/8d47a255ba5749e3ac169527e22f0068')
files <- paste(d$links$files, d$files[[1]]$key, sep = '/')
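Note that `d$files[[1]]$key` picks up only the first file of the record. If a record holds several files, a sketch along these lines would build one link per file. It assumes the parsed response has the shape implied by the code above (a `links$files` URL and a `files` list whose elements each carry a `key`). `build_file_links` is a hypothetical helper name, and the `URLencode()` step is an added assumption that file names may contain spaces or other characters that need escaping.

```r
# Hypothetical helper: build one download link per file in a parsed record.
# Assumes `record` is the list returned by jsonlite::read_json() for a record,
# with `record$links$files` (base URL) and `record$files` (list of file entries).
build_file_links <- function(record) {
  # Collect the `key` (file name) of every file entry, not just the first one
  keys <- vapply(record$files, function(f) f$key, character(1))
  if (length(keys) == 0) {
    return(character(0))
  }
  # Assumption: escape spaces and reserved characters in file names
  keys <- vapply(keys, utils::URLencode, character(1),
                 reserved = TRUE, USE.NAMES = FALSE)
  paste(record$links$files, keys, sep = "/")
}
```

With the record parsed as above, `build_file_links(d)` would return a character vector with one link per file.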
For reuse, you can rewrite this as a function that accepts the record link as its argument:
library(jsonlite)
library(stringr)
get_links <- function(link) {
  # The record id is the last path component of the record link
  record_id <- tail(str_split(link, '/')[[1]], 1)
  d <- jsonlite::read_json(paste0('https://b2share.eudat.eu/api/records/', record_id))
  links <- paste(d$links$files, d$files[[1]]$key, sep = '/')
  return(links)
}
get_links('https://b2share.eudat.eu/records/ce32a67a789b44a1a15965fd28a8cb17')
get_links('https://b2share.eudat.eu/records/8d47a255ba5749e3ac169527e22f0068')
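If you would rather not load stringr just to split the URL, base R's `basename()` returns the last path component. A minimal sketch, assuming the record id is always the last path component of the link; `record_id_from_link` is a hypothetical name, and stripping a query string or fragment is an extra precaution not present in the original.

```r
# Hypothetical helper: extract the record id from a record link using base R only.
record_id_from_link <- function(link) {
  # Drop any query string or fragment, then take the last path component
  basename(sub("[?#].*$", "", link))
}

record_id_from_link("https://b2share.eudat.eu/records/8d47a255ba5749e3ac169527e22f0068")
```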
If you pass the record id directly, this simplifies to:
library(jsonlite)
get_links <- function(record_id) {
  d <- jsonlite::read_json(paste0('https://b2share.eudat.eu/api/records/', record_id))
  links <- paste(d$links$files, d$files[[1]]$key, sep = '/')
  return(links)
}
get_links('ce32a67a789b44a1a15965fd28a8cb17')
get_links('8d47a255ba5749e3ac169527e22f0068')
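Once you have the links, base R's `download.file()` can fetch the files themselves. This is a sketch, assuming the links can be downloaded without authentication. `download_links` and `dest_dir` are hypothetical names, and `mode = "wb"` is used because .xlsx files are binary.

```r
# Hypothetical helper: download every link into `dest_dir`,
# naming each file after the last path component of its link.
download_links <- function(links, dest_dir = ".") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dest_dir, basename(links))
  for (i in seq_along(links)) {
    # Binary mode keeps .xlsx files intact on all platforms
    download.file(links[i], destfile = dest[i], mode = "wb")
  }
  # Return the local paths invisibly so they can be passed on to a reader
  invisible(dest)
}

# Example call (requires network access):
# download_links(get_links('8d47a255ba5749e3ac169527e22f0068'), dest_dir = "b2share_files")
```

The returned paths could then be handed to whichever package you use to read .xlsx files.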