
Downloading a Word document from a web page in R


I am trying to download a Word document from the web page below. When you press the button, the Word document is downloaded automatically, and no download link is ever shown.


I copied the button's XPath and am trying to use it to download the document from within R.

library(rvest)

# send an HTTP GET request to the URL
url <- "https://ec.europa.eu/taxation_customs/tedb/taxDetails.html?id=4205/1672527600"
page <- read_html(url)

# locate the Word export button by XPath and read its href attribute
doc_link <- page %>%
  html_nodes(xpath='//*[@id="action_word_export"]')%>%
  html_attr("href")

Unfortunately, this does not work, and nothing is downloaded. Can anybody help me solve this problem and download the Word document from within R?


Solution

  • The problem is that the button triggers JavaScript that actually sends the download request, so there is no href attribute associated directly with the button. If you're open to using RSelenium, here's a way to download the file:

    # load libraries
    library(RSelenium)
    
    
    # define target url
    url <- "https://ec.europa.eu/taxation_customs/tedb/taxDetails.html?id=4205/1672527600"
    
    
    # start RSelenium ------------------------------------------------------------
    
    rD <- rsDriver(browser="firefox", port=4550L, chromever = NULL)
    remDr <- rD[["client"]]
    
    # open the remote driver-------------------------------------------------------
    remDr$open()
    
    # Navigate to webpage -----------------------------------------------------
    remDr$navigate(url)
    
    
    # click on the download button ------------------------------------
    remDr$findElement(using = "xpath",value = '//*[@id="action_word_export"]')$clickElement()
    
    
    

    The file should download to your default downloads folder.

    It's also possible that their download links follow a standard format. You can see which URL the JavaScript points to using the browser's web developer tools.

    If you append that part to the base URL, you end up with a link that also downloads the file:

    download_link <- paste0("https://ec.europa.eu/taxation_customs/tedb/",
                            "exportTax.html?taxId=4205&taxVersionDate=1672527600")
                             
    

    https://ec.europa.eu/taxation_customs/tedb/exportTax.html?taxId=4205&taxVersionDate=1672527600

    There might be a pattern that would allow you to paste together your search criteria to generate download links instead of using RSelenium.
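
If that pattern holds, a minimal sketch along these lines could build the export link directly from the details-page URL and fetch the file without a browser. It assumes the `id` parameter always has the form `<taxId>/<taxVersionDate>`, which is inferred only from the single example above. The helper names `build_export_url` and `download_tax_doc` are hypothetical, and the output file name and `.docx` extension are guesses.

```r
# Build the exportTax.html link from a taxDetails.html URL.
# Assumption: the id parameter looks like "<taxId>/<taxVersionDate>".
build_export_url <- function(details_url) {
  # extract the value after "id=", e.g. "4205/1672527600"
  id <- sub(".*[?&]id=([^&]+).*", "\\1", details_url)
  parts <- strsplit(id, "/", fixed = TRUE)[[1]]
  # stop early if the URL does not match the assumed pattern
  stopifnot(length(parts) == 2)
  paste0("https://ec.europa.eu/taxation_customs/tedb/exportTax.html",
         "?taxId=", parts[1], "&taxVersionDate=", parts[2])
}

# Download the document behind the export link.
# mode = "wb" writes the file as binary, which matters on Windows.
download_tax_doc <- function(details_url, destfile) {
  download.file(build_export_url(details_url), destfile = destfile, mode = "wb")
}
```

Calling `download_tax_doc(url, "tax_4205.docx")` with the `url` defined earlier should then save the file to the working directory, provided the server accepts a plain GET request without the cookies or headers a browser session would send.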