Downloading a Word document from a web page in R

I am trying to download a Word document from the web page below. When you press the button, the Word document is downloaded automatically, without any download link being shown.

I copied the XPath of the button and am now trying to use it to download the document from within R.

library(rvest)

# send an HTTP GET request to the URL
url <- "https://ec.europa.eu/taxation_customs/tedb/taxDetails.html?id=4205/1672527600"
page <- read_html(url)

# locate the link to the Word document using an XPath expression
doc_link <- page %>%
  html_nodes(xpath = '//*[@id="action_word_export"]') %>%
  html_attr("href")

Unfortunately, this does not work and nothing is downloaded. Can anybody help me solve this problem and download the Word document from within R?

Solution

The problem is that the button triggers JavaScript that sends the download request, so there is no href attribute on the button itself. If you're open to using RSelenium, here's a way to download the file:

# load libraries
library(RSelenium)


# define target url
url <- "https://ec.europa.eu/taxation_customs/tedb/taxDetails.html?id=4205/1672527600"


# start RSelenium ------------------------------------------------------------

rD <- rsDriver(browser="firefox", port=4550L, chromever = NULL)
remDr <- rD[["client"]]

# open the remote driver-------------------------------------------------------
remDr$open()

# Navigate to webpage -----------------------------------------------------
remDr$navigate(url)


# click on the download button ------------------------------------
remDr$findElement(using = "xpath", value = '//*[@id="action_word_export"]')$clickElement()

The file should download to your default downloads folder.

It's also possible that their download links follow a standard format. You can see which URL the JavaScript requests by using the browser's web developer tools.

If you append that path to the base URL, you end up with a link that also downloads the file:

download_link <- paste0("https://ec.europa.eu/taxation_customs/tedb/",
                        "exportTax.html?taxId=4205&taxVersionDate=1672527600")

https://ec.europa.eu/taxation_customs/tedb/exportTax.html?taxId=4205&taxVersionDate=1672527600
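
If that link works in a browser, base R's download.file() may be enough to fetch the file without RSelenium. The following is a minimal sketch, assuming the server answers a plain GET request without needing cookies or a session (not verified here). The helper name save_export and the file name tax_details.docx are illustrative, and the real file extension may differ.

```r
# Direct export link, as assembled above
download_link <- paste0("https://ec.europa.eu/taxation_customs/tedb/",
                        "exportTax.html?taxId=4205&taxVersionDate=1672527600")

# Hypothetical helper: save the file behind `link` to `dest_file`.
# mode = "wb" writes in binary mode so the Word file is not corrupted on Windows.
save_export <- function(link, dest_file = "tax_details.docx") {
  status <- download.file(link, destfile = dest_file, mode = "wb")
  if (status != 0) stop("download failed with status ", status)
  invisible(dest_file)
}
```

Calling save_export(download_link) should then write the document to the current working directory. If the server rejects the request, the RSelenium approach above remains the fallback.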

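There might be a pattern that lets you build the download links directly from your search criteria instead of using RSelenium. In this example, the two numbers in the page URL's id parameter (4205/1672527600) reappear as taxId and taxVersionDate in the download link.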
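
Comparing the two URLs in this answer, the id=4205/1672527600 part of the page URL maps onto taxId=4205 and taxVersionDate=1672527600 in the export URL. A sketch of that mapping follows, assuming the pattern holds for other pages, which has only been observed for this one example. The function name build_export_link is made up for illustration.

```r
# Hypothetical helper: turn a taxDetails.html URL into an exportTax.html URL,
# assuming the id parameter always has the form <taxId>/<taxVersionDate>.
build_export_link <- function(details_url) {
  # Capture the two numeric parts of "id=<taxId>/<taxVersionDate>"
  m <- regmatches(details_url,
                  regexec("[?&]id=([0-9]+)/([0-9]+)", details_url))[[1]]
  if (length(m) != 3) {
    stop("could not find id=<taxId>/<taxVersionDate> in: ", details_url)
  }
  paste0("https://ec.europa.eu/taxation_customs/tedb/exportTax.html",
         "?taxId=", m[2], "&taxVersionDate=", m[3])
}

url <- "https://ec.europa.eu/taxation_customs/tedb/taxDetails.html?id=4205/1672527600"
build_export_link(url)
# [1] "https://ec.europa.eu/taxation_customs/tedb/exportTax.html?taxId=4205&taxVersionDate=1672527600"
```

The function stops with an error when the id parameter is missing or has a different shape, so a change in the site's URL format shows up immediately rather than producing a broken link.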