
Downloading data using RSelenium & Docker containers (makeFirefoxProfile & mime types)


I need to download 50+ datasets each week from a dynamic website, and I would like to automate the process in R. Each dataset comes from a different school, and each school has its own link. The code for each school's website is basically identical.

I set up my docker container using:

docker run -d -p 4446:4444 -p 5902:5900 -v /C/Users/myusername/seldownloads:/home/seluser/Downloads selenium/standalone-firefox-debug

In R, I set up my session:

fprof <- makeFirefoxProfile(list(browser.download.dir = "home/seluser/Downloads",
                                 browser.download.folderList = 2L,
                                 browser.download.manager.showWhenStarting = FALSE,
                                 browser.helperApps.neverAsk.saveToDisk = "text/csv,application/vnd.ms-excel,application/vnd.ms-excel.addin.macroenabled.12,application/vnd.ms-excelsheet.binary.macroenabled.12,application/vnd.ms-excel.template.macroenabled.12,application/vnd.ms-excel.sheet.macroenabled.12,image/png,application/zip,application/pdf"))

#Start session
remDr <- remoteDriver(remoteServerAdd = "localhost",
                      browser = "firefox",
                      port = 4446L,
                      extraCapabilities = fprof)

remDr$open()

The html for School 1's website looks like this:

<td class=bodytext><input class="btn btn-primary" name="exportData" type="submit" id="exportData" value="Export Data"></td>

And I successfully downloaded the csv file (Dataset A) from the School 1 website using:

exportdata <- remDr$findElement(using="name", value="exportData")
exportdata$clickElement()

The html for School 2's website looks like this:

<td class=bodytext><input name="exportData" class="btn btn-primary" type="submit" id="exportData" value="Export Data"></td>

But when I run the R code, Dataset A from School 2 doesn't appear on my computer.

I actually can't get anything else to download from this website except that Dataset A from School 1. I can't even get Dataset B to download from School 1. I've tried restarting docker, creating a new docker session, restarting my computer... the only csv file that will download is the Dataset A from School 1.

Is there some limitation of RSelenium that it can only download the first link you ever click on? I'm at a loss. I can't link the website because it requires a login.


Solution

  • It turned out that the MIME types were different on each website. For School 1, Dataset A, the file I was trying to download was a standard csv file (text/csv). The other schools/datasets were all application/x-csv MIME types.

    To find the MIME type of the file I was attempting to download, I followed these steps: https://developer.mozilla.org/en-US/docs/Learn/Server-side/Configuring_server_MIME_types.
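
    Once you have the Content-Type value the server reports, a quick check against the neverAsk list shows whether Firefox will save the file silently. The following is a minimal base-R sketch; the helper names mime_from_content_type and is_listed are my own for illustration, and the list is a shortened version of the one from the question.

    ```r
    # Extract the bare MIME type from a Content-Type header value,
    # e.g. "application/x-csv; charset=UTF-8" -> "application/x-csv".
    mime_from_content_type <- function(content_type) {
      tolower(trimws(sub(";.*$", "", content_type)))
    }

    # TRUE if the MIME type appears in a comma-separated neverAsk list.
    is_listed <- function(content_type, never_ask) {
      listed <- tolower(trimws(strsplit(never_ask, ",", fixed = TRUE)[[1]]))
      mime_from_content_type(content_type) %in% listed
    }

    # Shortened version of the list used in the question
    original_list <- "text/csv,application/vnd.ms-excel,application/zip,application/pdf"

    is_listed("text/csv", original_list)           # School 1, Dataset A: TRUE
    is_listed("application/x-csv", original_list)  # other schools/datasets: FALSE
    ```

    A FALSE here means Firefox would show a save dialog that the script cannot answer, which matches the symptom of the file never appearing.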

    There is also a known bug when it comes to specifying the file location. My final code looked like this:

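    library(RSelenium)
    library(stringr)  # str_replace_all(); stringr also re-exports %>%

    # Workaround for the file-location bug: turn each "/" into a doubled backslash
    file_path <- getwd() %>% str_replace_all("/", "\\\\\\\\")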
    
    #Set download info for remoteDriver (aka, where to save datasets)
    fprof <- makeFirefoxProfile(list(browser.download.dir = file_path,
                                     browser.download.folderList = 2L,
                                     browser.download.manager.showWhenStarting = FALSE,
                                     browser.helperApps.neverAsk.saveToDisk = "application/x-csv,attachment/csv,application/excel,text/csv,application/vnd.ms-excel,application/vnd.ms-excel.addin.macroenabled.12,application/vnd.ms-excelsheet.binary.macroenabled.12,application/vnd.ms-excel.template.macroenabled.12,application/vnd.ms-excel.sheet.macroenabled.12,image/png,application/zip,application/pdf"))
    

    Note that I added application/x-csv,attachment/csv to the browser.helperApps.neverAsk.saveToDisk list.

    Important: It is worth adding application/csv to that list as well. I did not need it here, but I had to add it to get another file, which was served with the MIME type text/html; charset=UTF-8, to download.
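
    Because the list tends to grow as new MIME types turn up, it can be easier to keep it as a character vector and collapse it just before building the profile. This is only a sketch: the vector name save_types is my own, and the types shown are a subset of the ones mentioned above plus application/csv.

    ```r
    # MIME types to save without prompting; add new ones here as they turn up.
    save_types <- c(
      "application/x-csv",
      "attachment/csv",
      "application/csv",
      "text/csv",
      "application/vnd.ms-excel",
      "application/zip",
      "application/pdf"
    )

    # Firefox expects a single comma-separated string with no spaces.
    never_ask <- paste(unique(save_types), collapse = ",")
    ```

    The resulting never_ask string can then be passed as browser.helperApps.neverAsk.saveToDisk = never_ask in the makeFirefoxProfile() call.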

    I also ditched Docker: I installed Java and switched to rsDriver() so that I could watch R click through the browser instead of screenshotting each step (this is not necessary).
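
    For reference, the switch to rsDriver() might look roughly like the sketch below. It assumes the RSelenium package and Java are installed and that fprof is the profile built above; the wrapper name start_local_firefox and the port number are my own choices, not part of the original code.

    ```r
    # Sketch only: start a local Selenium server plus a visible Firefox window.
    # Nothing is launched until the function is called.
    start_local_firefox <- function(fprof, port = 4567L) {
      RSelenium::rsDriver(
        browser = "firefox",
        port = port,
        extraCapabilities = fprof  # passed through to remoteDriver()
      )
    }

    # Usage (uncomment to run against a real browser):
    # driver <- start_local_firefox(fprof)
    # remDr  <- driver$client   # already opened by rsDriver()
    # ...
    # remDr$close()
    # driver$server$stop()
    ```

    Unlike the Docker setup, the downloads then land directly in the local folder given by browser.download.dir, so no volume mapping is involved.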

    Lastly, I believe that you should not have the same file types for browser.helperApps.neverAsk.openFile and browser.helperApps.neverAsk.saveToDisk, because they contradict each other. Since I wanted it to automatically save the file, I needed to only include browser.helperApps.neverAsk.saveToDisk.
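
    If you do set both preferences, a quick check for overlapping entries can catch the contradiction before the profile is built. A minimal base-R sketch, with helper names (split_types, overlapping_types) and example values that are purely illustrative:

    ```r
    # Split a comma-separated preference string into individual MIME types.
    split_types <- function(x) {
      trimws(strsplit(x, ",", fixed = TRUE)[[1]])
    }

    # MIME types that appear in both preference strings.
    overlapping_types <- function(open_file, save_to_disk) {
      intersect(split_types(open_file), split_types(save_to_disk))
    }

    # Hypothetical values for the two preferences
    open_file    <- "application/pdf,text/csv"
    save_to_disk <- "text/csv,application/x-csv"

    overlapping_types(open_file, save_to_disk)  # "text/csv"
    ```

    An empty result means the two lists do not conflict.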