Tags: r, post, web-scraping, httr, rselenium

Using R to mimic “clicking” a download file button on a webpage


There are two parts to my question, as I explored two methods in this exercise; however, I succeeded with neither. I would greatly appreciate it if someone could help me out.

[PART 1:]

I am attempting to scrape data from a page on the Singapore Stock Exchange, https://www2.sgx.com/derivatives/negotiated-large-trade, which stores its data in a table. I have some basic knowledge of scraping data with rvest. However, looking at the page with the Inspector in Chrome, the HTML hierarchy is much more complex than I expected. I can see that the data I want is hidden under <div class="table-container">, and here's what I've tried:

library(rvest)
library(httr)

# Request the page and try to select the node that holds the table
SGXurl <- "https://www2.sgx.com/derivatives/negotiated-large-trade"
SGXdata <- read_html(SGXurl)
html_nodes(SGXdata, ".table-container")

However, the code picks up nothing, and I doubt I am using it correctly.

[PART 2:]

I then noticed that there is a small "download" button on the page which downloads exactly the data file I want, in .csv format. So I thought of writing some code to mimic the download button, and I found this question, Using R to "click" a download file button on a webpage, but I was unable to get it to work even with some modifications to that code.

There are a few filters on the webpage; mostly I will be interested in downloading data for a particular business day while leaving the other filters blank, so I've tried writing the following function:

library(httr)
library(rvest)
library(purrr)
library(dplyr)

crawlSGXdata <- function(date){
  # The date filter should eventually go into `body`; body = NULL for now
  POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
       body = NULL,
       encode = "form",
       write_disk("SGXdata.csv", overwrite = TRUE))
  # read.csv() needs the file path, not the response object
  res <- read.csv("SGXdata.csv")
  return(res)
}

I intended to put the function input date into the body argument, but I was unable to figure out how to do that, so I started off with body = NULL on the assumption that it would do no filtering. However, the result is still unsatisfactory: the downloaded file is basically empty apart from the following error:

Request Rejected
The requested URL was rejected. Please consult with your administrator.
Your support ID is: 16783946804070790400
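
For reference, what I was trying to express with the body argument would presumably look something like the sketch below. The field name businessdate is only a guess (I could not find the real form field names), and the server rejects the request either way:

# Hypothetical form POST; `businessdate` is a guessed field name,
# and the server rejects this request just the same
resp <- POST("https://www2.sgx.com/derivatives/negotiated-large-trade",
             body = list(businessdate = "20190708"),
             encode = "form",
             write_disk("SGXdata.csv", overwrite = TRUE))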

Solution

  • The content is loaded dynamically from an API call that returns JSON. You can find this call in the network tab of the browser's dev tools.

    The following returns that content. I read the total number of result pages from the response metadata, then loop over the remaining pages, combining the dataframe returned by each call into one final dataframe containing all of the results.

    library(jsonlite)

    # First request: pagestart=0 returns the first page of data plus
    # metadata that includes the total page count
    url <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=0&pageSize=250'
    r <- jsonlite::fromJSON(url)
    num_pages <- r$meta$totalPages
    df <- r$data
    url2 <- 'https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures&businessdatestart=20190708&businessdateend=20190708&pagestart=placeholder&pageSize=250'

    # Pages are zero-indexed and page 0 was fetched above,
    # so the remaining pages run from 1 to num_pages - 1
    if(num_pages > 1){
      for(i in seq(1, num_pages - 1)){
        newUrl <- gsub("placeholder", i, url2)
        newdf <- jsonlite::fromJSON(newUrl)$data
        df <- rbind(df, newdf)
      }
    }
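
    To match the original goal of pulling one particular business day, the same logic can be wrapped in a function where the date (yyyymmdd) is interpolated into the businessdatestart and businessdateend query parameters. A minimal sketch along the lines of the crawlSGXdata function from the question:

    crawlSGXdata <- function(date){
      # `date` is a string in yyyymmdd format, e.g. "20190708"
      base <- paste0('https://api.sgx.com/negotiatedlargetrades/v1.0?order=asc&orderby=contractcode&category=futures',
                     '&businessdatestart=', date, '&businessdateend=', date,
                     '&pagestart=%d&pageSize=250')
      r <- jsonlite::fromJSON(sprintf(base, 0))
      df <- r$data
      num_pages <- r$meta$totalPages
      # pages are zero-indexed; page 0 was already fetched above
      if(num_pages > 1){
        for(i in seq(1, num_pages - 1)){
          df <- rbind(df, jsonlite::fromJSON(sprintf(base, i))$data)
        }
      }
      df
    }

    df <- crawlSGXdata("20190708")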