
How to set the right RCurl options to download from NSE website


I am trying to download files from the NSE India website (nseindia.com). The webmaster apparently does not want scraping programs downloading files or reading pages from the site, and there seems to be a restriction based on the user agent.

The file I am trying to download is http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip

I am able to download this from the Linux shell using:

  curl -v -A "Mozilla" http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip

The output is this:

  * About to connect() to www.nseindia.com port 80 (#0)
  *   Trying 115.112.4.12...
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0connected
  > GET /archives/equities/bhavcopy/pr/PR280815.zip HTTP/1.1
  > User-Agent: Mozilla
  > Host: www.nseindia.com
  > Accept: */*
  < HTTP/1.1 200 OK
  < Server: Oracle-iPlanet-Web-Server/7.0
  < Content-Length: 374691
  < X-frame-options: SAMEORIGIN
  < Last-Modified: Fri, 28 Aug 2015 12:20:02 GMT
  < ETag: "5b7a3-55e051f2"
  < Accept-Ranges: bytes
  < Content-Type: application/zip
  < Date: Sat, 29 Aug 2015 17:56:05 GMT
  < Connection: keep-alive
  <
  { [data not shown]
  PK
    5  365k    5 19977    0     0  34013      0  0:00:11 --:--:--  0:00:11 56592

This allows me to download the file.

The code I am using with RCurl is this:

  library("RCurl")

  jurl <- "http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip"
  juseragent <- "Mozilla"
  myOpts = curlOptions(verbose = TRUE, header = TRUE, useragent = juseragent)
  jfile <- getURL(jurl,.opts=myOpts)

This, however, does not work.

I have also tried download.file from base R with the user agent changed, without success.

Any help will be appreciated.


Solution

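  • Your problem is not the user agent but the fact that you are downloading binary data. getURL returns the response body as a character string, which does not suit a zip file; ask for the raw content instead. This works: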

    jfile <- getURLContent(jurl, .opts=myOpts, binary=TRUE)
    

    Here is a more complete example using httr instead of RCurl:

    library(httr)
    url <- "http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip"
    response <- GET(url, user_agent("Mozilla"))
    response$status_code                                     # 200 OK
    # [1] 200
    tf <- tempfile()
    writeBin(content(response, "raw"), tf)                   # write response content (the zip file) to a temporary file
    files <- unzip(tf, exdir=tempdir())                      # unzips to system temp directory and returns a vector of file names
    df.lst <- lapply(files[grepl("\\.csv$",files)],read.csv) # convert .csv files to list of data.frames
    head(df.lst[[2]])
    #      SYMBOL SERIES                  SECURITY HIGH.LOW INDEX.FLAG
    # 1 AGRODUTCH     EQ AGRO DUTCH INDUSTRIES LTD        H         NA
    # 2    ALLSEC     EQ   ALLSEC TECHNOLOGIES LTD        H         NA
    # 3      ALPA     BE     ALPA LABORATORIES LTD        H         NA
    # 4      AMTL     EQ     ADV METERING TECH LTD        H         NA
    # 5  ANIKINDS     BE       ANIK INDUSTRIES LTD        H         NA
    # 6   ARSHIYA     EQ           ARSHIYA LIMITED        H         NA
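
For completeness, the RCurl route can be carried through to a file on disk in much the same way. The following is a minimal sketch under stated assumptions, not a tested recipe for the NSE site: `is_zip` and `fetch_zip` are hypothetical helper names, the check relies on the `PK` signature visible at the start of the curl output in the question, and `as.vector()` is applied on the assumption that `getURLContent` may attach attributes (such as the content type) that `writeBin` would reject.

```r
# Hypothetical helper: TRUE if a raw vector starts with the zip
# signature "PK" (bytes 0x50 0x4B).
is_zip <- function(bytes) {
  is.raw(bytes) && length(bytes) >= 2 &&
    identical(bytes[1:2], as.raw(c(0x50, 0x4b)))
}

# Hypothetical helper: fetch a zip file with RCurl and save it byte for byte.
# Requires the RCurl package to be installed.
fetch_zip <- function(url, destfile, useragent = "Mozilla") {
  opts  <- RCurl::curlOptions(useragent = useragent)
  # binary = TRUE asks for the body as a raw vector instead of a string
  bytes <- RCurl::getURLContent(url, .opts = opts, binary = TRUE)
  if (!is_zip(bytes)) {
    stop("response does not look like a zip archive")
  }
  # as.vector() drops any attributes so writeBin() sees a plain raw vector
  writeBin(as.vector(bytes), destfile)
  invisible(destfile)
}
```

Assuming the URL is still served, calling `fetch_zip(jurl, tf)` with `tf <- tempfile(fileext = ".zip")` and then `unzip(tf, exdir = tempdir())` would mirror the httr example above. The signature check is there so that an HTML error page is not silently saved as a zip file.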