I am trying to download files from the NSE India website (nseindia.com). The problem is that the webmaster does not like scraping programs downloading files or reading pages from the website. They seem to have a user-agent-based restriction.
The file I am trying to download is http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip
I am able to download this from the linux shell using
curl -v -A "Mozilla" http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip
The output is this
* About to connect() to www.nseindia.com port 80 (#0)
*   Trying 115.112.4.12...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0connected
> GET /archives/equities/bhavcopy/pr/PR280815.zip HTTP/1.1
> User-Agent: Mozilla
> Host: www.nseindia.com
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: Oracle-iPlanet-Web-Server/7.0
< Content-Length: 374691
< X-frame-options: SAMEORIGIN
< Last-Modified: Fri, 28 Aug 2015 12:20:02 GMT
< ETag: "5b7a3-55e051f2"
< Accept-Ranges: bytes
< Content-Type: application/zip
< Date: Sat, 29 Aug 2015 17:56:05 GMT
< Connection: keep-alive
<
{ [data not shown]
PK
  5  365k    5 19977    0     0  34013      0  0:00:11 --:--:--  0:00:11 56592
This allows me to download the file.
The code I am using with RCurl is this:
library("RCurl")
jurl <- "http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip"
juseragent <- "Mozilla"
myOpts = curlOptions(verbose = TRUE, header = TRUE, useragent = juseragent)
jfile <- getURL(jurl,.opts=myOpts)
This, however, does not work.
I have also tried download.file (from the utils package) with the user agent changed, without success.
Any help will be appreciated.
First, your problem is not setting the user agent, but downloading binary data. This works:
jfile <- getURLContent(jurl, .opts=myOpts, binary=TRUE)
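Note that jfile is then a raw vector in memory, not a file on disk. As a minimal sketch (save_zip is a hypothetical helper of my own, not an RCurl function), you could write it out and check that it begins with the zip signature "PK" that also shows up in your curl output. If the check fails, the server most likely returned something other than the archive, such as an HTML error page.

```r
# Hypothetical helper (not part of RCurl): write a raw vector to disk and
# report whether it starts with the zip local-file-header signature "PK\003\004".
# Assumption: the archive is non-empty (an empty zip starts with "PK\005\006").
save_zip <- function(raw_data, path) {
  stopifnot(is.raw(raw_data))
  # as.vector() drops attributes (e.g. a Content-Type attribute) before writing
  writeBin(as.vector(raw_data), path)
  sig <- readBin(path, what = "raw", n = 4)
  identical(sig, as.raw(c(0x50, 0x4b, 0x03, 0x04)))
}

# Usage with the jfile object from the getURLContent() call above
# (needs network access, so it is left commented out here):
# save_zip(jfile, "PR280815.zip")
```

If save_zip() returns TRUE, the file can be passed to unzip() as usual.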
Here is a (more) complete example using httr instead of RCurl.
library(httr)
url <- "http://www.nseindia.com/archives/equities/bhavcopy/pr/PR280815.zip"
response <- GET(url, user_agent("Mozilla"))
status_code(response) # 200 OK
# [1] 200
tf <- tempfile()
writeBin(content(response, "raw"), tf) # write response content (the zip file) to a temporary file
files <- unzip(tf, exdir=tempdir()) # unzips to system temp directory and returns a vector of file names
df.lst <- lapply(files[grepl("\\.csv$",files)],read.csv) # convert .csv files to list of data.frames
head(df.lst[[2]])
#      SYMBOL SERIES                  SECURITY HIGH.LOW INDEX.FLAG
# 1 AGRODUTCH     EQ AGRO DUTCH INDUSTRIES LTD        H         NA
# 2    ALLSEC     EQ   ALLSEC TECHNOLOGIES LTD        H         NA
# 3      ALPA     BE     ALPA LABORATORIES LTD        H         NA
# 4      AMTL     EQ     ADV METERING TECH LTD        H         NA
# 5  ANIKINDS     BE       ANIK INDUSTRIES LTD        H         NA
# 6   ARSHIYA     EQ           ARSHIYA LIMITED        H         NA