I've been trying to do some web scraping with R, and for several pages it has been relatively easy. But I've been struggling for weeks with one particular web page:
https://www.commerzbank.de/de/hauptnavigation/kunden/kursinfo/devisenk/weitere_waehrungen___indikative_kurse/indikative_kurse.jsp
The problem, I think, is that the page ultimately loads its data using JavaScript.
At first I thought it was a very simple case: after all, it is just a link you put in the browser to see the data, so I assumed it was a good old HTTP GET request and naively tried something like this:
library(httr)
url <- "https://www.commerzbank.de/de/hauptnavigation/kunden/kursinfo/devisenk/weitere_waehrungen___indikative_kurse/indikative_kurse.jsp"
res1 <- GET(url = url)
As it didn't work, I checked how the web page actually works, and it is as follows. First, it sets some cookies and a couple of parameters, and then it redirects the browser (via an HTTP POST request) to the URL https://www.commerzbank.de/rates/do.rates. This new page loads a huge chunk of JavaScript (1923 lines of code, as formatted by http://jsbeautifier.org/) that is responsible for downloading the data and generating the HTML to display it. This code uses the cookies and parameters set by the original page to determine which data to download and display.
I've tried many things in R to get the data from this web page. I won't list all the crazy stuff I tried because it would be too long (and sometimes embarrassing), but I have played with most functions of RCurl and other packages (repmis, scrapeR, httr, rjson, among others). Nothing seems to work, because none of these packages appears to have a way to (at least automatically) run the JavaScript code that downloads the data.
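Just to give an idea of the kind of thing I mean, replicating the cookie handshake described above with httr would look roughly like this (the POST body below is a made-up placeholder, since the real parameters are generated by the page's JavaScript), and it still doesn't return the rendered table:

library(httr)

# one handle so the cookies set by the first request are reused by the second
h <- handle("https://www.commerzbank.de")

# step 1: the JSP page that sets the session cookies
res1 <- GET(handle = h,
            path = "/de/hauptnavigation/kunden/kursinfo/devisenk/weitere_waehrungen___indikative_kurse/indikative_kurse.jsp")

# step 2: the POST the browser is then redirected to
# (the body parameters here are hypothetical placeholders)
res2 <- POST(handle = h,
             path = "/rates/do.rates",
             body = list(param = "value"),
             encode = "form")

content(res2, as = "text")  # still no table: it is built client-side by the JavaScript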
Is there any package/hidden function that would help me accomplish this?
Thanks in advance.
Assuming that you want to scrape the data from the table in the middle of the page, here is a solution using RSelenium.
library(RSelenium)
library(magrittr)
library(XML)       # for htmlParse() and readHTMLTable()

base_url <- "https://www.commerzbank.de/de/hauptnavigation/kunden/kursinfo/devisenk/weitere_waehrungen___indikative_kurse/indikative_kurse.jsp"

checkForServer()   # download the Selenium server binary if it is not there yet
startServer()      # start the Selenium server

remDrv <- remoteDriver()
remDrv$open()
remDrv$navigate(base_url)

# the page source now contains the table rendered by the JavaScript
remDrv$getPageSource()[[1]] %>% htmlParse %>%
  readHTMLTable(header = TRUE) %>%
  extract2(1) %>% head
# ISO Land Mittelkurs Geld Brief
# 1 AFN Afghanistan 66,6600 65,6600 67,6600
# 2 ALL Albanien 140,2300 137,7300 142,7300
# 3 AMD Armenien 553,6000 523,6000 583,6000
# 4 ANG Curaçao, St. Martin (südl. Teil) 2,0392 1,9892 2,0892
# 5 AOA Angola 119,7755 116,7755 122,7755
# 6 ARS Argentinien 9,9598 9,8798 10,0398
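If you need the rates as numbers rather than strings, you can convert the German decimal commas afterwards, for example (column positions taken from the output above):

rates <- remDrv$getPageSource()[[1]] %>% htmlParse %>%
  readHTMLTable(header = TRUE) %>%
  extract2(1)

# "66,6600" -> 66.66: swap the German decimal comma for a point
rates[3:5] <- lapply(rates[3:5],
                     function(x) as.numeric(gsub(",", ".", as.character(x))))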
RSelenium even supports headless browsing leveraging PhantomJS as described in this vignette.
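A minimal sketch of that headless variant, assuming the Selenium server started above is still running and the phantomjs binary is on your PATH:

# connect to the running Selenium server, but drive PhantomJS instead of a GUI browser
remDrv <- remoteDriver(browserName = "phantomjs")
remDrv$open()
remDrv$navigate(base_url)
# ... parse the page source exactly as above ...
remDrv$close()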