I have a link like this one that I would like to extract data from using RCurl. There is a disclaimer page in front of it, and I need to click through it in my browser before I can access the data. Previously I used the script below, which is from here, to "bypass" the disclaimer page and access the data using RCurl:
library(RCurl)  # getURL, curlOptions
library(XML)    # htmlParse
pagesource <- getURL(url, .opts = curlOptions(followlocation = TRUE, cookiefile = "nosuchfile"))
doc <- htmlParse(pagesource)
It worked before, but in the last few days it has stopped working. I don't really understand what the code is doing. Do I have to change something in the curlOptions, or rewrite the whole piece of code?
Thanks.
As I mentioned in my comment, the solution to your problem will depend entirely on how the "disclaimer page" is implemented. It looks like the previous solution used cURL options defined in more detail here. Basically, it instructs cURL to provide a fake cookies file (named "nosuchfile") and then follow the header redirect given by the site you were trying to access. Apparently that site was set up in such a way that if a visitor claimed not to have the proper cookies, it would immediately redirect the visitor past the disclaimer page.
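To make those two options concrete, here is a minimal sketch of the same idea written with a reusable curl handle. The helper name fetch_past_disclaimer is made up for illustration, and the sketch assumes the site only needs cookies to survive the redirect. Passing an empty string as cookiefile should have the same effect as naming a file that doesn't exist: cURL keeps cookies in memory without loading any from disk.

```r
library(RCurl)
library(XML)

# Hypothetical helper (not part of RCurl): fetch a page while keeping
# cookies across redirects, then parse the HTML.
fetch_past_disclaimer <- function(url) {
  # One handle for the whole exchange, so any cookie set by the
  # disclaimer page is sent again when the redirect is followed.
  handle <- getCurlHandle(
    followlocation = TRUE,  # follow HTTP redirects
    cookiefile = ""         # enable cookie handling, load no cookies from disk
  )
  pagesource <- getURL(url, curl = handle)
  htmlParse(pagesource, asText = TRUE)
}
```

You would call it as doc <- fetch_past_disclaimer(url) with your own URL. If the site now requires a form submission or JavaScript, this will still return the disclaimer page.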
You didn't happen to create a file named "nosuchfile" in your working directory, did you? If not, it sounds like the target site changed the way its disclaimer page operates. If that's the case, we can't really help diagnose the problem without the actual page you're trying to access.
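If you want to check both possibilities yourself, something along these lines may help. It is only a rough sketch: inspect_redirects is a made-up name, and it assumes the list returned by getCurlInfo uses the field names response.code, effective.url and redirect.count.

```r
library(RCurl)

# Hypothetical helper: report whether "nosuchfile" exists locally and
# where a request to `url` actually ended up.
inspect_redirects <- function(url) {
  handle <- getCurlHandle(followlocation = TRUE, cookiefile = "")
  invisible(getURL(url, curl = handle))  # the body is not needed here
  info <- getCurlInfo(handle)            # details of the last transfer
  list(
    cookie_file_exists = file.exists("nosuchfile"),  # a stray local file?
    status    = info$response.code,   # final HTTP status code
    final_url = info$effective.url,   # URL after following redirects
    redirects = info$redirect.count   # number of redirects followed
  )
}
```

If final_url is still the disclaimer address, or redirects is 0, the site is probably no longer using a plain header redirect to get visitors past the disclaimer.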
In the example you reference in your question, they're using JavaScript to move past the disclaimer, which could be tricky to work around.
For the example you mention, however...
You can access that URL directly without having to accept any license agreement, either by hand or from cURL.
Note that if you've already accepted the agreement, this site stores a cookie recording that, and you will need to delete it in order to get back to the license agreement page. You can do this in your browser's developer tools by clicking the "Resources" tab, then going to "Cookies" and deleting each one, then refreshing the URL you posted above.