I have a problem when scraping a website with the getURL()
function from RCurl
. For example with http://dogecoin.com it returns an error saying that a NULL character is in the middle of the chain (litteral translation.
> x <- getURL("http://dogecoin.com/")
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
caractère nul au milieu de la chaîne : '\037\x8b\b\0\0\0\0\0\0\003\xed]\xebr\xdbF\x96\xfe\035=E\x9b\xa9\x89\xe4]\x82\xd4͗8\x92R\xb2|\x9d\x91-\x97\xa5\xac7\x95IiA\002$a\x81\0\x82\x8bhf2ﰯ\xb1\xaf\xb1\xfbb\xfb}\xa7\033WB\022E{\xa62\025\xa5*2I\xf4\xbdO\x9f\xcbw\xcei\xec\xdd{vrt\xf6\xe3\xbb\xe7j\x92N\xfd\x83\xb5\xbd\xfc\037\xd7v\016\xd6\024\xfeۻgY\xf2\xc1\xfcw~^\xf9\xb2\xf0\xf1\xdc\024=\xc7\177}\xd5\xc7_\xa5\xf8Y\xff\x91\xbf\xf2\x9fR\033\xe7\xf7\xf1\xaf.\xde\xc7\003\xa5\xe4\xef_\xe5\xef\177\xe1\xaf\xfe\x88V\xf4\xaf\xfa_)φ\xf1\177\xed/JI\177ů|\xae\xfaR\xfe\xaf\xe7\xe7\xdd\xf3>\xfe\xe21?+\xf9\\\xfe9Gs\xbabu\xa2ғ\xd4Y\x93\x9f\x93PM\xed\xf8"\x8bz\xeaҍ\xe7j\xe6\016\022/u{j\026\xcezR²\016\xd6־₩7nj\xcb\xf7\xaf\xf6R/\xf5݃g\xe1\xd8\035\x86^\xb0\xd7\xd7\xdf\xf1`\xca2É\035'n\xba\xdf\xc9ґ\xf5\xb8\xc3\n\xfa\xf70H\xdd\0\xbf\xe7\025\x95\x97(;Pa\xe4\006\030J\026\017]\025\xb9nl\xa5\xa1\xc5\177\x95㍽\xd4\xf6\xd50\x8bc7\030λjd_\x86\xb1\xeb\xa8\xc1\\\x9dN\xbc\x81\xad^\aY\x82\xd1
On some very rare occasions it returns a clean HTML code but most of the time I have this error. It seems related to their website and as you can see there are several weird characters like ₩ and 4͗.
An option could be to use getURLcontent()
to download the raw data but then I'm unable to convert the binary content into HTML.
I have try to change the .encoding
argument but it doesn't give the expected result. How could I scrape this webpage ?
EDIT : Verbose mode
> getURL("http://dogecoin.com/", verbose = TRUE)
* Trying 192.30.252.153...
* Connected to dogecoin.com (192.30.252.153) port 80 (#0)
> GET / HTTP/1.1
Host: dogecoin.com
Accept: */*
< HTTP/1.1 200 OK
< Server: GitHub.com
< Date: Wed, 25 Oct 2017 10:12:26 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Last-Modified: Tue, 16 May 2017 01:27:52 GMT
< Access-Control-Allow-Origin: *
< Expires: Wed, 25 Oct 2017 10:05:08 GMT
< Cache-Control: max-age=600
< Content-Encoding: gzip
< X-GitHub-Request-Id: A4D0:66A8:93356A1:D740FF7:59F0638A
<
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
caractère nul au milieu de la chaîne : '\037\x8b\b\0\0\0\0\0\0\003\xed]\xebr\xdbF\x96\xfe\035=E\x9b\xa9\x89\xe4]\x82\xd4͗8\x92R\xb2|\x9d\x91-\x97\xa5\xac7\x95IiA\002$a\x81\0\x82\x8bhf2ﰯ\xb1\xaf\xb1\xfbb\xfb}\xa7\033WB\022E{\xa62\025\xa5*2I\xf4\xbdO\x9f\xcbw\xcei\xec\xdd{vrt\xf6\xe3\xbb\xe7j\x92N\xfd\x83\xb5\xbd\xfc\037\xd7v\016\xd6\024\xfeۻgY\xf2\xc1\xfcw~^\xf9\xb2\xf0\xf1\xdc\024=\xc7\177}\xd5\xc7_\xa5\xf8Y\xff\x91\xbf\xf2\x9fR\033\xe7\xf7\xf1\xaf.\xde\xc7\003\xa5\xe4\xef_\xe5\xef\177\xe1\xaf\xfe\x88V\xf4\xaf\xfa_)φ\xf1\177\xed/JI\177ů|\xae\xfaR\xfe\xaf\xe7\xe7\xdd\xf3>\xfe\xe21?+\xf9\\\xfe9Gs\xbabu\xa2ғ\xd4Y\x93\x9f\x93PM\xed\xf8"\x8bz\xeaҍ\xe7j\xe6\016\022/u{j\026\xcezR²\016\xd6־₩7nj\xcb\xf7\xaf\xf6R/\xf5݃g\xe1\xd8\035\x86^\xb0\xd7\xd7\xdf\xf1`\xca2É\035'n\xba\xdf\xc9ґ\xf5\xb8\xc3\n\xfa\xf70H\xdd\0\xbf\xe7\025\x95\x97(;Pa\xe4\006\030J\026\017]\025\xb9nl\xa5\xa1\xc5\177\x95㍽\xd4\xf6\xd50\x8bc7\030λjd_\x86\xb1\xeb\xa8\xc1\\\x9dN\xbc\x81\xad^\aY\x82\xd1
>
RCurl::getURL()
seems to not be detecting either the Content-Encoding: gzip
header nor the tell-tale first two byte "magic" code that also signals the content is gzip encoded.
I would suggest — as Michael did — switching to httr
for reasons I'll go into in a bit, but this wld be a better httr
idiom:
library(httr)
res <- GET("http://dogecoin.com/")
content(res)
The content()
function extracts the raw response and returns an xml2
object which is similar to the XML
library parsed object that you would likely have been using given the use of RCurl::getURL()
.
The alternative way is to add some crutches to RCurl::getURL()
:
html_text_res <- RCurl::getURL("http://dogecoin.com/", encoding="gzip")
Here, we're explicitly informing getURL()
that the content is gzip'd, but that's fraught with peril in that if the upstream server decides to use, say, brotli encoding, then you'll get an error.
If you still want to use RCurl
vs switch to httr
I'd suggest doing the following for this site:
RCurl::getURL("http://dogecoin.com/",
encoding = "gzip",
httpheader = c(`Accept-Encoding` = "gzip"))
Here' were giving getURL()
the decoding crutch but also explicitly telling the upstream server that gzip is 👍 and that it should send data with that encoding.
However, httr
would be a better choice since it and the curl
package it uses deal with web server interaction and content in a more thorough way.