failed to scrape this website... why?
GET("ü") # Never finishes
GET(URLencode("ü")) # works fine
I tried with other websites that have ü in their URL:
To me it seems like a bug, but I don't know what the cause is. Am I missing something here?
My Session-Info is:
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] httr_1.0.0
loaded via a namespace (and not attached):
[1] R6_2.1.1 magrittr_1.5 tools_3.2.2 curl_0.9.4 stringi_1.0-1 stringr_1.0.0 XML_3.98-1.3
You can easily rule out R by testing the same URL with the curl command-line utility:
curl -Lvü
This looks like a server-side configuration issue. They are running some custom Tomcat/Java web application that keeps redirecting to the same URL:
* Connected to ( port 80 (#0)
> GET /1546/Bü HTTP/1.1
> Host:
> User-Agent: curl/7.43.0
> Accept: */*
< HTTP/1.1 302 Moved Temporarily
< Server: Apache-Coyote/1.1
< Set-Cookie: JSESSIONID=46E977E738A6DBC8BD0EB8084912163F.www1;; Path=/
< Location:ü
< Content-Length: 0
< Date: Wed, 16 Dec 2015 12:17:43 GMT
As you found out yourself, you can work around the problem by escaping the URL, but this should not be needed nowadays.
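As a rough sketch of that workaround, the snippet below percent-encodes the URL with base R's `URLencode()` before it would be handed to `GET()`. The address `http://example.com/1546/B\u00fc` is a made-up placeholder, since the real URL is not shown here, and the commented-out request with `httr::timeout(10)` is only a suggestion for keeping a redirect loop from hanging the session.

```r
# Hypothetical placeholder URL; the real address is not shown in the question.
url <- "http://example.com/1546/B\u00fc"

# Percent-encode the non-ASCII character as its UTF-8 bytes.
# With the default reserved = FALSE, characters such as ":" and "/"
# are left untouched, so the URL structure is preserved.
encoded <- URLencode(enc2utf8(url))
print(encoded)

# Possible usage (not run here, needs httr and network access).
# The timeout is an assumption added so that a redirect loop fails
# with an error instead of never finishing:
# resp <- httr::GET(encoded, httr::timeout(10))
# httr::status_code(resp)
```

Encoding the URL up front means the request does not depend on how the server interprets raw non-ASCII bytes in the request line.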