r web-scraping httr

GET never finishes for a URL containing the umlaut ü


httr's GET fails to fetch this website. Why?

library(httr)
GET("http://www.atelco.de/1546/Bügeln.search") # Never finishes
GET(URLencode("http://www.atelco.de/1546/Bügeln.search")) # works fine

I also tried another website that has ü in its URL:

GET("http://www.bosch-home.com/de/produkte/bügeln.html")

To me it seems like a bug, but I can't tell where it lies. Am I missing something here?

My session info:

R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.1 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] httr_1.0.0

loaded via a namespace (and not attached):
[1] R6_2.1.1      magrittr_1.5  tools_3.2.2   curl_0.9.4    stringi_1.0-1 stringr_1.0.0 XML_3.98-1.3 

Solution

  • You can easily rule out R by testing the same URL with the curl command line utility:

    curl -Lv http://www.atelco.de/1546/Bügeln.search
    

    This looks like a server-side configuration issue. The site runs some custom Tomcat/Java web application that keeps redirecting to the same URL:

    * Connected to www.atelco.de (81.7.220.137) port 80 (#0)
    > GET /1546/Bügeln.search HTTP/1.1
    > Host: www.atelco.de
    > User-Agent: curl/7.43.0
    > Accept: */*
    >
    < HTTP/1.1 302 Moved Temporarily
    < Server: Apache-Coyote/1.1
    < Set-Cookie: JSESSIONID=46E977E738A6DBC8BD0EB8084912163F.www1; Domain=.atelco.de; Path=/
    < Location: http://www.atelco.de/1546/Bügeln.search
    < Content-Length: 0
    < Date: Wed, 16 Dec 2015 12:17:43 GMT
    

    As you found out yourself, you can work around the problem by percent-encoding the URL before passing it to GET, but this should not be necessary nowadays.
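
To reproduce both observations from R, here is a minimal sketch. It assumes the httr package is installed; `peek_redirect` and `get_encoded` are hypothetical helper names introduced only for illustration and are not part of httr. The first helper asks for a single response without following redirects, so the 302 and its Location header can be inspected directly. The second applies the URLencode workaround from the question.

```r
# URL from the question. The umlaut is written as a Unicode escape so the
# script does not depend on the encoding of the source file.
url <- "http://www.atelco.de/1546/B\u00fcgeln.search"

# Percent-encode the non-ASCII characters. enc2utf8() ensures the bytes
# being encoded are UTF-8 regardless of the session locale.
encoded <- utils::URLencode(enc2utf8(url))
print(encoded)

# Hypothetical helper: request the URL once WITHOUT following redirects and
# return the status code plus the Location header (NULL if there is none).
peek_redirect <- function(u) {
  resp <- httr::GET(u, httr::config(followlocation = FALSE), httr::timeout(10))
  list(status = httr::status_code(resp),
       location = httr::headers(resp)$location)
}

# Hypothetical helper: the workaround from the question, with a timeout so
# a misbehaving server cannot block the session indefinitely.
get_encoded <- function(u) {
  httr::GET(utils::URLencode(enc2utf8(u)), httr::timeout(10))
}
```

Calling `peek_redirect(url)` against the server state shown in the curl trace would be expected to report status 302 with a Location equal to the requested URL, which is the loop GET gets stuck in. `encoded` should come out as `http://www.atelco.de/1546/B%C3%BCgeln.search`, since URLencode leaves reserved characters such as `:` and `/` untouched by default.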