Tags: curl, mediawiki, wget, wikipedia, wikimedia

Problems with `curl` on Wikimedia sites


I am unable to download any content from Wikimedia sites like Wikipedia and Wikiquote using curl.

When I try I get:

~$ /usr/bin/curl -v "http://en.wikipedia.org/wiki/Celsius"
*   Trying 2620:0:863:ed1a::1...
* TCP_NODELAY set
* Connected to en.wikipedia.org (2620:0:863:ed1a::1) port 80 (#0)
> GET /wiki/Celsius HTTP/1.1
> Host: en.wikipedia.org
> User-Agent: curl/7.52.1
> Accept: */*
> 
< HTTP/1.1 301 Moved Permanently
< Date: Fri, 19 May 2017 22:09:49 GMT
< Server: Varnish
< X-Varnish: 350654144
< X-Cache: cp4017 int
< X-Cache-Status: int
< Set-Cookie: WMF-Last-Access=19-May-2017;Path=/;HttpOnly;secure;Expires=Tue, 20 Jun 2017 12:00:00 GMT
< Set-Cookie: WMF-Last-Access-Global=19-May-2017;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Tue, 20 Jun 2017 12:00:00 GMT
< X-Client-IP: 2605:a601:1127:7d00:35a2:5040:e002:9949
< Location: https://en.wikipedia.org/wiki/Celsius
< Content-Length: 0
< Connection: keep-alive
< 
* Curl_http_done: called premature == 0
* Connection #0 to host en.wikipedia.org left intact

and no actual content. The same URL downloads fine with wget. I am also able to download other websites with curl. It is only the combination of curl and Wikimedia sites (Wikipedia, Wikiquote, ...) that causes this.
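
For reference, the wget call that succeeds is a plain one along these lines (the output filename is just an example):

~$ wget -O Celsius.html "http://en.wikipedia.org/wiki/Celsius"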

I am using Ubuntu MATE 17.04. My curl version is:

/usr/bin/curl --version
curl 7.52.1 (x86_64-pc-linux-gnu) libcurl/7.52.1 OpenSSL/1.0.2g zlib/1.2.11 libidn2/0.16 libpsl/0.17.0 (+libidn2/0.16) librtmp/2.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS IDN IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz TLS-SRP UnixSockets HTTPS-proxy PSL 

Any ideas what the problem might be?


Solution

  • In Chrome (and possibly other browsers) you have an option to copy a request as a curl command.

    Open the developer tools, refresh the page, and right-click the first request under the Network tab, then choose "Copy" → "Copy as cURL".

    Example that works:

    curl 'https://en.wikipedia.org/wiki/Celsius' -H 'pragma: no-cache' -H 'dnt: 1' -H 'accept-encoding: gzip, deflate, sdch, br' -H 'accept-language: en-US,en;q=0.8,ro;q=0.6,la;q=0.4' -H 'upgrade-insecure-requests: 1' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/531.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/511.36' -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' -H 'cache-control: no-cache' -H 'authority: en.wikipedia.org' -H 'cookie: WMF-Last-Access=19-May-2017; WMF-Last-Access-Global=19-May-2017' --compressed
    

    Curl version:

    curl 7.51.0
    
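    Most of those copied headers are not actually required, by the way; in practice, simply requesting the HTTPS URL directly should be enough (a minimal sketch of the same request):

    curl 'https://en.wikipedia.org/wiki/Celsius'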

    The reason your command doesn't work is that cURL needs to be instructed to follow redirects (you'll also notice the 301 in the sample you provided):

    curl -L http://en.wikipedia.org/wiki/Celsius
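
    If you also want to save the page to a file and silence the progress meter, a variation along these lines should work (the output filename is just an example):

    curl -sSL -o Celsius.html "http://en.wikipedia.org/wiki/Celsius"

    This is also why wget succeeded where curl didn't: wget follows HTTP redirects by default.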