bash cygwin wget

Using Wget with buggy URL


I've got the following link, which downloads a CSV file when opened in a web browser.

http://pro.allocine.fr/film/export_classement.html?typeaffichage=2&lsttype=1001&lsttypeperiode=3002&typedonnees=visites&cfilm=&datefiltre=

However, when I run Wget under Cygwin with the command below, it retrieves a file that is not a CSV file but a file with no extension. The file is empty; it contains no data at all.

wget 'http://pro.allocine.fr/film/export_classement.html?typeaffichage=2&lsttype=1001&lsttypeperiode=3002&typedonnees=visites&cfilm=&datefiltre='

Since I hate being stuck, I tried the following as well. I put the URL in a text file and ran Wget with the input-file option:

Contents of fic.txt:

'http://pro.allocine.fr/film/export_classement.html?typeaffichage=2&lsttype=1001&lsttypeperiode=3002&typedonnees=visites&cfilm=&datefiltre='

I used Wget in the following way:

wget -i fic.txt

I got the following errors:

 Scheme missing
 No URLs found in toto.txt
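
A likely explanation for the `Scheme missing` error, assuming fic.txt really contains the single quotes shown above: on the command line the shell strips the quotes before Wget sees the URL, but with `-i` Wget reads each line literally, so the URL starts with `'http` and has no recognizable scheme. A minimal sketch of a cleanup step follows; `clean_url_list` and `urls.txt` are hypothetical names chosen for illustration, not part of Wget.

```shell
#!/bin/sh
# Hypothetical helper: remove one leading and one trailing quote
# (single or double) from every line of a URL list.
# $1: input list, possibly with quoted URLs
# $2: cleaned output list, suitable for `wget -i`
clean_url_list() {
    sed -e "s/^[[:space:]]*['\"]//" -e "s/['\"][[:space:]]*\$//" "$1" > "$2"
}

# Usage (the wget call needs network access):
#   clean_url_list fic.txt urls.txt
#   wget -i urls.txt
```

Only the surrounding quotes are removed, so a quote character that is legitimately part of a URL is left alone.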

Solution

  • A few extra options make the underlying problem clearer: the response is declared as HTML, but it has no content (Content-Length: 0).

    More concretely, this

    wget -S -O export_classement.html 'http://pro.allocine.fr/film/export_classement.html?typeaffichage=2&lsttype=1001&lsttypeperiode=3002&typedonnees=visites&cfilm=&datefiltre='
    

    produces this

    Resolving pro.allocine.fr... 62.39.143.50
    Connecting to pro.allocine.fr|62.39.143.50|:80... connected.
    HTTP request sent, awaiting response... 
      HTTP/1.1 200 OK
      Server: nginx
      Date: Fri, 28 Mar 2014 09:54:44 GMT
      Content-Type: text/html; Charset=iso-8859-1
      Connection: close
      X-ServerName: WEBNX2
      akamainocache: no-store
      Content-Length: 0
      Cache-control: private
      X-KompressorName: kompressor7
    Length: 0 [text/html]
    
    2014-03-28 05:54:52 (0.00 B/s) - ‘export_classement.html’ saved [0/0]
    

    Additionally, the server tailors its output to how the client identifies itself. Wget has an option to send an arbitrary user-agent in the request headers. Here's an example of what happens when you make wget identify itself as Chrome. Other browser user-agent strings are possible too.

    wget -S --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" 'http://pro.allocine.fr/film/export_classement.html?typeaffichage=2&lsttype=1001&lsttypeperiode=3002&typedonnees=visites&cfilm=&datefiltre='
    

    Now the server responds with an attachment named export.csv, with Content-Type "application/octet-stream" instead of "text/html":

    HTTP request sent, awaiting response... 
     HTTP/1.1 200 OK
     Server: nginx
     Date: Fri, 28 Mar 2014 10:34:09 GMT
     Content-Type: application/octet-stream; Charset=iso-8859-1
     Transfer-Encoding: chunked
     Connection: close
     X-ServerName: WEBNX2
     Edge-Control: no-store
     Last-Modified: Fri, 28 Mar 2014 10:34:17 GMT
     Content-Disposition: attachment; filename=export.csv
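
To actually save the download under the name the server suggests, the user-agent option can be combined with `--content-disposition`, which GNU Wget documents as experimental. The sketch below assumes a reasonably recent GNU Wget; `fetch_export` and `UA` are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Browser-like user-agent string, copied from the command above.
UA='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'

# Hypothetical wrapper around wget (not a wget feature itself).
# $1: URL to download
fetch_export() {
    # -S                     print the server's response headers
    # --content-disposition  name the local file after the
    #                        Content-Disposition header, if present
    # --user-agent           identify as a browser instead of Wget
    wget -S --content-disposition --user-agent="$UA" "$1"
}

# Usage (needs network access):
#   fetch_export 'http://pro.allocine.fr/film/export_classement.html?typeaffichage=2&lsttype=1001&lsttypeperiode=3002&typedonnees=visites&cfilm=&datefiltre='
```

If `--content-disposition` is not available, passing `-O export.csv` instead sets the output name explicitly.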