Tags: linux, utf-8, character-encoding, wget, ansi

Offensive-security files content is unreadable with wget


I'm trying to download some URLs using wget. The downloads work without problems except for https://www.offensive-security.com/pwbonline/icq.html and any other page on www.offensive-security.com.

I tried on both Linux and Windows, with many attempts and a lot of searching, but in vain.

I use this command:

    wget https://www.offensive-security.com/pwbonline/icq.html

The resulting file shows only unreadable symbols, and my editor treats it as ANSI-encoded.

How can I solve this problem?


Solution

  • For some reason, the server does not return the plain HTML page but a gzip-compressed version of it. The file utility identifies the downloaded file as gzip compressed data:

    $ file icq.html
    icq.html: gzip compressed data, from Unix
    

    So you can simply decompress it with a gzip tool such as gunzip (not unzip, which handles .zip archives) and you get the correct HTML page.

    Why the server does that: I'm not sure, but it is probably a default setting that has been left as is. Compressed responses are smaller, so they download faster.

    How to download the HTML content directly: probably by sending a common user agent and header, so that the server thinks the request comes from an ordinary web browser instead of a download tool.

    This can be done with wget using some options, for example, this should work:

    wget --header="Accept: text/html" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" https://www.offensive-security.com/pwbonline/icq.html
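
If you already have the compressed file, or the header trick does not help, the decompression step mentioned above can be scripted. The following is only a minimal sketch: the function name gunzip_if_needed is made up for illustration, and it assumes a POSIX shell with head, od, tr and gzip available. It checks for the two gzip magic bytes (0x1f 0x8b) rather than relying on the file name, because gunzip normally refuses files that do not end in .gz.

```shell
# Hypothetical helper (not part of wget): decompress a downloaded file
# in place if, and only if, it is actually gzip data.
gunzip_if_needed() {
    f="$1"
    # Every gzip stream starts with the magic bytes 0x1f 0x8b.
    magic=$(head -c 2 "$f" | od -An -tx1 | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then
        # Read via stdin so gzip does not complain about the missing
        # .gz suffix, then replace the original file.
        gzip -dc < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    fi
}
```

After the download, you would call it as gunzip_if_needed icq.html. Files that are already plain HTML are left untouched. Newer wget releases (1.19.2 and later, as far as I know) also have a --compression=auto option that handles gzip-encoded responses during the download itself, which may make the helper unnecessary.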