Tags: bash, url, curl, wget, forum

WGET saves file with wrong name and extension, possibly due to BASH


I've tried the suggestions from a few forum threads already, but I keep getting the same failure as a result.

To replicate the problem:

Here is a URL leading to a forum thread with 6 pages.

http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/1/vc/1

What I typed into the console was:

wget "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/{1..6}/vc/1"

And here is what I got:

      --2018-06-14 10:44:17--  http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/%7B1..6%7D/vc/1
    Resolving forex.kbpauk.ru (forex.kbpauk.ru)... 185.68.152.1
    Connecting to forex.kbpauk.ru (forex.kbpauk.ru)|185.68.152.1|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [text/html]
    Saving to: '1'

1                                    [  <=>                                       ]  19.50K  58.7KB/s    in 0.3s

2018-06-14 10:44:17 (58.7 KB/s) - '1' saved [19970]

It seems the file was saved simply as "1", with no extension.

My expectation was that the file would be saved with an .html extension, because it's a webpage.

I'm trying to get wget to work, but if it's possible to do what I want with curl then I would also accept that as an answer.


Solution

  • Well, there are a couple of issues with what you're trying to do.

    1. The double quotes around your URL prevent Bash brace expansion, so you're not really downloading 6 pages but a single URL with the literal text "{1..6}" in it (you can see it percent-encoded as %7B1..6%7D in your log). Drop the quotes around the URL to let Bash expand it into 6 separate arguments, or loop over the page numbers yourself, as sketched after this list.

    2. I notice that all of the downloaded pages are called "1", irrespective of their actual page numbers. That's because the last component of every URL path is the trailing /vc/1, and Wget names the output file after that last path component, which makes it very hard for Wget or any other tool to produce a usable local copy of the thread.
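
    If you just drop the quotes, Bash will expand {1..6} and Wget will fetch all 6 pages, but because of that trailing /vc/1 they will still be saved as 1, 1.1, 1.2 and so on. A minimal sketch that names each page explicitly with -O (the page_N.html names are just an example, pick whatever you like):

    # Fetch the 6 forum pages and give each one an explicit .html name.
    for i in {1..6}; do
        wget -O "page_${i}.html" \
            "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/${i}/vc/1"
    done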

    The real way to create a mirror of the forum would be to use this command line:

    $ wget -m --no-parent -k --adjust-extension http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/1

    Let me explain what this command does:

    -m / --mirror turns on mirror mode (recursion with time-stamping and infinite depth)
    --no-parent tells Wget not to ascend above the directory it starts from
    -k / --convert-links rewrites the downloaded HTML so its links point to the other pages you have also downloaded, letting you browse the forum pages locally without needing to be online
    --adjust-extension is the option you were originally looking for: it makes Wget save the file with a .html extension when it downloads a text/html document and the server did not provide an extension
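
    Since you mentioned curl would also be acceptable: curl has its own URL globbing, so a sketch of an equivalent one-liner (the page_#1.html output name is just an example) would be:

    $ curl -o "page_#1.html" "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/[1-6]/vc/1"

    Here curl itself expands [1-6] into the six page URLs, and #1 in the -o template is replaced with the current number, so you get page_1.html through page_6.html. Note that curl won't rewrite the links between pages the way wget -k does.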