regex shell cygwin wget http-status-code-403

Using regex to download an entire directory with wget


I want to download multiple PDFs from URLs such as this one: https://dummy.site.com/aabbcc/xyz/2017/09/15/2194812/O7ca217a71ac444eda516d8f78c29091a.pdf

If I run wget on the complete URL, it downloads the file:

wget https://dummy.site.com/aabbcc/xyz/2017/09/15/2194812/O7ca217a71ac444eda516d8f78c29091a.pdf

But if I try to download the entire folder recursively, it returns 403 (Forbidden):

wget -r https://dummy.site.com/aabbcc/xyz/

I have tried setting the user agent, ignoring robots.txt, and a bunch of other solutions from the internet, but I keep coming back to the same point.

So I want to build a list of all possible URLs, using the given URL as the common pattern, but I have no idea how to do that.

I do know that I can pass such a list to wget as an input file, and it will download the files in it. So I'm looking for help forming the URL list with a regex. Thank you!


Solution

  • You can't use a wildcard to download files you can't see. If the host does not support directory listing, you have no way of knowing what the filenames or paths are. And since you don't know the algorithm used to generate the filenames, you can't generate the URLs yourself and fetch them either.
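
A regex can only help if the PDF links already appear somewhere you are able to fetch, for example on a page that links to them. The question does not say such a page exists, so that is an assumption. Under it, a minimal sketch could pull the matching URLs out of the saved HTML and write them to a list for wget. The file names index.html and urls.txt and the function name extract_pdf_urls are hypothetical, and the pattern is only guessed from the single example URL in the question:

```shell
# Assumption: index.html is a saved copy of some page that links to the PDFs.
# The pattern is inferred from one example URL:
#   /YYYY/MM/DD/<digits>/<alphanumeric name>.pdf
pattern='https://dummy\.site\.com/aabbcc/xyz/[0-9]{4}/[0-9]{2}/[0-9]{2}/[0-9]+/[A-Za-z0-9]+\.pdf'

extract_pdf_urls() {
    # grep -o prints only the matched text, -E enables extended regex syntax;
    # sort -u removes duplicate URLs.
    grep -oE "$pattern" "$1" | sort -u
}

# Usage (left commented out because it needs the saved page and network access):
# extract_pdf_urls index.html > urls.txt
# wget -i urls.txt
```

The -i option makes wget read the URLs to download from the given file, one per line. If no reachable page contains the links, there is nothing for the regex to match, and the point made above still stands.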