How to wget a file without getting the HTML page instead?


I'm trying to download a file using:

wget https://huggingface.co/distilbert-base-uncased/blob/main/vocab.txt

I expect to get the .txt file, but I get the page's HTML instead.

I tried wget --max-redirect=2 --trust-server-names <url>, based on suggestions I found elsewhere, and wget -m <url>, which mirrors the entire website, as well as a few other variations. None of them worked.


Solution

  • wget https://huggingface.co/distilbert-base-uncased/blob/main/vocab.txt
    

    This URL points wget to an HTML page, even though it ends in .txt. After visiting it, I found a link to the text file itself under "raw", which you should be able to use with wget as follows:

    wget https://huggingface.co/distilbert-base-uncased/raw/main/vocab.txt
    

    If you need to reveal the true type of a file without downloading it, you can use the --spider option. In this case,

    wget --spider https://huggingface.co/distilbert-base-uncased/blob/main/vocab.txt
    

    gives output containing

    Length: 7889527 (7,5M) [text/html]
    

    and

    wget --spider https://huggingface.co/distilbert-base-uncased/raw/main/vocab.txt
    

    gives output containing

    Length: 231508 (226K) [text/plain]
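
The two steps above (swap blob for raw, then check the reported type) can be sketched as a pair of small shell helpers. The function names blob_to_raw and content_type_of are hypothetical, and the sketch assumes that other file URLs on the site follow the same blob/raw pattern as the vocab.txt URLs shown here, which the answer does not confirm.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the first "/blob/" path segment to "/raw/".
# Assumption: other file URLs follow the same blob/raw pattern as the
# vocab.txt URLs in the answer above.
blob_to_raw() {
  printf '%s\n' "$1" | sed 's#/blob/#/raw/#'
}

# Hypothetical helper: read wget's log on stdin and print the MIME type from
# the last "Length: ... [type]" line (a redirect may produce several).
content_type_of() {
  sed -n 's/^Length:.*\[\([^]]*\)\]$/\1/p' | tail -n 1
}
```

Because wget writes its log to stderr, redirect it before piping:

    url=$(blob_to_raw https://huggingface.co/distilbert-base-uncased/blob/main/vocab.txt)
    wget --spider "$url" 2>&1 | content_type_of

If the server still responds as shown above, this should report text/plain for the raw URL and text/html for the original blob URL.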