Tags: python, python-3.x, django, urllib, wget

wget Python package downloads/saves XML without issue, but not text or html files


I have been using the basic code below to download and store updated sitemaps from a hosting/crawling service, and it works fine for all of the XML files. The text and HTML files, however, come out unreadable, as if saved in the wrong encoding (screenshots attached). Forcing every file to a single encoding (UTF-8) changes nothing: whichever encoding I use, the TXT and HTML files stay unreadable while the XML files are fine.

I'm using Python 3.10, Django 3.0.9, and the latest available version of the wget Python package (3.2) on Windows 11. I've also tried urllib and other packages, with the same results.

The code:

import os

import wget
from django.conf import settings

sitemaps = ["https://.../sitemap.xml",
            "https://.../sitemap_images.xml",
            "https://.../sitemap_video.xml",
            "https://.../sitemap_mobile.xml",
            "https://.../sitemap.html",
            "https://.../urllist.txt",
            "https://.../ror.xml"]

def download_and_save(url):
    # Save into the first configured static files directory.
    save_dir = settings.STATICFILES_DIRS[0]
    # Use the last URL path segment as the local file name.
    filename = url.split("/")[-1]
    full_path = os.path.join(save_dir, filename)
    # Remove any stale copy so the new download replaces it.
    if os.path.exists(full_path):
        os.remove(full_path)
    wget.download(url, full_path)

for url in sitemaps:
    download_and_save(url)

For all of the XML files, I get this, which is the correct result:

[image: downloaded XML file using wget]

For the urllist.txt and sitemap.html files, however, this is the result:

[screenshot: HTML encoding from wget]

I'm not sure why the XML files save fine, but the encoding is messed up for text (.txt) and html files only.
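One way to narrow this down is to look at the first bytes of a saved file instead of opening it in an editor. The sketch below is only illustrative: `sniff_payload` and `sniff_file` are hypothetical helper names, not part of wget or urllib. It assumes the usual suspects are a body that was saved while still gzip-compressed, or a byte order mark from a different text encoding.

```python
import gzip
from pathlib import Path


def sniff_payload(data: bytes) -> str:
    """Guess what the leading bytes of a downloaded file represent."""
    if data[:2] == b"\x1f\x8b":
        # gzip magic number: the body was saved without being decompressed
        return "gzip"
    if data[:3] == b"\xef\xbb\xbf":
        return "utf-8-bom"
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return "utf-16"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"


def sniff_file(path: str) -> str:
    """Read a saved file as raw bytes and classify it."""
    return sniff_payload(Path(path).read_bytes())
```

Calling `sniff_file` on the saved urllist.txt and on one of the working XML files would show whether the two kinds of file differ at the byte level, which separates a compression problem from a character encoding problem.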


Solution

After speaking with the sitemap hosting provider (pro-sitemaps.net), it appears that the problem was on their end. The HTML and TXT files I was downloading were being served with the wrong encoding (or something similar). The files displayed correctly in a browser when opened from their direct URLs, but they were apparently not being served with the right encoding to wget.

I submitted a ticket to the provider and the issue was resolved within 12 hours, though I never got confirmation of the exact cause. I have now verified that the TXT and HTML files are served in the correct encoding/format when downloaded via wget.
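The provider never confirmed the exact cause, so the following is only a sketch of how the served encoding could be checked from Python. It assumes the relevant response headers are `Content-Type` (for the charset) and `Content-Encoding` (for compression). `decode_body` and `fetch_text` are hypothetical helper names. urllib does not decompress a gzip body on its own, so the sketch does that step explicitly.

```python
import gzip
import urllib.request


def decode_body(raw: bytes, content_encoding: str = "", charset: str = "utf-8") -> str:
    """Undo gzip compression if the server declared it, then decode to text."""
    if content_encoding.strip().lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


def fetch_text(url: str) -> str:
    """Download url, print the headers that affect decoding, return the text."""
    with urllib.request.urlopen(url) as resp:
        content_encoding = resp.headers.get("Content-Encoding", "")
        # Fall back to UTF-8 when the server does not declare a charset (assumption).
        charset = resp.headers.get_content_charset() or "utf-8"
        print(url, resp.headers.get("Content-Type"), content_encoding or "(none)")
        return decode_body(resp.read(), content_encoding, charset)
```

Running `fetch_text` against one XML URL and one TXT URL from the list in the question would show whether the server labels the two kinds of file differently.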