Tags: python, directory, wget, filenotfounderror

Python wget fails with a .tmp FileNotFoundError on specific URLs (web crawling)


Hello, I have a strange problem using wget in Python and would be grateful for any help.

What I want to do:

Download files ('.pdf', '.djvu') from a specific website (e.g. a wiki) with wget in Python, which should be easy.

The full target page

The specific page I'm trying to crawl

Getting the file link for wget

Problem:

It's really strange. On most pages of the website, it works well.

But on some pages with the same HTML structure, it doesn't work.

Even on the same page, some files download fine with wget while others don't,

and I get the error message below.


Error message:

C:\start_automation\crawling_job>C:/Users/sa031/AppData/Local/Programs/Python/Python311/python.exe c:/start_automation/crawling_job/download_test.py
Traceback (most recent call last):
  File "c:\start_automation\crawling_job\download_test.py", line 39, in <module>
    wget.download(url)
  File "C:\Users\sa031\AppData\Local\Programs\Python\Python311\Lib\site-packages\wget.py", line 303, in download
    (fd, tmpfile) = tempfile.mkstemp(".tmp", prefix=prefix, dir=".")
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\sa031\AppData\Local\Programs\Python\Python311\Lib\tempfile.py", line 341, in mkstemp
    return _mkstemp_inner(dir, prefix, suffix, flags, output_type)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\sa031\AppData\Local\Programs\Python\Python311\Lib\tempfile.py", line 256, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\start_automation\\crawling_job\\CADAL06210101_%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E6%98%93%E5%9C%96%E6%A2%9D%E8%BE%AE%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E8%99%9E%E6%B0%8F%E6%98%93%E4%BA%8B.djvu.3kii8ipd.tmp'

What I have done:

Googled, and tested several different pages on the wiki.

Asked ChatGPT and got the code below, which uses an absolute path, but it doesn't work either:

import os
import wget

def download_file(url, save_path):
    try:
        print("Downloading file...")
        wget.download(url, save_path)
        print("\nDownload complete!")
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    # URL of the file to download
    file_url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/CADAL06210101_%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%B6%8C%E7%B7%A8%EF%BC%9A%E6%98%93%E5%9C%96%E6%A2%9D%E8%BE%AE%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%B6%8C%E7%B7%A8%EF%BC%9A%E8%99%9E%E6%B0%8F%E6%98%93%E4%BA%8B.djvu"
    
    # Specify an absolute path for saving the file
    save_location = os.path.join(os.getcwd(), "downloaded_file.djvu")
    
    # Call the function to download the file
    download_file(file_url, save_location)

The code:

The code below, with the URL included, is the one that fails.

import wget

url='https://upload.wikimedia.org/wikipedia/commons/a/a7/CADAL06210101_%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E6%98%93%E5%9C%96%E6%A2%9D%E8%BE%AE%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E8%99%9E%E6%B0%8F%E6%98%93%E4%BA%8B.djvu'

wget.download(url)

My guess:

.djvu.3kii8ipd.tmp'

Maybe the problem is this odd .tmp name shown in the error message, but I have no idea.
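
For context, the .tmp name comes from the tempfile.mkstemp(".tmp", prefix=prefix, dir=".") call visible in the traceback: mkstemp builds the file name from the prefix, a run of random characters, and the suffix. Below is a minimal sketch of that naming, assuming a short made-up prefix ("example.djvu.") and a throwaway directory rather than the real download folder:

```python
import os
import tempfile

# "example.djvu." is a hypothetical prefix chosen for illustration;
# in the traceback the prefix is derived from the URL by wget.py.
with tempfile.TemporaryDirectory() as workdir:
    # Same argument order as in the traceback: suffix first, then prefix and dir.
    fd, tmp_path = tempfile.mkstemp(".tmp", prefix="example.djvu.", dir=workdir)
    os.close(fd)  # mkstemp returns an open OS-level file descriptor
    tmp_name = os.path.basename(tmp_path)
    os.remove(tmp_path)

# The name is prefix + random characters + suffix; the random part varies per run.
print(tmp_name)
```

So the random-looking part of the name is expected behaviour of mkstemp and not, by itself, a sign of corruption.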

Thanks for reading. I really appreciate any help.


Solution

  • The Python wget package appears to require a temporary file in the current directory, for reasons I do not understand. The package was last updated 9 years ago and may no longer be robust.

    You can download the file reliably and easily with requests as follows:

    import requests
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
    }
    
    url = "https://upload.wikimedia.org/wikipedia/commons/a/a7/CADAL06210101_%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E6%98%93%E5%9C%96%E6%A2%9D%E8%BE%AE%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E8%99%9E%E6%B0%8F%E6%98%93%E4%BA%8B.djvu"
    output_file = "downloaded_file.djvu"
    chunk_size = 4096  # usually a good chunk size
    
    # stream=True avoids loading the whole file into memory at once
    with requests.get(url, headers=headers, stream=True) as response:
        response.raise_for_status()  # fail loudly on HTTP errors such as 404
        with open(output_file, "wb") as output:
            for chunk in response.iter_content(chunk_size):
                output.write(chunk)
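
One possible explanation for why only some files fail, which nothing in the question confirms: the temporary file in the traceback is named after the percent-encoded URL segment, so long non-ASCII titles produce very long paths, and on Windows a path of 260 characters or more can fail with FileNotFoundError when long-path support is not enabled. The sketch below uses only the standard library to measure the path from the traceback and to show the much shorter name obtained by decoding the percent-escapes. The helper names encoded_name and decoded_name and the constant WINDOWS_MAX_PATH are made up for illustration.

```python
import posixpath
from urllib.parse import unquote, urlsplit

# Classic Win32 path limit (counting the terminating NUL). Assumption: it
# applies here only if long-path support is not enabled on the machine.
WINDOWS_MAX_PATH = 260


def encoded_name(url: str) -> str:
    """Return the last URL path segment as written (still percent-encoded)."""
    return posixpath.basename(urlsplit(url).path)


def decoded_name(url: str) -> str:
    """Return the last URL path segment with percent-escapes decoded as UTF-8."""
    return unquote(encoded_name(url))


# The failing URL from the question, split only for readability.
url = (
    "https://upload.wikimedia.org/wikipedia/commons/a/a7/CADAL06210101_"
    "%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E6%98%93%E5%9C%96%E6%A2%9D%E8%BE%AE"
    "%E7%9A%87%E6%B8%85%E7%B6%93%E8%A7%A3%E7%BA%8C%E7%B7%A8%EF%BC%9A%E8%99%9E%E6%B0%8F%E6%98%93%E4%BA%8B"
    ".djvu"
)

# Rebuild the temp-file path reported in the traceback:
# working directory + encoded name + random part + ".tmp".
temp_path = "C:\\start_automation\\crawling_job\\" + encoded_name(url) + ".3kii8ipd.tmp"

print("temp path length:", len(temp_path))
print("at or over the limit:", len(temp_path) >= WINDOWS_MAX_PATH)
print("decoded name length:", len(decoded_name(url)))
```

If the length turns out to be the cause, using decoded_name(url) (or any short fixed name) as output_file in the requests example keeps the saved path well under the limit.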