
File (.tar.gz) download and processing using the urllib and requests packages in Python


SCOPE: Which library to use, urllib or requests? I was trying to download a log file available at a URL. The URL was hosted on AWS and contained the file name. Accessing the URL returns a .tar.gz file to download. I needed to download this file into a directory of my choice, decompress and untar it to reach the JSON file inside, and finally parse that JSON file. While searching the internet I found only scattered pieces of information, so in this question I try to consolidate them in one place.


Solution

  • Using the REQUESTS library: a PyPI package, generally considered more convenient than urllib, particularly when handling a high volume of HTTP requests. References:

    1. https://docs.python.org/3/library/urllib.request.html#module-urllib.request
    2. What are the differences between the urllib, urllib2, urllib3 and requests module?

    CODE:

    import requests
    import urllib.request
    import tempfile
    import shutil
    import tarfile
    import json
    import os
    import re
    
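    # respurl (the download URL) and path (the extraction directory) are built as shown further below.
    # stream=True defers downloading the body, so iter_content can read it in chunks
    # instead of loading the whole archive into memory.
    with requests.get(respurl, stream=True) as response:
        response.raise_for_status()
        # delete=False keeps the temporary file on disk so tarfile can reopen it by name.
        with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
            for chunk in response.iter_content(chunk_size=8192):
                tmp_file.write(chunk)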
    
    with tarfile.open(tmp_file.name,"r:gz") as tf:
        # To save the extracted file in directory of choice with same name as downloaded file.
        tf.extractall(path)
        # for loop for parsing json inside tar.gz file.
        for tarinfo_member in tf:
            print("tarfilename", tarinfo_member.name, "is", tarinfo_member.size, "bytes in size and is", end="")
            if tarinfo_member.isreg():
                print(" a regular file.")
            elif tarinfo_member.isdir():
                print(" a directory.")
            else:
                print(" something else.")
            if os.path.splitext(tarinfo_member.name)[1] == ".json":
                print("json file name:",os.path.splitext(tarinfo_member.name)[0])
                json_file = tf.extractfile(tarinfo_member)
                # capturing json file to read its contents and further processing.
                content = json_file.read()
                json_file_data = json.loads(content)
                print("Status Code",json_file_data[0]['status_code'])
                print("Response Body",json_file_data[0]['response'])
                # Had to decode content again as it was double encoded.
                print("Errors:",json.loads(json_file_data[0]['response'])['errors'])
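
The JSON handling above can also be tried without a network connection. The following is a minimal sketch, not the original code: it builds a small in-memory .tar.gz with made-up data (the member name "sample-response.json" and the record contents are assumptions for illustration), reads the JSON member directly with extractfile, and decodes the nested "response" string a second time.

```python
import io
import json
import tarfile


def build_sample_archive() -> bytes:
    """Build a small in-memory .tar.gz holding one JSON log (made-up data)."""
    # The "response" field is itself a JSON string, mimicking the double encoding.
    record = [{"status_code": 400, "response": json.dumps({"errors": ["bad input"]})}]
    payload = json.dumps(record).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tf:
        info = tarfile.TarInfo(name="sample-response.json")
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_json_members(archive_bytes: bytes) -> dict:
    """Return {member name: parsed JSON} for every regular .json member."""
    parsed = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tf:
        for member in tf:
            if member.isreg() and member.name.endswith(".json"):
                # extractfile returns a file-like object for regular files.
                with tf.extractfile(member) as fh:
                    parsed[member.name] = json.load(fh)
    return parsed


logs = read_json_members(build_sample_archive())
first = logs["sample-response.json"][0]
errors = json.loads(first["response"])["errors"]  # second decode of the nested string
print(first["status_code"], errors)  # 400 ['bad input']
```

Checking isreg() before calling extractfile avoids trying to read directories or other special members as files.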
    
    
    

    To save the extracted files in a directory of choice with the same name as the downloaded file, the variable 'path' is formed as follows.

    Here is a sample URL containing the file name '44301621eb-response.tar.gz':

    https://yoursite.com/44301621eb-response.tar.gz?AccessKeyId=your_id&Expires=1575526260&Signature=you_signature

    BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    PROJECT_NAME = 'your_project_name'
    PROJECT_ROOT = os.path.join(BASE_DIR, PROJECT_NAME)
    LOG_ROOT = os.path.join(PROJECT_ROOT, 'log')
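    # respurl is the URL from which the file will be downloaded.
    # A raw string keeps "\?" from being treated as an invalid string escape.
    filename = re.split(r"([^?]+)(?:.+/)([^#?]+)(\?.*)?", respurl)
    # filename[2] holds the second capture group, i.e. the file name in the URL.
    path = os.path.join(LOG_ROOT, filename[2])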
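
As one possible alternative to the regular expression, the file name can be taken from the URL's path component using only the standard library. This is a sketch under the assumption that the file name is always the last path segment, as in the sample URL; the helper name filename_from_url is hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring query string and fragment."""
    return posixpath.basename(urlsplit(url).path)


sample_url = (
    "https://yoursite.com/44301621eb-response.tar.gz"
    "?AccessKeyId=your_id&Expires=1575526260&Signature=you_signature"
)
print(filename_from_url(sample_url))  # 44301621eb-response.tar.gz
```

Because urlsplit separates the path from the query string before the name is taken, the result does not depend on what the query parameters contain.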
    

    [Image: regex match output from regex101.com]

    Comparison with urllib

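    To understand the subtle differences, I implemented the same code with urllib as well.

    Notice that the usage of the tempfile library is slightly different here, which is what worked for me. With urllib I had to use the shutil library, whereas with requests the shutil copyfileobj method did not work for me, because the response objects returned by urllib and requests differ: the object returned by urlopen is file-like and can be passed to copyfileobj directly, while the requests Response object is not (its underlying stream is exposed separately as response.raw).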

    with urllib.request.urlopen(respurl) as File:
        with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
            shutil.copyfileobj(File, tmp_file)
    
    with tarfile.open(tmp_file.name,"r:gz") as tf:
        print("Temp tf File:", tf.name)
        tf.extractall(path)
        for tarinfo in tf:
            print("tarfilename", tarinfo.name, "is", tarinfo.size, "bytes in size and is", end="")
            if tarinfo.isreg():
                print(" a regular file.")
            elif tarinfo.isdir():
                print(" a directory.")
            else:
                print(" something else.")
            # The loop variable here is tarinfo (not tarinfo_member as in the requests version).
            if os.path.splitext(tarinfo.name)[1] == ".json":
                print("json file name:",os.path.splitext(tarinfo.name)[0])
                json_file = tf.extractfile(tarinfo)
                # capturing json file to read its contents and further processing.
                content = json_file.read()
                json_file_data = json.loads(content)
                print("Status Code",json_file_data[0]['status_code'])
                print("Response Body",json_file_data[0]['response'])
                # Had to decode content again as it was double encoded.
                print("Errors:",json.loads(json_file_data[0]['response'])['errors'])
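
The urllib steps above can be wrapped in a reusable function. The following is a sketch rather than the original implementation: the function name download_and_extract is hypothetical, the temporary file is removed once extraction finishes (with delete=False it would otherwise stay on disk), and the "data" extraction filter is applied only where the running Python version provides it. The demonstration uses a local file:// URL with made-up content so that it runs without network access.

```python
import json
import os
import pathlib
import shutil
import tarfile
import tempfile
import urllib.request


def download_and_extract(url: str, dest_dir: str) -> list:
    """Download a .tar.gz from url, extract it into dest_dir, return member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        with tempfile.NamedTemporaryFile(suffix=".tar.gz", delete=False) as tmp_file:
            # The urlopen response is file-like, so it can be copied directly.
            shutil.copyfileobj(response, tmp_file)
    try:
        with tarfile.open(tmp_file.name, "r:gz") as tf:
            if hasattr(tarfile, "data_filter"):
                # Extraction filters exist only in newer Python releases.
                tf.extractall(dest_dir, filter="data")
            else:
                tf.extractall(dest_dir)
            return tf.getnames()
    finally:
        # delete=False leaves the temporary file behind, so remove it explicitly.
        os.remove(tmp_file.name)


# Demonstration with a locally built archive served through a file:// URL.
with tempfile.TemporaryDirectory() as work_dir:
    json_path = os.path.join(work_dir, "sample.json")
    with open(json_path, "w") as fh:
        json.dump([{"status_code": 200}], fh)
    archive_path = os.path.join(work_dir, "sample.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tf:
        tf.add(json_path, arcname="sample.json")
    out_dir = os.path.join(work_dir, "out")
    names = download_and_extract(pathlib.Path(archive_path).as_uri(), out_dir)
    with open(os.path.join(out_dir, "sample.json")) as fh:
        extracted = json.load(fh)

print(names, extracted)
```

For a real download, the file:// URL would be replaced by respurl and dest_dir by the path variable built earlier.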