django python-3.x python-requests urllib contextmanager

File (.tar.gz) download and processing using the urllib and requests packages (Python)

SCOPE: Which library to use, urllib or requests?

I was trying to download a log file available at a URL. The URL was hosted on AWS and contained the file name as well. Accessing the URL returns a .tar.gz file to download. I needed to download this file into a directory of my choice, decompress and untar it to reach the JSON file inside, and finally parse that JSON file. While searching the internet I found only scattered pieces of information, so in this question I try to consolidate them in one place.

Solution

Using the requests library: a PyPI package, often considered the better choice when handling a large number of HTTP requests.

CODE:

import requests
import urllib.request
import tempfile
import shutil
import tarfile
import json
import os
import re

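# respurl is the URL from which the .tar.gz file will be downloaded.
with requests.get(respurl, stream=True) as response:
    # stream=True defers the body download so iter_content can read it in chunks.
    response.raise_for_status()
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        # Write straight to the temporary file; reopening it by name is not needed.
        for chunk in response.iter_content(chunk_size=8192):
            tmp_file.write(chunk)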

with tarfile.open(tmp_file.name,"r:gz") as tf:
    # To save the extracted file in directory of choice with same name as downloaded file.
    tf.extractall(path)
    # for loop for parsing json inside tar.gz file.
    for tarinfo_member in tf:
        print("tarfilename", tarinfo_member.name, "is", tarinfo_member.size, "bytes in size and is", end="")
        if tarinfo_member.isreg():
            print(" a regular file.")
        elif tarinfo_member.isdir():
            print(" a directory.")
        else:
            print(" something else.")
        if os.path.splitext(tarinfo_member.name)[1] == ".json":
            print("json file name:",os.path.splitext(tarinfo_member.name)[0])
            json_file = tf.extractfile(tarinfo_member)
            # capturing json file to read its contents and further processing.
            content = json_file.read()
            json_file_data = json.loads(content)
            print("Status Code",json_file_data[0]['status_code'])
            print("Response Body",json_file_data[0]['response'])
            # Had to decode content again as it was double encoded.
            print("Errors:",json.loads(json_file_data[0]['response'])['errors'])
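The member lookup and the double decoding can be tried without any download. The following is a minimal sketch, assuming a record layout like the one above (a list whose first item has a 'status_code' and a JSON-encoded 'response' string). The helper names build_sample_archive and read_json_members, the member name, and the sample values are all made up for illustration.

```python
import io
import json
import os
import tarfile


def build_sample_archive():
    """Build a small .tar.gz in memory holding one JSON file (made-up data)."""
    inner = json.dumps({"errors": ["sample error"]})   # first encoding
    record = [{"status_code": 400, "response": inner}]
    payload = json.dumps(record).encode("utf-8")       # second encoding

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tf:
        info = tarfile.TarInfo(name="sample-response.json")
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def read_json_members(fileobj):
    """Return {member name: parsed JSON} for every regular .json member."""
    parsed = {}
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tf:
        for member in tf:
            if member.isreg() and os.path.splitext(member.name)[1] == ".json":
                with tf.extractfile(member) as json_file:
                    parsed[member.name] = json.loads(json_file.read())
    return parsed


data = read_json_members(build_sample_archive())
record = data["sample-response.json"][0]
# The 'response' value is itself a JSON string, so it is decoded a second time.
errors = json.loads(record["response"])["errors"]
print(record["status_code"], errors)
```

Reading members through extractfile this way does not write anything to disk, so it can be used on its own when only the JSON contents are needed.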

The variable 'path' makes the extracted files land in a directory of choice, named after the downloaded file. It is built from LOG_ROOT and the file name taken from the URL.

A sample URL containing the file name '44301621eb-response.tar.gz' looks like this:

https://yoursite.com/44301621eb-response.tar.gz?AccessKeyId=your_id&Expires=1575526260&Signature=you_signature

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
PROJECT_NAME = 'your_project_name'
PROJECT_ROOT = os.path.join(BASE_DIR, PROJECT_NAME)
LOG_ROOT = os.path.join(PROJECT_ROOT, 'log')
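# respurl is the URL from which the file will be downloaded.
# For a URL like the sample above, re.split returns the captured groups
# between two empty strings, so index 2 holds the file name.
filename = re.split(r"([^?]+)(?:.+/)([^#?]+)(\?.*)?", respurl)
path = os.path.join(LOG_ROOT, filename[2])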

[Screenshot: regex match output from regex101.com]
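As an alternative to the regular expression, the file name can also be taken from the parsed URL path. This is only a sketch using the standard library, not the approach used above; filename_from_url is a hypothetical helper name, and the URL is the sample one.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = urlsplit(url).path
    return posixpath.basename(unquote(path))


sample_url = (
    "https://yoursite.com/44301621eb-response.tar.gz"
    "?AccessKeyId=your_id&Expires=1575526260&Signature=you_signature"
)
filename = filename_from_url(sample_url)
print(filename)
```

Here filename is a plain string rather than a list, so the target directory would be os.path.join(LOG_ROOT, filename).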

Comparison with urllib

To see the subtle differences, I implemented the same code with urllib as well.

Notice that the tempfile usage is slightly different here. With urllib I could use shutil.copyfileobj, because urlopen returns a file-like response object. The same call did not work for me on the requests response, which is a different kind of object (a requests.Response is not itself file-like).

with urllib.request.urlopen(respurl) as File:
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        shutil.copyfileobj(File, tmp_file)

with tarfile.open(tmp_file.name,"r:gz") as tf:
    print("Temp tf File:", tf.name)
    tf.extractall(path)
    for tarinfo in tf:
        print("tarfilename", tarinfo.name, "is", tarinfo.size, "bytes in size and is", end="")
        if tarinfo.isreg():
            print(" a regular file.")
        elif tarinfo.isdir():
            print(" a directory.")
        else:
            print(" something else.")
        # Use the loop variable of this block (tarinfo), not tarinfo_member.
        if os.path.splitext(tarinfo.name)[1] == ".json":
            print("json file name:", os.path.splitext(tarinfo.name)[0])
            json_file = tf.extractfile(tarinfo)
            # capturing json file to read its contents and further processing.
            content = json_file.read()
            json_file_data = json.loads(content)
            print("Status Code",json_file_data[0]['status_code'])
            print("Response Body",json_file_data[0]['response'])
            # Had to decode content again as it was double encoded.
            print("Errors:",json.loads(json_file_data[0]['response'])['errors'])
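shutil.copyfileobj only needs a source object with a read() method, which is why it works directly on the object returned by urllib.request.urlopen. The sketch below demonstrates this without a network connection by opening a local file:// URL in place of the real download; the file contents and variable names are made up for illustration.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path

# Create a local file to stand in for the remote download (made-up content).
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as source:
    source.write(b"example payload" * 1000)

source_url = Path(source.name).as_uri()   # file:// URL, no network needed

with urllib.request.urlopen(source_url) as response:
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        # The response is file-like, so it can be copied straight to disk.
        shutil.copyfileobj(response, tmp_file)

copied = Path(tmp_file.name).read_bytes()
print(len(copied))

# Clean up the temporary files created for this demonstration.
Path(source.name).unlink()
Path(tmp_file.name).unlink()
```

With requests, the file-like counterpart is the raw attribute of a response obtained with stream=True, so shutil.copyfileobj(response.raw, tmp_file) is the closest equivalent. Note that raw does not decode any Content-Encoding by default, which is one reason iter_content is often preferred.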