python shutil tarfile

Move files from child directories (created by unzipping) to the parent directory during the unzip step?


I've got a specific problem: I am downloading some large sets of data using requests. Each request returns a compressed archive containing a manifest of the download plus a set of folders, each holding one file.

I can unzip the archive + remove archive, and afterwards extract all files from subdirectories + remove subdirectories.

Is there a way to combine this? Since I'm new to both actions, I studied some tutorials and stack overflow questions on both topics. I'm glad it is working, but I'd like to refine my code and possibly combine these two steps - I didn't encounter it while I was browsing other information.

So for each set of parameters, I perform a request which ends up with:

# Write the file
with open((file_location+file_name), "wb") as output_file:
    output_file.write(response.content)
# Unzip it
with tarfile.open((file_location+file_name), "r:gz") as tarObj:
    tarObj.extractall(path=file_location)
# Remove compressed file
os.remove(file_location+file_name)

And then for the next step I wrote a function that:

target_dir = keyvalue[1] # target directory is stored in this tuple
subdirs = get_imm_subdirs(target_dir) # function to get subdirectories
for f in subdirs:
    subdir = os.path.join(target_dir, f)
    for c in os.listdir(subdir): # find file(s) in subdir
        shutil.move(os.path.join(subdir, c), str(target_dir)+"ALL_FILES/") # move them into 1 subdir
    os.rmdir(subdir) # remove the now-empty subdir

Is there an action I can perform during the unzip step?


Solution

  • You can extract the files individually rather than using extractall.

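    import os
    import tarfile

    with tarfile.open('musthaves.tar.gz') as tarObj:
        for member in tarObj.getmembers():
            if member.isfile():  # skip directories and other non-file entries
                # strip the directory part so the file lands at the top level
                member.name = os.path.basename(member.name)
                tarObj.extract(member, ".")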
    

    With appropriate credit to this SO question and the tarfile docs.

    getmembers() returns a list of what is inside the archive (as TarInfo objects); you could use getnames() instead, but then you'd have to devise your own test as to whether each entry is a file or a directory.

    isfile() - if it's not a file, you don't want it.

    member.name = os.path.basename(member.name) resets the subdirectory depth, so the extractor thinks everything is at the top level.
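To also fold the "write the archive, then remove it" step into the same pass, one option is to open the archive straight from the downloaded bytes. This is only a sketch under a few assumptions: extract_flat is a hypothetical helper name, archive_bytes stands in for response.content from the question, and raising an error on duplicate base names is one possible policy (the question says each folder holds one file, but nothing guarantees the names differ).

```python
import io
import os
import tarfile


def extract_flat(archive_bytes, dest_dir):
    """Extract every regular file of a .tar.gz payload directly into dest_dir.

    Hypothetical helper: archive_bytes stands in for response.content.
    Returns the list of file names that were written.
    """
    os.makedirs(dest_dir, exist_ok=True)
    written = []
    # Read the archive from memory, so it never has to be saved and deleted.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar_obj:
        for member in tar_obj.getmembers():
            if not member.isfile():
                continue  # skip directories, links and other non-file entries
            flat_name = os.path.basename(member.name)
            if flat_name in written:
                # Assumed policy: refuse to silently overwrite an earlier file.
                raise FileExistsError(f"duplicate file name in archive: {flat_name}")
            member.name = flat_name  # drop the subdirectory part of the path
            tar_obj.extract(member, dest_dir)
            written.append(flat_name)
    return written
```

In the question's loop this would be called as something like extract_flat(response.content, file_location). Keep in mind that response.content already holds the whole download in memory, so for very large archives writing to disk first, as in the original code, may still be the better trade-off.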