Tags: python, python-multithreading

How to make a python script run faster for moving files?


Hi, I am new to Python and I made this simple program to sort out my Downloads folder. When I run it with a decent number of files in the folder, it takes roughly 10 to 15 seconds.

I have heard Python is slow compared to C.

What is the best way to make this code more efficient without parallel threads?

Also, what is the best way to make it faster with parallel threads?

My next plan is to use Python to make a GUI for a desktop dashboard where I can run this code from a clickable icon. How can I achieve that? Any ideas would be greatly appreciated. Thanks for the help.

import shutil
import os, time

start = time.time()
print("Sorting Downlaods Folder...")

download_folder = "C:/Users/muham/Downloads"
images = "C:/Users/muham/Pictures"
documents = "C:/Users/muham/Documents"

docs = ["docx", ".txt", ".doc"]
imgs = [".png", ".jpg", "jpeg"]
prgrms = [".exe", ".php", ".c", "java", ".msi"]
    
for file in os.listdir(download_folder):
    if file[-4:] in docs:
        shutil.move(download_folder + '/' + file, documents + '/Docs/' + file)
    elif file[-4:] == "xlsx" or file[-4:] == ".csv":
        shutil.move(download_folder + '/' + file, documents + '/Excel/' + file)
    elif file[-4:] == ".pdf" or file[-4:] == ".ppt" or file[-4:] == ".pptx":
        shutil.move(download_folder + '/' + file, documents + '/PDFs/' + file)
    elif file[-4:] in imgs:
        shutil.move(download_folder + '/' + file, images + '/' + file) 
    elif file[-4:] in prgrms or file[-2:] == ".c":
        shutil.move(download_folder + '/' + file, "D:/programs/" + file) 
    else:
        shutil.move(download_folder + '/' + file, "D:/zips/" + file)

print("Download folder sorted")
end = time.time()
print("Time taken:" , end - start)

Solution

  • The first thing to do is profile this code to find out where it spends the time. Optimize operations in this order:

    1. frequent and take a long time
    2. frequent and take a short time; infrequent and take a long time
    3. infrequent and take a short time

    #2 is a balancing act between the two types of ops, but generally frequent and short will be the better option.

    If the majority of the time is spent in shutil.move(), optimizing everything else can, in theory, cut the runtime by at most half; in practice you will get less than that. In that case multithreading is the better lever, because the work is I/O-bound and threads can overlap the time spent waiting on the disk.
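
    A minimal sketch of a threaded version, assuming the destination folders already exist. Only the .pdf rule is shown here to keep it short; the real worker would apply the full if/elif chain from the question:

    import os
    import shutil
    from concurrent.futures import ThreadPoolExecutor

    download_folder = "C:/Users/muham/Downloads"    # same folders as in the question
    documents = "C:/Users/muham/Documents"

    def move_one(file):
        # Illustrative rule only: send PDFs to the PDFs folder, skip everything else.
        if file.rsplit(".", 1)[-1] == "pdf":
            shutil.move(os.path.join(download_folder, file),
                        os.path.join(documents, "PDFs", file))

    # Threads overlap well here because the blocking file operations inside
    # shutil.move() release the GIL while they wait on the disk.
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(move_one, os.listdir(download_folder)))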


    Absent profile information, the following is the best I can do.

    The easiest way to improve speed is to change docs, imgs and prgrms to sets. Typically, membership testing in a set is faster than in a list for collections with more than two elements.

    from timeit import timeit
    stmt = '''
    for word in wordlist:
        word in keys
    '''
    timeit(stmt, setup='import random; keys=["foo", "bar", "baz"]; wordlist = random.choices(keys, k=100)')
    timeit(stmt, setup='import random; keys={"foo", "bar", "baz"}; wordlist = random.choices(list(keys), k=100)')
    
    3.8205395030090585
    2.583121053990908

    Another opportunity lies in using str.rsplit() instead of slicing to get the file extension. Each slice creates a new string, and this code slices the filename multiple times per loop iteration. Do it once at the start of the loop, before the conditional block. This also requires removing the .s from the extensions in docs, imgs and prgrms (shown below already converted to sets, per the earlier suggestion).

    docs = ["docx", "txt", "doc"]
    imgs = ["png", "jpg", "jpeg"]
    prgrms = ["exe", "php", "c", "java", "msi"]
        
    for file in os.listdir(download_folder):
        extension = file.rsplit(".")
        if extension in docs:
    

    Reorder your conditionals based on experimental data, so that the most frequently true comparison comes first, the second most frequently true comes second, and so forth.
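
    Put together (set lookups, a single rsplit() per file, reordered branches), the loop might look roughly like this; it reuses download_folder, images and documents from the question, the branch order is only a guess, and the destination folders are assumed to exist:

    import os
    import shutil

    docs = {"docx", "txt", "doc"}
    imgs = {"png", "jpg", "jpeg"}
    sheets = {"xlsx", "csv"}
    slides = {"pdf", "ppt", "pptx"}
    prgrms = {"exe", "php", "c", "java", "msi"}

    for file in os.listdir(download_folder):
        extension = file.rsplit(".", 1)[-1]          # split once instead of slicing repeatedly
        src = os.path.join(download_folder, file)
        if extension in imgs:                        # guessed to be the most common case
            shutil.move(src, os.path.join(images, file))
        elif extension in docs:
            shutil.move(src, os.path.join(documents, "Docs", file))
        elif extension in sheets:
            shutil.move(src, os.path.join(documents, "Excel", file))
        elif extension in slides:
            shutil.move(src, os.path.join(documents, "PDFs", file))
        elif extension in prgrms:
            shutil.move(src, "D:/programs/" + file)
        else:
            shutil.move(src, "D:/zips/" + file)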


    Using time.time() to measure execution time is unreliable. It measures wall-clock time, so it does not separate out time used by other processes: if you measure while other processes are creating heavy resource contention, the result will be larger than the execution time of the code you actually care about. Another source of error is that the system clock may change during execution (most likely due to an NTP synchronization). These are some of the reasons I recommend profiling with a package like cProfile.
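
    For example, the whole script can be profiled from the command line, sorted by cumulative time (the filename is just a placeholder for whatever you saved the script as):

    python -m cProfile -s cumulative sort_downloads.py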


    Edit: I noticed that some of the shutil.move() calls move files from the C drive to the D drive. This can trigger an actual copy rather than a rename, which can result in a significant increase in runtime, depending on how your filesystems are set up.
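
    A quick way to check which moves will be cheap renames and which will be full copies is to compare the drive components of the source and destination paths (the paths below are just examples):

    import os

    src = "C:/Users/muham/Downloads/setup.msi"   # example source path
    dst = "D:/programs/setup.msi"                # example destination path

    # Same drive: shutil.move() is just an os.rename(), effectively instant.
    # Different drives: it falls back to copy-and-delete, which scales with file size.
    same_drive = os.path.splitdrive(src)[0].lower() == os.path.splitdrive(dst)[0].lower()
    print("cheap rename" if same_drive else "full copy + delete")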