I have created a function enc()
import base64
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
from cryptography.hazmat.backends import default_backend

def enc():
    password = bytes('asd123', 'utf-8')
    salt = bytes('asd123', 'utf-8')
    kdf = PBKDF2HMAC(
        algorithm=hashes.SHA256(),
        length=32,
        salt=salt,
        iterations=10000,
        backend=default_backend())
    key = base64.urlsafe_b64encode(kdf.derive(password))
    f = Fernet(key)
    for file in files:
        with open(file, 'rb') as original_file:
            original = original_file.read()
        encrypted = f.encrypt(original)
        with open(file, 'wb') as encrypted_file:
            encrypted_file.write(encrypted)
which loops through every file in files and encrypts it, where files is:
files = ['D:/folder/asd.txt',
'D:/folder/qwe.mp4',
'D:/folder/qwe.jpg']
I want to use multithreading or multiprocessing to make it faster. Is it possible? I need some help with the code.
I tried multithreading:
thread = threading.Thread(target=enc)
thread.start()
thread.join()
But it doesn't seem to improve the speed or the time taken. I need some help implementing multiprocessing. Thanks.
Threading is not the best candidate for tasks that are CPU intensive unless the task is being performed, for example, by a C-language library routine that releases the Global Interpreter Lock. In any event, you certainly will not get any performance gains with multithreading or multiprocessing unless you run multiple processes in parallel.
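As a standalone illustration of that point (a sketch with a made-up pure-Python CPU-bound function, nothing to do with the OP's encryption code), a thread pool gives essentially no speedup on such a task because of the GIL, while a process pool spreads the same work across cores:

from multiprocessing.pool import Pool, ThreadPool
import time

def cpu_task(n):
    # pure-Python arithmetic; the GIL is held the whole time
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    work = [2_000_000] * 4

    start = time.perf_counter()
    with ThreadPool(4) as tp:
        tp.map(cpu_task, work)
    print('thread pool: ', time.perf_counter() - start)

    start = time.perf_counter()
    with Pool(4) as pp:
        pp.map(cpu_task, work)
    print('process pool:', time.perf_counter() - start)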
Let's say you have N tasks and M processors to process the tasks. If the tasks were pure CPU with no I/O (not exactly your situation), there would be no advantage in starting more than M processes to work on your N tasks, and for this a multiprocessing pool is the ideal situation. When there is a mix of CPU and I/O, it could be advantageous to have a pool size greater than M, even possibly as large as N if there is a lot of I/O and very little CPU. But in that case it would be better to actually use a combination of a multithreading pool and a multiprocessing pool (of size M), where the multithreading pool is used for all of the I/O work and the multiprocessing pool for the CPU computations. The following code shows that technique:
from multiprocessing.pool import Pool, ThreadPool
from multiprocessing import cpu_count
from functools import partial
import base64
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
from cryptography.hazmat.backends import default_backend

def encrypt(key, b):
    f = Fernet(key)
    return f.encrypt(b)

def enc(key, process_pool, file):
    with open(file, 'rb') as original_file:
        original = original_file.read()
    encrypted = process_pool.apply(encrypt, args=(key, original))
    with open(file, 'wb') as encrypted_file:
        encrypted_file.write(encrypted)
def main():
    password = bytes('asd123', 'utf-8')
    salt = bytes('asd123', 'utf-8')
    kdf = PBKDF2HMAC(
        algorithm=hashes.SHA256(),
        length=32,
        salt=salt,
        iterations=10000,
        backend=default_backend())
    key = base64.urlsafe_b64encode(kdf.derive(password))
    files = ['D:/folder/asd.txt',
             'D:/folder/qwe.mp4',
             'D:/folder/qwe.jpg']
    # Too many threads may be counterproductive due to disk contention.
    # Should MAX_THREADS be unlimited?
    # For a solid-state drive with no physical arm movement,
    # an extremely large value, e.g. 500, probably would not hurt.
    # For "regular" drives, one needs to experiment.
    MAX_THREADS = 500  # Essentially no limit
    # Compute the number of processes in our pool:
    # the least of MAX_THREADS, the number of cores we have, and the number of files to process.
    pool_size = min(MAX_THREADS, cpu_count(), len(files))
    # Create the process pool:
    process_pool = Pool(pool_size)
    # Create the thread pool:
    thread_pool = ThreadPool(len(files))
    worker = partial(enc, key, process_pool)
    thread_pool.map(worker, files)

if __name__ == '__main__':
    main()
Anyway, the point is this: let's say you had 30 files and 4 cores instead of 3 files. The solution posted by @anarchy would be starting 30 processes and computing f 30 times, but it could really only utilize effectively 4 processors for the parallel computation of f and for doing the encryption. My solution would use 30 threads for doing the I/O but only start 4 processes, thus computing f only 4 times. You save creating 26 processes and 26 computations of f that are useless.
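For contrast, here is roughly the shape of the one-process-per-file approach (my own sketch of it, not @anarchy's actual code): each worker repeats the PBKDF2 derivation and builds its own Fernet object, once per file:

from multiprocessing import Pool
import base64
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
from cryptography.hazmat.backends import default_backend

def enc_one(file):
    # every call repeats the expensive key derivation
    kdf = PBKDF2HMAC(
        algorithm=hashes.SHA256(),
        length=32,
        salt=bytes('asd123', 'utf-8'),
        iterations=10000,
        backend=default_backend())
    key = base64.urlsafe_b64encode(kdf.derive(bytes('asd123', 'utf-8')))
    f = Fernet(key)
    with open(file, 'rb') as fh:
        data = fh.read()
    with open(file, 'wb') as fh:
        fh.write(f.encrypt(data))

if __name__ == '__main__':
    files = ['D:/folder/asd.txt', 'D:/folder/qwe.mp4', 'D:/folder/qwe.jpg']
    with Pool(len(files)) as pool:  # one process per file
        pool.map(enc_one, files)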
It might even be better to have fewer than 30 threads unless you have a solid-state drive, since all your threads are contending for the same drive, and (1) each file may be located in a totally different location on the drive, and performing concurrent I/O against such files could be counterproductive, and (2) there is some maximum throughput that can be achieved by any particular drive.
So perhaps we should have:
thread_pool = ThreadPool(min(len(files), MAX_THREADS))
where MAX_THREADS is set to some maximum value suitable for your particular drive.
Update
Now the expensive computation of key is only done once.
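If you want to see what each repeated derivation would have cost on your machine, here is a quick timing sketch using the same PBKDF2 parameters as above:

import base64
import time
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
from cryptography.hazmat.backends import default_backend

start = time.perf_counter()
kdf = PBKDF2HMAC(
    algorithm=hashes.SHA256(),
    length=32,
    salt=bytes('asd123', 'utf-8'),
    iterations=10000,
    backend=default_backend())
key = base64.urlsafe_b64encode(kdf.derive(bytes('asd123', 'utf-8')))
print('one key derivation took', time.perf_counter() - start, 'seconds')

Multiply that by the number of files to estimate what deriving the key once per file would have added.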
The OP's New Problem Running With Tkinter
Actually, you have two problems. Not only are multiple windows being opened, but you are probably also getting a pickle error when trying to call the multiprocessing worker function encrypt, because such functions must be defined at global scope and not be nested within another function as you have done.
On platforms that use the spawn method to create new processes, such as Windows, in order to create and initialize each process in the pool created by your process_pool = Pool(pool_size) statement, a new, empty address space is created and a new Python interpreter is launched that re-reads and re-executes the source program to initialize the address space before ultimately calling the worker function encrypt. That means every statement at global scope, i.e. import statements, variable declarations, function declarations, etc., is executed for this purpose. However, in the new subprocess the variable __name__ will not be '__main__', so any statements within an if __name__ == '__main__': block at global scope will not be executed. By the way, that is why, on Windows platforms, code at global scope that ultimately results in creating new processes is placed within such a block; failure to do so would result in an infinite recursive process-creation loop if it were to go otherwise undetected. But you placed such a check on __name__ within a nested function, where it serves no purpose.
Realizing, then, that all statements at global scope will be executed as part of the initialization of every process in a multiprocessing pool, ideally you should only have at global scope those statements that are required for the initialization of those processes, or at least "harmless" statements, i.e. statements whose presence is not overly costly to execute and which have no unpleasant side effects. Harmful statements should instead be placed within an if __name__ == '__main__': block or moved to within a function.
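To make that concrete, here is a tiny standalone sketch (not part of the OP's program) of how the spawn start method treats module-level statements:

from multiprocessing import Pool

# At global scope: under the spawn start method this line is re-executed
# by every worker process the pool creates.
print('module-level statement executed')

def square(x):
    return x * x

if __name__ == '__main__':
    # Only the main process enters this block; the spawned workers re-import
    # the module with __name__ set to '__mp_main__', so they never try to
    # create a pool of their own.
    with Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))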
It should be clear now that the statements you have that create the main window are "harmful" statements that you do not want executed by each newly created process. The tail end of your code should be as follows (I have also incorporated a MAX_THREADS constant to limit the maximum number of threads that will be created although here it is set arbitrarily large -- you should experiment with much smaller values such as 3, 5, 10, 20, etc. to see what gives you the best throughput):
def passerrorbox():
    tk.messagebox.showerror('Password Error', 'Enter a Password')
    fipasswordbox.delete(0, 'end')
    fisaltbox.delete(0, 'end')
    filistbox.delete(0, 'end')

# Changes start here:
# Get rid of all nesting of functions:

def encrypt(key, a):
    f = Fernet(key)
    return f.encrypt(a)

def enc(key, process_pool, file):
    # File Encryption
    with open(file, 'rb') as original_file:
        original = original_file.read()
    encrypted = process_pool.apply(encrypt, args=(key, original))
    with open(file, 'wb') as encrypted_file:
        encrypted_file.write(encrypted)

def encfile():  # was previously named main
    password = bytes(fipasswordbox.get(), 'utf-8')
    salt = bytes(fisaltbox.get(), 'utf-8')
    fileln = filistbox.get(0, 'end')
    if len(fileln) == 0:
        fierrorbox()
    elif len(password) == 0:
        passerrorbox()
    else:
        file_enc_button['state'] = 'disabled'
        browsefi['state'] = 'disabled'
        fipasswordbox['state'] = 'disabled'
        fisaltbox['state'] = 'disabled'
        kdf = PBKDF2HMAC(
            algorithm=hashes.SHA256(),
            length=32,
            salt=salt,
            iterations=10000,
            backend=default_backend())
        key = base64.urlsafe_b64encode(kdf.derive(password))
        # Too many threads may be counterproductive due to disk contention.
        # Should MAX_THREADS be unlimited?
        # For a solid-state drive with no physical arm movement,
        # an extremely large value, e.g. 500, probably would not hurt.
        # For "regular" drives, one needs to experiment.
        MAX_THREADS = 500  # Essentially no limit
        pool_size = min(MAX_THREADS, cpu_count(), len(fileln))
        process_pool = Pool(pool_size)
        thread_pool = ThreadPool(min(MAX_THREADS, len(fileln)))
        worker = partial(enc, key, process_pool)
        thread_pool.map(worker, fileln)
        fiencdone()

if __name__ == '__main__':
    root = tk.Tk()
    fileframe()
    root.mainloop()