Tags: python, multithreading, algorithm, queue, race-condition

Race Condition with Thread in Python


I back up files with the backupFile function. Each backed-up file is hashed and its hash is added to hashList; before backing up another file, I check hashList to see whether it has already been backed up. Using threads and a queue, I can back up multiple files at the same time, but because more than one thread works on the same hashList, I get a race condition. I solved this with lock = threading.Lock(), but the lock prevents parallelism: while one thread is running, the other threads wait. That defeats my purpose in using threads, which was to save time.

I want to both use threads and check whether a file has been backed up before.

I may be asking a lot, but I need your ideas. Thanks.

My code:

import threading, hashlib, queue, os


def hashFile(fileName):
    with open(fileName, "rb") as f:
        sha256 = hashlib.sha256()
        while chunk := f.read(4096):
            sha256.update(chunk)
        return sha256.hexdigest()


def backupFile(q):
    while not q.empty():
        fileName = q.get()

        with lock:
            if hashFile(filesToBackupPath+fileName) in hashList:
                print(f"\033[33m{fileName} was already backed up\033[0m")
            else:
                print(f"\033[32m{fileName} backed up\033[0m")
                hashList.append(hashFile(filesToBackupPath+fileName))

        q.task_done()


filesToBackupPath = "yedeklenecekDosyalar/"
fileList = os.listdir(filesToBackupPath)
hashList = []

q = queue.Queue()

for file in fileList:
    q.put(file)

lock = threading.Lock()

for i in range(20):
    t = threading.Thread(target=backupFile, args=(q,))
    t.start()

q.join()

print('\n',len(hashList))

Solution

  • There is no reason to hold the lock during the call to hashFile. Compute the hash first, then lock only around the check:

    fileHash = hashFile(filesToBackupPath+fileName)

    with lock:
        if fileHash in hashList:
            alreadyBackedUp = True
        else:
            alreadyBackedUp = False
            hashList.append(fileHash)
    
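    Do everything else, such as printing the result, outside the lock.

    The only place you need the lock is when accessing hashList.
    Also, why use a list rather than a set? A membership test on a set takes constant time on average, while on a list it scans every element.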
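
Putting the points above together, here is a minimal sketch of how the worker might look. It is only one possible arrangement, not necessarily what the asker ended up with. The names backupWorker, backupDirectory, seenHashes and results are hypothetical and do not appear in the question. The sketch also swaps the q.empty()/q.get() pair for q.get_nowait(), on the assumption that you do not want a thread to block forever if another thread takes the last item between the two calls.

```python
import hashlib
import os
import queue
import threading


def hashFile(fileName):
    """Return the SHA-256 hex digest of a file, read in 4096-byte chunks."""
    sha256 = hashlib.sha256()
    with open(fileName, "rb") as f:
        while chunk := f.read(4096):
            sha256.update(chunk)
    return sha256.hexdigest()


def backupWorker(q, directory, seenHashes, lock, results):
    """Hypothetical worker: hash outside the lock, check-and-add inside it."""
    while True:
        try:
            # get_nowait() raises queue.Empty instead of blocking, so there
            # is no gap between "is it empty?" and "take an item".
            fileName = q.get_nowait()
        except queue.Empty:
            return

        # Slow part (disk reads and hashing) runs without the lock,
        # so several threads can do it at the same time.
        fileHash = hashFile(os.path.join(directory, fileName))

        # Only the check-and-add on the shared data is locked.
        with lock:
            alreadyBackedUp = fileHash in seenHashes
            if not alreadyBackedUp:
                seenHashes.add(fileHash)
            results[fileName] = alreadyBackedUp

        # The actual copy and any printing would go here, outside the lock.


def backupDirectory(directory, numThreads=20):
    """Hypothetical driver: returns (set of hashes, {fileName: wasDuplicate})."""
    q = queue.Queue()
    for name in os.listdir(directory):
        # Skip subdirectories, which cannot be opened as files.
        if os.path.isfile(os.path.join(directory, name)):
            q.put(name)

    seenHashes = set()
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(
            target=backupWorker,
            args=(q, directory, seenHashes, lock, results),
        )
        for _ in range(numThreads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seenHashes, results
```

Calling backupDirectory("yedeklenecekDosyalar/") would then return the set of distinct hashes and a dict saying which files were duplicates. The lock is now held only for a set lookup and an insert, so threads spend almost no time waiting on each other. According to the hashlib documentation, the GIL is released while hashing data larger than 2047 bytes, so with 4096-byte chunks the hashing itself can overlap across threads.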