python-2.7, multiprocessing, pyodbc, python-multiprocessing

Getting too many deadlock errors while updating an MSSQL table with pyodbc in parallel with multiprocessing


I am trying to open pickle files that have data within them, then update an MSSQL table with that data. It was taking forever (10 days to update 1,000,000 rows), so I wrote a script for more parallelism. The more processes I run it with, the more errors I get like this:

(<class 'pyodbc.Error'>, Error('40001', '[40001] [Microsoft][ODBC SQL Server Driver][SQL Server]Transaction (Process ID 93) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction. (1205) (SQLExecDirectW)'), <traceback object at 0x0000000002791808>)

As you can see in my code, I keep retrying the update until it succeeds, and I even sleep for a second between attempts:

while True:
    try:
        updated = cursor.execute(update, 'Yes', fileName + '.' + ext, dt, size, uniqueID)
        break
    except:
        time.sleep(1)
        print sys.exc_info()
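
(For reference, a narrower retry that only catches the '40001' deadlock SQLSTATE shown above would look something like this untested sketch; execute_with_retry is just an illustrative name, not part of my script.)

import time
import pyodbc

def execute_with_retry(cursor, sql, params, retries=5):
    # Illustrative helper: retry only the 1205 deadlock-victim error
    # (SQLSTATE '40001') and back off a little longer between tries.
    delay = 1
    for attempt in range(retries):
        try:
            return cursor.execute(sql, params)
        except pyodbc.Error as e:
            if e.args[0] != '40001' or attempt == retries - 1:
                raise                  # not a deadlock, or out of retries
            time.sleep(delay)
            delay *= 2                 # wait a little longer each time

# e.g. updated = execute_with_retry(cursor, update, ('Yes', fileName + '.' + ext, dt, size, uniqueID))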

Is this because, when you use the multiprocessing module on Windows, it uses spawn instead of os.fork?

Is there a way to do this that will provide more of a speedup?

I was told that the table can handle way more transactions than this...

#!C:/Python/python.exe -u

import pyodbc,re,pickle,os,glob,sys,time
from multiprocessing import Lock, Process, Queue, current_process


def UpDater(pickleQueue):

    for pi in iter(pickleQueue.get, 'STOP'):
        name = current_process().name

        cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=database.windows.net;DATABASE=DB;UID=user;PWD=pwd')
        cursor = cnxn.cursor()
        update = ("""UPDATE DocumentList
                SET Downloaded=?, DownLoadedAs=?, DownLoadedWhen=?, DownLoadedSizeKB=?
                WHERE DocNumberSequence=?""")

        r = re.compile(r'\d+')

        # pull this document's metadata out of its pickle file
        pkl_file = open(pi, 'rb')
        meta = pickle.load(pkl_file)
        fileName = meta[0][0]
        ext = meta[0][1]
        url = meta[0][2]
        uniqueID = r.findall(fileName)[-1]   # last run of digits in the file name
        dt = meta[0][4]
        size = meta[0][5]

        # keep retrying the update until it succeeds
        while True:
            try:
                updated = cursor.execute(update, 'Yes', fileName + '.' + ext, dt, size, uniqueID)
                break
            except:
                time.sleep(1)
                print sys.exc_info()

        print uniqueID

        cnxn.commit()
        pkl_file.close()
        os.remove(fileName + '.pkl')
        cnxn.close()

if __name__ == '__main__':

    os.chdir('Pickles')
    pickles = glob.glob("*.pkl")

    # load every pickle file name into the shared work queue
    pickleQueue = Queue()
    processes = []
    for item in pickles:
        pickleQueue.put(item)

    # start the requested number of workers, plus one STOP sentinel per worker
    workers = int(sys.argv[1])
    for x in xrange(workers):
        p = Process(target=UpDater, args=(pickleQueue,))
        p.start()
        processes.append(p)
        pickleQueue.put('STOP')

    for p in processes:
        p.join()
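
For reference, the same worker with the connection and UPDATE text created once per process instead of once per pickle file would look roughly like this (untested sketch, not taken from the answer below; deadlock retry handling omitted, and it relies on the same imports already at the top of the script):

def UpDater(pickleQueue):
    # One connection and one UPDATE string per worker process;
    # each pickle file then only costs one execute + commit.
    cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=database.windows.net;DATABASE=DB;UID=user;PWD=pwd')
    cursor = cnxn.cursor()
    update = ("""UPDATE DocumentList
            SET Downloaded=?, DownLoadedAs=?, DownLoadedWhen=?, DownLoadedSizeKB=?
            WHERE DocNumberSequence=?""")
    r = re.compile(r'\d+')

    for pi in iter(pickleQueue.get, 'STOP'):
        pkl_file = open(pi, 'rb')
        meta = pickle.load(pkl_file)
        pkl_file.close()

        fileName = meta[0][0]
        ext = meta[0][1]
        uniqueID = r.findall(fileName)[-1]
        dt = meta[0][4]
        size = meta[0][5]

        cursor.execute(update, ('Yes', fileName + '.' + ext, dt, size, uniqueID))
        cnxn.commit()
        os.remove(fileName + '.pkl')

    cnxn.close()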

I am using Windows 7 and Python 2.7 (Anaconda distribution).

EDIT: The answer below, using row locks, stopped the error from happening. However, the updates were still slow. It turns out an old-fashioned index on the primary key was needed for a 100x speedup.
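
For anyone hitting the same thing, a rough sketch of that index (untested; the index name is illustrative, and it assumes DocNumberSequence, the column in the WHERE clause, is what needs to be indexed):

import pyodbc

# Untested sketch: a plain nonclustered index on the lookup column.
# IX_DocumentList_DocNumberSequence is an illustrative name only.
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=database.windows.net;DATABASE=DB;UID=user;PWD=pwd')
cursor = cnxn.cursor()
cursor.execute("""CREATE NONCLUSTERED INDEX IX_DocumentList_DocNumberSequence
                  ON DocumentList (DocNumberSequence)""")
cnxn.commit()
cnxn.close()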


Solution

  • A few things to try. Using sleeps is a bad idea. First, could you try row-level locking? (A short usage sketch follows at the end of this answer.)

        update = ("""UPDATE DocumentList WITH (ROWLOCK)
                SET Downloaded=?, DownLoadedAs=?, DownLoadedWhen=?, DownLoadedSizeKB=?
                WHERE DocNumberSequence=?""")
    

    Another option would be to wrap each update in a named transaction:

        update = ("""
            BEGIN TRANSACTION my_trans;
                UPDATE DocumentList
                SET Downloaded=?, DownLoadedAs=?, DownLoadedWhen=?, DownLoadedSizeKB=?
                WHERE DocNumberSequence=?;
            COMMIT TRANSACTION my_trans;
        """)
    

    Would either of these solutions work for you?
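
    For reference, a minimal sketch of how the ROWLOCK variant slots into the worker from the question (cursor, cnxn, fileName, ext, dt, size and uniqueID are the names already used there):

        # Sketch only: same execute/commit pattern as the question's worker,
        # just with the ROWLOCK hint and an immediate commit so each row's
        # lock is held as briefly as possible.
        update = ("""UPDATE DocumentList WITH (ROWLOCK)
                SET Downloaded=?, DownLoadedAs=?, DownLoadedWhen=?, DownLoadedSizeKB=?
                WHERE DocNumberSequence=?""")

        cursor.execute(update, ('Yes', fileName + '.' + ext, dt, size, uniqueID))
        cnxn.commit()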