I am trying to open pickle files that have data within them, then update a MSSQL table with that data. It was taking forever, 10 days to update 1,000,000 rows. So i wrote a script for more parallelism. The more processes i run it with the more errors i get like this
(<class 'pyodbc.Error'>, Error('40001', '[40001] [Microsoft][ODBC SQL Server Dri
ver][SQL Server]Transaction (Process ID 93) was deadlocked on lock resources wit
h another process and has been chosen as the deadlock victim. Rerun the transact
ion. (1205) (SQLExecDirectW)'), <traceback object at 0x0000000002791808>)
As you can see in my code i keep trying to process the update until successful and even sleep for a second here
while True:
try:
updated = cursor.execute(update,'Yes', fileName+'.'+ext, dt, size,uniqueID )
break
except:
time.sleep(1)
print sys.exc_info()
Is this because when you use the multiprocessing module in windows it uses os.spawn instead of os.fork ?
Is there a way to do this that will provide more speed up?
I was told that the table can handle way more transactions then this...
#!C:/Python/python.exe -u
import pyodbc,re,pickle,os,glob,sys,time
from multiprocessing import Lock, Process, Queue, current_process
def UpDater(pickleQueue):
for pi in iter(pickleQueue.get, 'STOP'):
name = current_process().name
f=pi
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=database.windows.net;DATABASE=DB;UID=user;PWD=pwd');
cursor = cnxn.cursor()
update = ("""UPDATE DocumentList
SET Downloaded=?, DownLoadedAs=?,DownLoadedWhen=?,DownLoadedSizeKB=?
WHERE DocNumberSequence=?""")
r = re.compile('\d+')
pkl_file = open(pi, 'rb')
meta = pickle.load(pkl_file)
fileName = meta[0][0]
pl = r.findall(fileName)
l= int(len(pl)-1)
ext = meta[0][1]
url = meta[0][2]
uniqueID = pl[l]
dt = meta[0][4]
size = meta[0][5]
while True:
try:
updated = cursor.execute(update,'Yes', fileName+'.'+ext, dt, size,uniqueID )
break
except:
time.sleep(1)
print sys.exc_info()
print uniqueID
cnxn.commit()
pkl_file.close()
os.remove(fileName+'.pkl')
cnxn.close()
if __name__ == '__main__':
os.chdir('Pickles')
pickles = glob.glob("*.pkl")
pickleQueue=Queue();processes =[];
for item in pickles:
pickleQueue.put(item)
workers = int(sys.argv[1]);
for x in xrange(workers):
p = Process(target=UpDater,args=(pickleQueue,))
p.start()
processes.append(p)
pickleQueue.put('STOP')
for p in processes:
p.join()
I am using Windows 7 and python 2.7 Anaconda Distribution
EDIT The answer below to use row locks stopped the error from happening. However, the updates were still slow. Turns out an old fashion index on the primary key was needed for 100x speed up
A few things to try. Using sleeps is a bad idea. First, could you try row level locking?
update = ("""UPDATE DocumentList WITH (ROWLOCK)
SET Downloaded=?, DownLoadedAs=?,DownLoadedWhen=?,DownLoadedSizeKB=?
WHERE DocNumberSequence=? """)
Another option would be to wrap each in a transaction:
update = ("""
BEGIN TRANSACTION my_trans;
UPDATE DocumentList
SET Downloaded=?, DownLoadedAs=?,DownLoadedWhen=?,DownLoadedSizeKB=?
WHERE DocNumberSequence=?;
END TRANSACTION my_trans;
""")
Would either of these solutions work for you?