python performance swap large-data

Swapping IDs & Python Performance


I'm hoping to get help making my code run more efficiently. The purpose of my code is to take out the first ID (RUID) and replace it with a de-identified ID (RESPID), based on a key file that maps one ID to the other. The input data file is a large tab-delimited text file of about 2.5GB. The data is very wide: each row has thousands of columns. I have a function that works, but on the actual data it is incredibly slow. My first file has been running for 4 days and is only at 1.4GB. I don't know which part of my code is the most problematic, but I suspect it is where I build the row back together and write each row individually. Any advice on how to improve this would be greatly appreciated; 4 days is way too long for processing! Thank you!

import datetime

def swap():
    #input files
    infile1 = open(r"Z:\ped_test.txt", 'rb')
    keyfile = open(r"Z:\ruid_respid_test.txt", 'rb')

    #output file
    outfile=open(r"Z:\ped_testRESPID.txt", 'wb')
    # create dictionary of RUID-RESPID
    COLUMN = 1 #Column containing RUID
    RESPID={}
    for k in keyfile:
        kList = k.rstrip('\r\n').split('\t')
        if kList[0] not in RESPID and kList[0] != "":
            RESPID[kList[0]]=kList[1]
    #print RESPID
    print "creating RESPID-RUID xwalk dictionary is done"

    print "Start creating new file"
    print str(datetime.datetime.now())
    count=0
    for line in infile1:
        #if not re.match('#', line): #if there is a header
        sline = line.split()
        #slen = len(sline)
        RUID = sline[COLUMN]
        #print RUID
        C0 = sline[0]
        #print C0
        DAT=sline[2:]

        for key in RESPID:
            if key==RUID:
                NewID=RESPID[key]
        row=str(C0+'\t'+NewID)
        for a in DAT:
            row=row+'\t'+a
        #print row
        outfile.write(row)
        outfile.write('\n')

    infile1.close()
    keyfile.close()
    outfile.close()

    print "All Done: RESPID replacement is complete"
    print str(datetime.datetime.now())

Solution

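  • You have several places where you can speed things up. The main problem is that you loop over every key in RESPID for each row, when you can just use the dictionary's 'get' method to look up the value directly. But since you have very wide lines, there are a couple of other tweaks that will make a difference. In particular, splitting off only the first two fields and leaving the rest of the line as one string avoids splitting and re-joining thousands of columns per row.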

    import datetime

    def swap():
        #input files
        infile1 = open(r"Z:\ped_test.txt", 'rb')
        keyfile = open(r"Z:\ruid_respid_test.txt", 'rb')

        #output file
        outfile=open(r"Z:\ped_testRESPID.txt", 'wb')
        # create dictionary of RUID-RESPID
        COLUMN = 1 #Column containing RUID
        RESPID={}
        for k in keyfile:
            # minor: just grab what you need (strip the line ending first,
            # otherwise it ends up inside the RESPID value)
            kList = k.rstrip('\r\n').split('\t', 2)
            if kList[0] and kList[0] not in RESPID: # minor: do the cheap test first
                RESPID[kList[0]]=kList[1]
        #print RESPID
        print "creating RESPID-RUID xwalk dictionary is done"

        print "Start creating new file"
        print str(datetime.datetime.now())
        for line in infile1:
            #if not re.match('#', line): #if there is a header
            # minor: just grab what you need; sline[2] is the rest of the
            # line as one unsplit string
            sline = line.rstrip('\r\n').split('\t', 2)
            RUID = sline[COLUMN]

            # the biggie, just use a lookup
            #for key in RESPID:
            #   if key==RUID:
            #       NewID=RESPID[key]
            # an RUID missing from the key file is left unchanged
            row = '\t'.join([sline[0], RESPID.get(RUID, sline[1]), sline[2]])

            #row=str(C0+'\t'+NewID)
            #for a in DAT:
            #   row=row+'\t'+a
            outfile.write(row)
            outfile.write('\n')

        infile1.close()
        keyfile.close()
        outfile.close()

        print "All Done: RESPID replacement is complete"
        print str(datetime.datetime.now())
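
For anyone on Python 3 (the code above is Python 2), here is a minimal sketch of the same two ideas: a direct dictionary lookup, and splitting off only the leading fields so the wide tail of each row is never split. The names `load_key` and `swap_ids` and the sample IDs are made up for illustration and are not from the question. It takes already-open file objects so it can be tried on in-memory data first.

```python
import io

def load_key(keyfile):
    """Build the RUID -> RESPID dict from a tab-delimited key file object."""
    mapping = {}
    for line in keyfile:
        fields = line.rstrip("\r\n").split("\t", 2)
        # Skip blank or one-column lines; keep the first mapping for a repeated RUID.
        if len(fields) >= 2 and fields[0] and fields[0] not in mapping:
            mapping[fields[0]] = fields[1]
    return mapping

def swap_ids(infile, outfile, mapping, column=1):
    """Copy infile to outfile, replacing the ID in `column` via `mapping`."""
    for line in infile:
        # Split off only the leading fields; the wide tail stays one string.
        fields = line.rstrip("\r\n").split("\t", column + 1)
        if len(fields) > column:
            # IDs missing from the key file are left unchanged.
            fields[column] = mapping.get(fields[column], fields[column])
        outfile.write("\t".join(fields) + "\n")

# Small in-memory demonstration with made-up IDs.
key = io.StringIO("R001\tX9\nR002\tX7\n")
data = io.StringIO("fam1\tR001\t0\t1\t2\nfam2\tR003\t3\t4\t5\n")
out = io.StringIO()
swap_ids(data, out, load_key(key))
print(out.getvalue())
```

With real files, the same functions can be given the objects returned by `open(...)` inside a `with` block, which also takes care of closing them. Leaving unmatched IDs unchanged mirrors the `get` default used above; whether that is acceptable for de-identification is a separate decision.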