Tags: python, pandas, large-files, file-writing, fixed-width

How do I speed up this file creation process?


I am trying to create a large flat file with fixed-width columns that contains multiple record layers, but processing is very slow, most likely because I am iterating over every row. For context, this is for transmitting insurance policy information.

The hierarchy goes like this:

-Policy row
--Property on policy
---Coverage on property
--Property on policy
---Coverage on property
--Owner on policy
--Owner on policy
--Owner on policy

Currently I'm loading the four record types into separate DataFrames, then looping over each record type, pulling the child records by the parent record's ID, and writing them to the file. I'm hoping for some sort of hierarchical DataFrame merge that doesn't force me to rescan the whole DataFrame each time I need a child record.

import re
import pandas as pd
import math


def MakeNumeric(instring):
    # Strip everything that is not a digit and return the result as a string
    output = re.sub('[^0-9]', '', str(instring))
    return str(output)

def Pad(instring, padchar, length, align):
    # Truncate or pad a value to a fixed width, left- ('L') or right-aligned ('R')
    if instring is None:  # Takes care of NULL values
        instring = ''
    instring = str(instring).upper()
    instring = instring.replace(',', '').replace('\n', '').replace('\r', '')
    instring = instring[:length]
    if align == 'L':
        output = instring + (padchar * (length - len(instring)))
    elif align == 'R':
        output = (padchar * (length - len(instring))) + instring
    else:
        output = instring
    return output
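
# A vectorized alternative to Pad (a sketch only, not part of the original script,
# assuming the same cleanup rules): pandas can pad an entire column at once
# instead of calling Pad once per cell.
def PadSeries(series, padchar, length, align):
    s = series.fillna('').astype(str).str.upper()
    s = s.str.replace(r'[,\n\r]', '', regex=True).str.slice(0, length)
    side = 'right' if align == 'L' else 'left'
    return s.str.pad(length, side=side, fillchar=padchar)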

def FileCreation():
    POLR = pd.read_parquet(r'POLR.parquet')
    PRP1 = pd.read_parquet(r'PRP1.parquet')
    PROP = pd.read_parquet(r'PROP.parquet')
    SUBJ = pd.read_parquet(r'SUBJ.parquet')
    rownum = 1
    totalrownum = 1
    POLRCt = 0
    size = 900000
    POLR = [POLR.loc[i:i + size - 1, :] for i in range(0, len(POLR), size)]
    FileCt = 0
    print('Predicted File Count: ' + str(len(POLR)))  # one output file per chunk of `size` rows
    for df in POLR:
        FileCt += 1
        filename = r'OutputFile.' + Pad(FileCt, '0', 2, 'R')
        with open(filename, 'a+') as outfile:
            for i, row in df.iterrows():
                row[0] = Pad(rownum, '0', 9, 'R')
                row[1] = Pad(row[1], ' ', 4, 'L')
                row[2] = Pad(row[2], '0', 5, 'R')
                # I do this for all 50 columns
                outfile.write((','.join(row[:51])).replace(',', '') + '\n')
                rownum += 1
                totalrownum += 1
                for i2, row2 in PROP[PROP.ID == row[51]].iterrows():
                    row2[0] = Pad(rownum, '0', 9, 'R')
                    row2[1] = Pad(row2[1], ' ', 4, 'L')
                    row2[2] = Pad(row2[2], '0', 5, 'R')
                    # I do this for all 105 columns
                    outfile.write((','.join(row2[:106])).replace(',', '') + '\n')
                    rownum += 1
                    totalrownum += 1
                    for i3, row3 in PRP1[(PRP1['id'] == row2['ID']) & (PRP1['VNum'] == row2['vnum'])].iterrows():
                        row3[0] = Pad(rownum, '0', 9, 'R')
                        row3[1] = Pad(row3[1], ' ', 4, 'L')
                        row3[2] = Pad(row3[2], '0', 5, 'R')
                        # I do this for all 72 columns
                        outfile.write((','.join(row3[:73])).replace(',', '') + '\n')
                        rownum += 1
                        totalrownum += 1
                for i2, row2 in SUBJ[SUBJ['id'] == row['id']].iterrows():
                    row2[0] = Pad(rownum, '0', 9, 'R')
                    row2[1] = Pad(row2[1], ' ', 4, 'L')
                    row2[2] = Pad(row2[2], '0', 5, 'R')
                    # I do this for all 24 columns
                    outfile.write((','.join(row2[:25])).replace(',', '') + '\n')
                    rownum += 1
                    totalrownum += 1
                POLRCt += 1
                print('File {} of {}: {}% Finished'.format(FileCt, len(POLR), (POLRCt - 1) / len(df.index) * 100))
            rownum += 1
        rownum = 1
        POLRCt = 1

I'm essentially looking for a script that doesn't take multiple days to create a 27-million-record file.
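For illustration, this is roughly the kind of grouped lookup I'm after: pre-grouping each child table by its parent key once, so the inner loops become dictionary lookups instead of boolean scans over the whole DataFrame. The format_* helpers below are placeholders for the per-column Pad blocks, and it writes a single output file for brevity, so it's a sketch rather than a drop-in replacement:

import pandas as pd

POLR = pd.read_parquet(r'POLR.parquet')
PROP = pd.read_parquet(r'PROP.parquet')
PRP1 = pd.read_parquet(r'PRP1.parquet')
SUBJ = pd.read_parquet(r'SUBJ.parquet')

# Group each child table by its parent key once, up front
prop_by_policy = {k: g for k, g in PROP.groupby('ID')}
prp1_by_prop = {k: g for k, g in PRP1.groupby(['id', 'VNum'])}
subj_by_policy = {k: g for k, g in SUBJ.groupby('id')}

empty = pd.DataFrame()
with open('OutputFile.01', 'w') as outfile:
    for _, pol in POLR.iterrows():
        outfile.write(format_policy(pol))  # placeholder for the 50-column Pad block
        for _, prop in prop_by_policy.get(pol.iloc[51], empty).iterrows():  # same key column as row[51] above
            outfile.write(format_property(prop))  # placeholder for the 105-column block
            for _, cov in prp1_by_prop.get((prop['ID'], prop['vnum']), empty).iterrows():
                outfile.write(format_coverage(cov))  # placeholder for the 72-column block
        for _, subj in subj_by_policy.get(pol['id'], empty).iterrows():
            outfile.write(format_owner(subj))  # placeholder for the 24-column block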


Solution

  • I ended up populating temp tables for each record level and creating keys, then inserting them into a permanent staging table and assigning a clustered index to the keys. I then queried the results using OFFSET and FETCH NEXT %d ROWS ONLY to keep memory usage down, and used the multiprocessing library to split the workload across the CPU's cores. Ultimately, the combination of these changes reduced the runtime to about 20% of what it was when this question was originally posted.
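
A minimal sketch of the paging-plus-multiprocessing pattern described above, assuming a SQL Server staging table named StagingTable ordered by a RowKey column, a pyodbc connection string, and an approximate total row count; all of the names here are placeholders rather than the actual production code:

import multiprocessing as mp
import pyodbc  # assumed DB driver; any DB-API connection works the same way

BATCH = 100000
CONN_STR = 'DSN=staging'  # placeholder connection string
QUERY = ('SELECT * FROM StagingTable ORDER BY RowKey '
         'OFFSET ? ROWS FETCH NEXT ? ROWS ONLY')

def write_batch(offset):
    # Pull one page from the staging table and write it to its own part file
    conn = pyodbc.connect(CONN_STR)
    try:
        rows = conn.cursor().execute(QUERY, offset, BATCH).fetchall()
    finally:
        conn.close()
    with open('OutputFile.part{:04d}'.format(offset // BATCH), 'w') as out:
        for row in rows:
            out.write(''.join(str(col) for col in row) + '\n')

if __name__ == '__main__':
    total_rows = 27000000  # approximate record count from the question
    with mp.Pool() as pool:  # one worker per CPU core by default
        pool.map(write_batch, range(0, total_rows, BATCH))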