I am trying to create a large flat file with fixed width columns that contains multiple layers, but processing seems to be very slow, most likely because I am iterating over each row. For context, this is for transmitting insurance policy information.
The hierarchy goes like this:
-Policy row
--Property on policy
---Coverage on property
--Property on policy
---Coverage on property
--Owner on policy
--Owner on policy
--Owner on policy
Currently I'm loading the four record types into separate dataframes, and then doing a for loop over each type by pulling them based on the parent record's ID, and then writing them to the file. I'm hoping for some sort of hierarchical dataFrame merge that doesn't force me to scan the file each time I want a record.
import re
import pandas as pd
import math
def MakeNumeric(instring):
output = re.sub('[^0-9]', '', str(instring))
return str(output)
def Pad(instring, padchar, length, align):
if instring is None: # Takes care of NULL values
instring = ''
instring = str(instring).upper()
instring = instring.replace(',', '').replace('\n', '').replace('\r', '')
instring = instring[:length]
if align == 'L':
output = instring + (padchar * (length - len(instring)))
elif align == 'R':
output = (padchar * (length - len(instring))) + instring
else:
output = instring
return output
def FileCreation():
POLR = pd.read_parquet(r'POLR.parquet')
PRP1 = pd.read_parquet(r'PRP1.parquet')
PROP = pd.read_parquet(r'PROP.parquet')
SUBJ = pd.read_parquet(r'SUBJ.parquet')
rownum = 1
totalrownum = 1
POLRCt = 0
size = 900000
POLR = [POLR.loc[i:i + size - 1, :] for i in range(0, len(POLR), size)]
FileCt = 0
print('Predicted File Count: ' + str(math.ceil(len(POLR[0])/ size)) )
for df in POLR:
FileCt += 1
filename = r'OutputFile.' + Pad(FileCt, '0', 2, 'R')
with open(filename, 'a+') as outfile:
for i, row in df.iterrows():
row[0] = Pad(rownum, '0', 9, 'R')
row[1] = Pad(row[1], ' ', 4, 'L')
row[2] = Pad(row[2], '0', 5, 'R')
# I do this for all 50 columns
outfile.write((','.join(row[:51])).replace(',', '') + '\n')
rownum += 1
totalrownum += 1
for i2, row2 in PROP[PROP.ID == row[51]].iterrows():
row2[0] = Pad(rownum, '0', 9, 'R')
row2[1] = Pad(row2[1], ' ', 4, 'L')
row2[2] = Pad(row2[2], '0', 5, 'R')
# I do this for all 105 columns
outfile.write((','.join(row2[:106])).replace(',', '') + '\n')
rownum += 1
totalrownum += 1
for i3, row3 in PRP1[(PRP1['id'] == row2['ID']) & (PRP1['VNum'] == row2['vnum'])].iterrows():
row3[0] = Pad(rownum, '0', 9, 'R')
row3[1] = Pad(row3[1], ' ', 4, 'L')
row3[2] = Pad(row3[2], '0', 5, 'R')
# I do this for all 72 columns
outfile.write((','.join(row3[:73])).replace(',', '') + '\n')
rownum += 1
totalrownum += 1
for i2, row2 in SUBJ[SUBJ['id'] == row['id']].iterrows():
row2[0] = Pad(rownum, '0', 9, 'R')
row2[1] = Pad(row2[1], ' ', 4, 'L')
row2[2] = Pad(row2[2], '0', 5, 'R')
# I do this for all 24 columns
outfile.write((','.join(row2[:25])).replace(',', '') + '\n')
rownum += 1
totalrownum += 1
POLRCt += 1
print('File {} of {} '.format(str(FileCt),str(len(POLR)) ) + str((POLRCt - 1) / len(df.index) * 100) + '% Finished\r')
rownum += 1
rownum = 1
POLRCt = 1
I'm essentially looking for a script that doesn't take multiple days to create a 27M record file.
I ended up populating temp tables for each record level, and creating keys, then inserting them into a permanent staging table and assigning an clustered index to the keys.
I then queried the results while using OFFSET
and FETCH NEXT %d ROWS ONLY
to reduce memory size. I then used the multiprocessing library to break the workload out for each thread on the CPU.
Ultimately, the combination of these have reduced the runtime to about 20% of what it was when this question was originally posted.