Search code examples
pythonpandasperformanceconcatenation

Is there a better way to improve the concat speed?


I'm a student in Korea and I'm using python to analyze option data(finance). I'm finding a better way to speeding up the performance of my python code.

Target data is the transaction record(per minute) of the options and the period is from 2015 to 2019. Because the data is divided into 1227(the number of workdays during 5 years) files(txt), I tried to concatenate all 1227 files to minimize the number of accession to the memory. This is because I will use the result file(concatenated file = preprocessed file) repeatedly and accessing every separated file took too much time. Below is some part of my code.

#file_name is list type and it contains all names of the 1227 day files ordered by date

result_df = pd.DataFrame()
for f in file_name: 

    data_opt = pd.read_csv(location + f, header = None, sep = "\t")

    #do something
    #...
    #...

    oneday_df = pd.concat([minute_call, minute_put], axis = 0) #result of the processing one day data

    result_df = pd.concat([result_df, oneday_df], axis = 0)

result_df.to_csv()

This code works and I could get the proper result. However, I could see that the speed slowed down as time goes by. It means that my code works fast when it processes early data but its speed slows down when it processes late data. Is there any better way to speeding up the performance of my python code?

(Sorry for my awkward English and thank you for reading all questions)


Solution

  • Rather than concatenating in memory, keep the output CSV file open and write each part to it separately as you go?

    That way you'll never have more than one day's worth of data in memory at a time, improving not only speed but also memory consumption.

    Something like:

    with open('out_file.csv', 'w') as of:
        for i, f in enumerate(file_name): 
    
            data_opt = pd.read_csv(location + f, header = None, sep = "\t")
    
            #do something
            #...
            #...
    
            oneday_df = pd.concat([minute_call, minute_put], axis = 0) #result of the processing one day data
    
            is_first_part = (i == 0)
            oneday_df.to_csv(of, header=is_first_part)