python, pandas, python-multiprocessing, parquet

Efficiently write large pandas data to different files


I have a pandas dataframe of about 2 million rows and 80 columns.

I would like to write the dataframe out both as a CSV file and as a Parquet file.

Assume the dataframe is available in the variable df.

Initial approach:

import time

print('Creating csv and parquet files')
st = time.time()
df.to_csv('output_file.csv')
df.to_parquet('output_file.parquet')
print(f'Created csv and parquet files in {time.time() - st} seconds')

Writing the files this way takes too long. Since the two writes are independent operations, I assumed I could take advantage of multiple processes.

Newer approach:

from concurrent.futures import ProcessPoolExecutor


def build_csv(dataframe, output_filename):
    print(f'Building csv: {output_filename}')
    dataframe.to_csv(output_filename)


def build_parquet(dataframe, output_filename):
    print(f'Building parquet: {output_filename}')
    dataframe.to_parquet(output_filename)


with ProcessPoolExecutor(max_workers=3) as executor:
    executor.submit(build_csv, (df, 'output_file.csv'))
    executor.submit(build_parquet, (df, 'output_file.parquet'))

The newer approach runs without errors, but I do not see any files being created, and I am not sure why.

Is there a better (quicker) approach for writing a pandas dataframe to different files?


Solution

  • EDIT: I have kept the threading solution below for reference, but the multiprocessing solution here gets around the Python GIL. I've tested it and can see that both files are written successfully:

    from multiprocessing import Pool
    import pandas as pd
    
    # original data:
    data = pd.DataFrame([
        [ 1, 2, 3, 4,], 
        [ 1, 2, 3, 4,], 
        [ 1, 2, 3, 4,], 
        [ 1, 2, 3, 4,], 
        [ 1, 2, 3, 4,],
    ])    
    
    
    def SaveDataToCsv(data):
        print('Started export to .csv')
        data.to_csv('data.csv')
        print('Finished export to .csv')
    
    
    def SaveDataToParquet(data):
        print('Started export to .parquet')
        data.to_parquet('data.parquet')
        print('Finished export to .parquet')
    
    
    # multiprocessing method:
    pool = Pool(processes=2)
    process1 = pool.apply_async(SaveDataToCsv, [data])
    process2 = pool.apply_async(SaveDataToParquet, [data])

    # Wait for both workers to finish; without this, the pool's worker
    # processes can be terminated when the main process exits, before
    # the files have actually been written.
    pool.close()
    pool.join()
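
    As a follow-up, apply_async returns AsyncResult handles; calling get() on them blocks until the corresponding export finishes and re-raises any exception that occurred inside the worker, so a failed write cannot pass silently. A minimal usage sketch, assuming the process1 and process2 handles from the block above:

    # Surface any error raised inside a worker process; without this,
    # a failed export would go unnoticed.
    process1.get()
    process2.get()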
    

    I also tested the threading library, and it seems to work fine:

    import pandas as pd
    import threading
    
    # original data:
    data = pd.DataFrame([
        [ 1, 2, 3, 4,],
        [ 1, 2, 3, 4,],
        [ 1, 2, 3, 4,],
        [ 1, 2, 3, 4,],
        [ 1, 2, 3, 4,],
    ])
    
    
    def SaveDataToCsv(data):        
        data.to_csv('data.csv')
    
    
    def SaveDataToParquet(data):
        data.to_parquet('data.parquet')    
    
    thread1 = threading.Thread(target=SaveDataToCsv, args=(data,))
    thread2 = threading.Thread(target=SaveDataToParquet, args=(data,))
    
    thread1.start()
    thread2.start()

    # Wait for both writer threads to finish before using the output files.
    thread1.join()
    thread2.join()
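
    For reference, the ProcessPoolExecutor approach from the question can also work, provided the arguments are passed to submit() positionally (submit(fn, df, path)) rather than bundled into a single tuple, and the returned futures are checked with result(). With the tuple form, the worker raises a TypeError for the missing second argument, and because that exception is only stored on the future and never re-raised, no files appear and no error is shown. A minimal, self-contained sketch, assuming pyarrow or fastparquet is installed for to_parquet; the small sample frame and file names are placeholders:

    from concurrent.futures import ProcessPoolExecutor

    import pandas as pd


    def build_csv(dataframe, output_filename):
        print(f'Building csv: {output_filename}')
        dataframe.to_csv(output_filename)


    def build_parquet(dataframe, output_filename):
        print(f'Building parquet: {output_filename}')
        dataframe.to_parquet(output_filename)


    if __name__ == '__main__':
        # Small placeholder frame standing in for the large df from the question.
        df = pd.DataFrame([[1, 2, 3, 4]] * 5, columns=['a', 'b', 'c', 'd'])

        with ProcessPoolExecutor(max_workers=2) as executor:
            # Each argument is passed positionally, not as one tuple.
            csv_future = executor.submit(build_csv, df, 'output_file.csv')
            parquet_future = executor.submit(build_parquet, df, 'output_file.parquet')

            # result() blocks until the task finishes and re-raises any
            # exception from the worker, so failures are not silent.
            csv_future.result()
            parquet_future.result()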