
Python and Pandas: writing a large number of DataFrames


A script generates 50,000 DataFrames with the structure below and saves them one by one to local disk. To improve efficiency, I changed the output format from Excel to Parquet, but it does not seem to run any faster.

The 50,000 DataFrames are then looped over to filter the rows where the Meet column is 4 or 5, and those rows are saved to a final txt file.

What is a better approach to the above, so that the final file contains only the rows with 4 or 5 in Meet?

I'm thinking that, while generating the 50,000 DataFrames, the script could filter the rows (4 or 5 in Meet) instead of keeping all rows in each smaller DataFrame, and write directly to the final txt file instead of writing 50,000 separate files. That is, skip the step of writing each small DataFrame to local disk.

The number of rows may be in the millions, and I'm not sure whether a normal laptop can handle that (Win11, 16 GB RAM, no internet connection).

                             Dict                  DT  Length  Meet
0  {'key_0': 45, 'key_1': 67}  2023-10-15 14:32:10      15     5
1  {'key_0': 12, 'key_1': 34}  2023-10-12 09:15:45      19     3
2  {'key_0': 56, 'key_1': 89}  2023-10-20 11:45:30      13     7
3  {'key_0': 23, 'key_1': 45}  2023-10-05 08:20:00      17     4
4  {'key_0': 78, 'key_1': 12}  2023-10-10 16:05:55      10     6

Due to the length of the code (1315 lines) and privacy concerns, I'm sorry that I am not able to paste the full code here. I tried writing everything into one final DataFrame directly, but the power was accidentally lost once and the whole run had to be restarted.

big_df = []

for ...:
    # ... lines to generate df_small ...
    df_small = df_small[df_small['Meet'].isin([4, 5])]
    big_df.append(df_small)

# concatenate all filtered pieces in memory, then write a single file at the end
writing_df = pd.concat(big_df, ignore_index=True)
writing_df.to_excel('final.xlsx', index=False)

Solution

  • Since you already have a working process that generates the 50k data frames, consider adding one more step to the existing looping logic to filter and save the data to a growing CSV file. Specifically, turn the filtering-and-saving step into a function that appends to the CSV file, then run each data frame through it at the end of your main loop. This avoids building large in-memory objects with list.append, pd.concat, or repeated assignment.

    def filter_save_data(df, csv_file):
        # keep only the rows where Meet is 4 or 5 and append them to the CSV file
        df[df['Meet'].isin([4, 5])].to_csv(
            path_or_buf=csv_file, header=False, index=False, mode="a"
        )
    
    
    for ... :
       # ... looping logic to generate data ...
    
       filter_save_data(df, "output.csv")
    
       # ... other processing with data ...
    

    Be sure to delete any leftover CSV file and initialize it with the column headers before the looping logic starts, for example as sketched below.
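
    A minimal sketch of that initialization step, assuming the output file is named output.csv and the columns match the example data frame shown in the question:

    import os
    import pandas as pd

    csv_file = "output.csv"

    # remove any file left over from a previous (possibly interrupted) run
    if os.path.exists(csv_file):
        os.remove(csv_file)

    # write the header row once; the loop then appends data rows with header=False
    pd.DataFrame(columns=["Dict", "DT", "Length", "Meet"]).to_csv(csv_file, index=False)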


    In lieu of appending to a CSV, consider using DataFrame.to_sql to append to a table in a database such as SQLite, a file-level DBMS whose DB-API driver (sqlite3) ships with the Python standard library:

    from sqlalchemy import create_engine
    ...


    def filter_save_data(df):
        engine = create_engine('sqlite:////home/path/to/mydatabase.db')  # UNIX
        engine = create_engine(r'sqlite:///C:\path\to\mydatabase.db')    # WINDOWS

        # keep only the rows where Meet is 4 or 5 and append them to the table
        df[df['Meet'].isin([4, 5])].to_sql(
            name="mytable",
            con=engine,
            index=False,
            if_exists="append"
        )
    
    
    for ... :
       # ... looping logic to generate data ...
    
       filter_save_data(df)
    
       # ... other processing with data ...
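
    Note that the sketch above creates a new SQLAlchemy engine on every call, so 50,000 calls build 50,000 engines. A cheaper variant, under the same hypothetical function and table names, creates the engine once before the loop and passes it in:

    engine = create_engine(r'sqlite:///C:\path\to\mydatabase.db')   # created once, before the loop

    def filter_save_data(df, engine):
        # reuse the single engine instead of rebuilding it on every call
        df[df['Meet'].isin([4, 5])].to_sql(
            name="mytable", con=engine, index=False, if_exists="append"
        )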
    

    To later read back into pandas:

    import sqlite3
    import pandas as pd
    
    # open a connection to the same database file written during the loop
    conn = sqlite3.connect("/path/to/mydatabase.db")

    # read_sql works with a plain sqlite3 connection (read_sql_table would require SQLAlchemy)
    final_df = pd.read_sql("SELECT * FROM mytable", conn)
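
    If a flat txt file is still the desired end product, as in the original question, the combined result can be exported once at the end; the tab separator here is only an assumption about the intended txt layout:

    # write the final result as a tab-separated txt file (separator is an assumption)
    final_df.to_csv("final.txt", sep="\t", index=False)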