
Concat pandas dataframes as dicts


There are two dataframes:

df1

name foo bar
0 value1 value2 value3
1 value4 value5 value6
2 value7 value8 value9
3 value10 value11 value12
4 value13 value14 value15

df2

name foo bar
0 value10 value20 value30
1 value40 value50 value60
2 value70 value80 value90
3 value100 value110 value120
4 value130 value140 value150

How can I create dictionaries from the two dataframes and then concatenate them, as in the example below? The real dataframes have 10,000 rows and 100 columns. They are generated in real time in a loop, so I most likely cannot add them all to a list at once; I can only append them to each other gradually, one iteration at a time. If incremental concat is the only option, how can I avoid copying the frames on every iteration?

name foo bar
0 value1 value2 value3
1 value4 value5 value6
2 value7 value8 value9
3 value10 value11 value12
4 value13 value14 value15
0 value10 value20 value30
1 value40 value50 value60
2 value70 value80 value90
3 value100 value110 value120
4 value130 value140 value150

I am basing this on the article "Why does concatenation of DataFrames get exponentially slower?" and the benchmark at https://perfpy.com/16#/

I have tried the following without success. The problem is that I do not really understand how to construct a dictionary from a dataframe correctly in my case.

rows = []

df_a = df1
df_a = df_a.to_dict('dict')

rows.append(df_a)

df_a = df2
df_a = df_a.to_dict('dict')

rows.append(df_a)

df = pd.DataFrame(rows)

In the end, I want to loop over 40 dataframes similar to df1 and df2, with different values, and finally combine the dictionaries into one and build a dataframe from it.


Solution

  • I would do it like this:

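    import pandas as pd

    # Collect the frames in a plain Python list; appending to a list is cheap.
    list_dfs = []

    for df in [df1, df2]:
        list_dfs.append(df)

    # Concatenate once at the end, so each frame is copied only once.
    df_output = pd.concat(list_dfs)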
    

    By contrast, avoid growing the result inside the loop like this. Each iteration copies everything accumulated so far, so it is slow and memory-intensive for a large number of dataframes:

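    df_out = pd.DataFrame()
    for df in [df1, df2]:
        # Re-copies all previously accumulated rows on every iteration.
        df_out = pd.concat([df_out, df])
        # or, before pandas 2.0 (DataFrame.append has since been removed):
        # df_out = df_out.append(df)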
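
Since the question asks specifically about dictionaries, here is a minimal sketch comparing the two ways of accumulating data in a loop. `make_frame` is a hypothetical stand-in for whatever generates each dataframe in the real loop, and the frame count (40) and column names are taken from the question. The likely reason the attempt in the question fails is that `to_dict('dict')` returns one nested dict per frame (column -> index -> value), whereas `pd.DataFrame(rows)` expects one dict per row, which is what `to_dict('records')` produces.

```python
import pandas as pd


def make_frame(i, n_rows=5):
    # Hypothetical generator: stands in for the code that builds each
    # dataframe inside the real loop.
    return pd.DataFrame({
        "name": [f"name_{i}_{r}" for r in range(n_rows)],
        "foo": [f"foo_{i}_{r}" for r in range(n_rows)],
        "bar": [f"bar_{i}_{r}" for r in range(n_rows)],
    })


# Option 1: collect the frames themselves and concatenate once at the end.
frames = []
for i in range(40):
    frames.append(make_frame(i))
df_frames = pd.concat(frames)  # keeps each frame's own index (0..4 repeated)

# Option 2: collect one dict per row, then build a single dataframe.
rows = []
for i in range(40):
    df = make_frame(i)
    # 'records' yields a list of {column: value} dicts, one per row;
    # extend() adds them as individual rows rather than as one nested item.
    rows.extend(df.to_dict("records"))
df_records = pd.DataFrame(rows)  # gets a fresh 0..N-1 index
```

Both options hold the same values. Option 1 preserves the repeated 0-4 index shown in the desired output, while Option 2 discards the original index. For frames of 10,000 rows by 100 columns, converting every row to a Python dict is probably the slower and more memory-hungry route, so collecting the frames in a list and calling `pd.concat` once is likely the better default.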