Tags: python, pandas, performance

Pandas performance while iterating a state vector


I want to make a pandas dataframe that describes the state of a system at different times:

  • I have the initial state, which describes the first row
  • Each row corresponds to a time
  • I have reserved the first two columns for "household"/statistics
  • The following columns are state parameters
  • At each iteration/row a number of parameters change; this could be just one or many

I have created a somewhat simplified version that simulates my change data: df_change

Question 1

Can you think of a more efficient way of generating the matrix than what I do in this code? I have a state that I update in a loop and then insert row by row.

Question 2

This is what I discovered while trying to write the sample code for this discussion. I see a 20-fold performance boost in loop iteration speed if I do the assignments to the "household" columns after the loop instead of before it. Why is this? I am using Python 3.12.4 and pandas 2.2.2.

df["product"] ="some_product"
#%%

import numpy as np
import pandas as pd
from tqdm import tqdm

num_cols = 600
n_changes = 40000


# simulate changes

extra_colums = ["n_changes", "product"]
columns = [chr(i + 65) for i in range(num_cols)]

state = {icol: np.random.random() for icol in columns}

change_index = np.random.randint(0, 4, n_changes).cumsum()
change_col = [columns[np.random.randint(0, num_cols)] for i in range(n_changes)]
change_val = np.random.normal(size=n_changes)

# create change matrix
df_change = pd.DataFrame(index=change_index)
df_change['col'] = change_col
df_change['val'] = change_val
index = np.unique(change_index)


# %%
# Slow approach: gives ~5 iterations/s
df = pd.DataFrame(index=index, columns=extra_colums + columns)
df["product"] = "some_product"
for i in tqdm(index):
    state.update(zip(df_change.loc[[i], "col"], df_change.loc[[i], "val"]))
    df.loc[i, columns] = pd.Series(state)

# %%
# Fast approach: gives ~1000 iterations/s
df2 = pd.DataFrame(index=index, columns=extra_colums + columns)
for i in tqdm(index):
    state.update(zip(df_change.loc[[i], "col"], df_change.loc[[i], "val"]))
    df2.loc[i, columns] = pd.Series(state)
df2["product"] = "some_product"

Edit

I marked the answer by ouroboros1 as the accepted solution - it works really well and answered Question 1.

I am still curious about Question 2: the difference in pandas performance between the two methods where I iterate through the rows. I found that I can also get performance similar to the original "df2" method, depending on how I assign the value before the loop.

The interesting point here is that the pre-assignment changes the performance of the loop that follows.

# Fast approach gives 1000 iterations/sec
df3 = pd.DataFrame(index=index, columns=extra_colums + columns)

#df3.loc[index,"product"] = "some_product" # Fast  
#df3["product"] = "some_product"           # Slow  
df3.product = "some_product"               # Fast

for i in tqdm(index):
    state.update(zip(df_change.loc[[i], "col"], df_change.loc[[i], "val"]))
    df3.loc[i, columns] = np.array(list(state.values())) 
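
One way to probe Question 2 is to check how each assignment style affects the frame's internal block layout. The sketch below leans on _mgr, which is private pandas API (observed on pandas 2.x), so treat the exact counts as version-specific; probe is a hypothetical helper of mine.

def probe(label, assign):
    # build a fresh empty frame, apply one assignment style, and report
    # how many internal blocks the frame ends up with
    d = pd.DataFrame(index=index, columns=extra_colums + columns)
    assign(d)
    print(f"{label:32s} -> {d._mgr.nblocks} block(s)")  # _mgr is private API

probe("no pre-assignment", lambda d: None)
probe('d["product"] = ...', lambda d: d.__setitem__("product", "some_product"))
probe('d.loc[index, "product"] = ...',
      lambda d: d.loc.__setitem__((index, "product"), "some_product"))

# Note: `d.product = "some_product"` sets a plain Python attribute that
# shadows the DataFrame.product method; it creates no column at all, so
# it cannot fragment the blocks in the first place.

If df["product"] = ... splits the single object block while df.loc[index, "product"] = ... writes into it in place, then every df.loc[i, columns] assignment in the loop has to work across a fragmented block manager, which would be consistent with the slowdown seen here.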




Solution

  • Here's an approach that should be much faster:

    Data sample

    num_cols = 4
    n_changes = 6
    np.random.seed(0) # reproducibility
    
    # setup ...
    
    df_change
    
       col       val
    1    C  0.144044
    4    A  1.454274
    5    A  0.761038
    7    A  0.121675
    7    C  0.443863
    10   B  0.333674
    
    state
    
    {'A': 0.5488135039273248,
     'B': 0.7151893663724195,
     'C': 0.6027633760716439,
     'D': 0.5448831829968969}
    

    Code

    out = (df_change.reset_index()
           .pivot_table(index='index', 
                        columns='col', 
                        values='val',
                        aggfunc='last')
           .rename_axis(index=None, columns=None)
           .assign(product='some_product')
           .reindex(columns=extra_colums + columns)
           .fillna(pd.DataFrame(state, index=[index[0]]))
           .ffill()
           )
    

    Output

        n_changes       product         A         B         C         D
    1         NaN  some_product  0.548814  0.715189  0.144044  0.544883
    4         NaN  some_product  1.454274  0.715189  0.144044  0.544883
    5         NaN  some_product  0.761038  0.715189  0.144044  0.544883
    7         NaN  some_product  0.121675  0.715189  0.443863  0.544883
    10        NaN  some_product  0.121675  0.333674  0.443863  0.544883
    
    # note:
    # A updated in 4, 5, 7
    # B updated in 10
    # C updated in 1, 7
    

    Explanation / Intermediates

    • Use df.reset_index to access 'index' inside df.pivot_table. For the aggfunc, use 'last': we only need to propagate the last value in case of duplicate 'col' values per index value.
    • Cosmetic: use df.rename_axis to reset the index and column names to None.
    # df_change.reset_index().pivot_table(...).rename_axis(...)
    
               A         B         C
    1        NaN       NaN  0.144044
    4   1.454274       NaN       NaN
    5   0.761038       NaN       NaN
    7   0.121675       NaN  0.443863
    10       NaN  0.333674       NaN
    
    • Use df.assign to add column 'product' with a scalar ('some_product').
    • Use df.reindex to get the columns in the desired order (with extra_colums up front). The not-yet-existing column 'n_changes' is added with NaN values.
    • Now, apply df.fillna with a pd.DataFrame holding state for only the first index value (index[0]) to fill the first row (alternatively, use df.combine_first; see the sketch after this list).
    # after .fillna(...)
    
        n_changes       product         A         B         C         D
    1         NaN  some_product  0.548814  0.715189  0.144044  0.544883
    4         NaN  some_product  1.454274       NaN       NaN       NaN
    5         NaN  some_product  0.761038       NaN       NaN       NaN
    7         NaN  some_product  0.121675       NaN  0.443863       NaN
    10        NaN  some_product       NaN  0.333674       NaN       NaN
    
    • Finally, we want to forward fill: df.ffill.
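
    As mentioned above, df.combine_first can stand in for the df.fillna step. A sketch of the same pipeline with that substitution (out2 is just an illustrative name; note that combine_first may reorder columns, hence the reindex placed after it):

    out2 = (df_change.reset_index()
            .pivot_table(index='index',
                         columns='col',
                         values='val',
                         aggfunc='last')
            .rename_axis(index=None, columns=None)
            .assign(product='some_product')
            .combine_first(pd.DataFrame(state, index=[index[0]]))  # seed first row
            .reindex(columns=extra_colums + columns)
            .ffill())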

    Performance comparison:

    num_cols = 100
    n_changes = 100
    np.random.seed(0) # reproducibility
    
    # out:
    7.01 ms ± 435 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    # df2 (running this *afterwards*, as you are updating `state`)
    93.7 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
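
    For completeness, a sketch of how such a timing can be reproduced outside IPython's %timeit; the build_out wrapper is mine, added purely for timing:

    import timeit

    def build_out():
        # the pivot_table pipeline from the Solution, wrapped for timing
        return (df_change.reset_index()
                .pivot_table(index='index', columns='col',
                             values='val', aggfunc='last')
                .rename_axis(index=None, columns=None)
                .assign(product='some_product')
                .reindex(columns=extra_colums + columns)
                .fillna(pd.DataFrame(state, index=[index[0]]))
                .ffill())

    n = 100
    print(f"{timeit.timeit(build_out, number=n) / n * 1e3:.2f} ms per loop")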
    

    Equality check:

    df2.astype(out.dtypes).equals(out)
    # True

    The astype is needed because the loop-built df2 holds its values in object-dtype columns, while out stores the state parameters as floats.