
pandas DataFrame slicing performance is affected by how a subset was previously assigned


In a recent post, Pandas performance while iterating a state vector, I noticed a performance issue when slicing pandas DataFrames that I do not understand.

The code presented here does not do anything useful, but it highlights the issue:

  • I create a DataFrame with two groups of columns, named extra_column and columns.
  • The part of the code that takes time to execute is the loop, where slices in columns are assigned.

What baffles me is that the way I assign values to the extra column before the loop affects the loop's performance.

Python code

import timeit

setup_stmt = """
import pandas as pd
num_cols = 500
n_iter = 100
extra_column = ["product"]
columns = [chr(i + 65) for i in range(num_cols)]
index = range(n_iter)
"""

stmt1 = """
df = pd.DataFrame(index=index, columns=extra_column + columns)
df["product"] = "x"
for i in index:
    df.loc[i,columns] = 0
"""

stmt2 = """
df = pd.DataFrame(index=index, columns=extra_column + columns)
df.product = "x"
for i in index:
    df.loc[i,columns] = 0
"""

stmt3 = """
df = pd.DataFrame(index=index, columns=extra_column + columns)
df.loc[index,"product"] = "x"
for i in index:
    df.loc[i,columns] = 0
"""

stmt4 = """
df = pd.DataFrame(index=index, columns=extra_column + columns)
for i in index:
    df.loc[i,columns] = 0
df["product"] = "x"
"""

print(f" stmt1 takes {timeit.timeit(setup=setup_stmt, stmt=stmt1, number=10):2.2f} seconds")
print(f" stmt2 takes {timeit.timeit(setup=setup_stmt, stmt=stmt2, number=10):2.2f} seconds")
print(f" stmt3 takes {timeit.timeit(setup=setup_stmt, stmt=stmt3, number=10):2.2f} seconds")
print(f" stmt4 takes {timeit.timeit(setup=setup_stmt, stmt=stmt4, number=10):2.2f} seconds")

Output

 stmt1 takes 20.60 seconds
 stmt2 takes 0.46 seconds
 stmt3 takes 0.46 seconds
 stmt4 takes 0.46 seconds

Solution

  • TL;DR

    Your original df has a single block in memory.

    With stmt1, use of df["product"] = "x" makes the BlockManager (an internal memory manager) add a new block. Having multiple blocks adds overhead, as pandas needs to check and consolidate them each time a row gets modified.

    With stmt3, you do not have this issue, as df.loc[index,"product"] = "x" is an in-place modification that keeps the original single block intact.

    stmt2 should be ignored (see note at the end). stmt4 is irrelevant, as the second block is created only after the for loop.


    Answer

    The difference in performance between your stmt1 and stmt3 has to do with the so-called BlockManager, which is an internal manager that tries to keep columns with compatible dtypes together as blocks in memory.

    • Initial situation: one block

    Useful information about the use of the BlockManager for a specific pd.DataFrame can be retrieved by accessing df._mgr. With your example:

    import pandas as pd
    
    num_cols = 3
    n_iter = 3
    extra_column = ["product"]
    columns = [chr(i+65) for i in range(num_cols)]
    index = range(n_iter)
    
    df = pd.DataFrame(index=index, columns=extra_column + columns)
    
      product    A    B    C
    0     NaN  NaN  NaN  NaN
    1     NaN  NaN  NaN  NaN
    2     NaN  NaN  NaN  NaN
    
    df._mgr
    
    BlockManager
    Items: Index(['product', 'A', 'B', 'C'], dtype='object')
    Axis 1: RangeIndex(start=0, stop=3, step=1)
    NumpyBlock: slice(0, 4, 1), 4 x 3, dtype: object    # all cols
    

    So, here we see that the BlockManager is working with a single block in memory.
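
    If you only need the block count, the manager also exposes an nblocks attribute. Note that ._mgr and .nblocks are internal API, so treat the following quick check as a sketch that may break in future pandas versions:

    print(df._mgr.nblocks)    # 1 -> the whole frame lives in a single block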

    • stmt1: adding a new column / replacing one adds a block

    If we now use bracket notation ([]) to assign "x" to the column "product", we are really re-creating that column. As a result, a second block is created:

    df["product"] = "x"
    print(df._mgr)
    
    BlockManager
    Items: Index(['product', 'A', 'B', 'C'], dtype='object')
    Axis 1: RangeIndex(start=0, stop=3, step=1)
    NumpyBlock: slice(1, 4, 1), 3 x 3, dtype: object    # cols "A", "B", "C"
    NumpyBlock: slice(0, 1, 1), 1 x 3, dtype: object    # col "product"
    

    The important thing here is that this assignment replaces the old column "product": it is effectively a new column. The same thing happens if we use df.loc to create a new column:

    df.loc[:, "new_col"] = "x"
    print(df._mgr)
    
    BlockManager
    Items: Index(['product', 'A', 'B', 'C', 'new_col'], dtype='object')
    Axis 1: RangeIndex(start=0, stop=3, step=1)
    NumpyBlock: slice(1, 4, 1), 3 x 3, dtype: object    # cols "A", "B", "C"
    NumpyBlock: slice(0, 1, 1), 1 x 3, dtype: object    # col "product"
    NumpyBlock: slice(4, 5, 1), 1 x 3, dtype: object    # col "new_col"
    
    • stmt3: in-place modification keeps block intact

    Here we see the difference with df.loc[index,"product"] = "x", because in this case we are not re-creating "product", we are simply updating its values. This does not create a new block:

    df = pd.DataFrame(index=index, columns=extra_column + columns)
    df.loc[index,"product"] = "x"
    
    print(df._mgr)
    
    BlockManager
    Items: Index(['product', 'A', 'B', 'C'], dtype='object')
    Axis 1: RangeIndex(start=0, stop=3, step=1)
    NumpyBlock: slice(0, 4, 1), 4 x 3, dtype: object    # "product" still here
    

    Key takeaways

    The upshot of all this for the different versions you use:

    • stmt1 with df["product"] = "x" internally has two blocks
    • stmt3 with df.loc[index,"product"] = "x" internally keeps one block
    • stmt4 with df["product"] = "x" after the for loop only ends up with two blocks once the loop has finished, so the loop itself still runs on a single block.

    The significant delay for stmt1 is caused by pandas needing to reconcile multiple blocks each time df.loc[i,columns] = 0 is executed in the loop. These internal checks trigger extra memory operations, as pandas must align modified rows across separate blocks. This results in a sizeable slowdown compared to the single-block df.
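
    To tie the timings back to the block structure, you can count the blocks right before the loop starts. The n_blocks helper below is only an illustrative sketch: it relies on the internal ._mgr attribute, which may change between pandas versions:

    import pandas as pd

    cols = [chr(i + 65) for i in range(5)]        # "A" .. "E"

    def n_blocks(frame: pd.DataFrame) -> int:
        # number of internal blocks backing the frame (internal API)
        return frame._mgr.nblocks

    df1 = pd.DataFrame(index=range(3), columns=["product"] + cols)
    df1["product"] = "x"                          # stmt1-style: the column is re-created
    print(n_blocks(df1))                          # 2 -> slow per-row updates in the loop

    df3 = pd.DataFrame(index=range(3), columns=["product"] + cols)
    df3.loc[range(3), "product"] = "x"            # stmt3-style: values updated in place
    print(n_blocks(df3))                          # 1 -> fast per-row updates in the loop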

    Interestingly, df.copy() leads to a reset of the blocks. Consequently, adding df = df.copy() brings the performance of stmt1 very close to that of stmt3 again:

    # adding: `df = df.copy()`
    stmt1 = """
    df = pd.DataFrame(index=index, columns=extra_column + columns)
    df["product"] = "x"
    df = df.copy()
    for i in index:
        df.loc[i,columns] = 0
    """

    print(f" stmt1 takes {timeit.timeit(setup=setup_stmt, stmt=stmt1, number=10):2.2f} seconds")
    print(f" stmt3 takes {timeit.timeit(setup=setup_stmt, stmt=stmt3, number=10):2.2f} seconds")
    

    Prints:

     stmt1 takes 1.00 seconds
     stmt3 takes 0.99 seconds
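
    The effect can also be checked directly via the internal ._mgr (an illustrative check only, since this is not public API):

    df = pd.DataFrame(index=range(3), columns=["product", "A", "B", "C"])
    df["product"] = "x"
    print(df._mgr.nblocks)    # 2 -> split into two blocks

    df = df.copy()
    print(df._mgr.nblocks)    # 1 -> the copy is consolidated into a single block again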
    

    Further reading

    Some interesting reads on this complex topic and the difficulty of establishing its influence for specific use cases:

    There are plans to replace the BlockManager: see here, cf. here.


    A note on stmt2

    stmt2 should be ignored here, because it is not doing what you think it does. Dot notation is a convenience feature that provides attribute access to a DataFrame column, but it comes with a few caveats, one of which is:

    The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed, but s['min'] is possible.

    This applies here, because product is an existing method of pd.DataFrame. I.e., when you do df.product = "x", you never touch the column at all: you merely shadow the method with an instance attribute that holds the string "x":

    df = pd.DataFrame({"product": [1]})
    type(df.product)
    
    method
    
    df.product = "x"
    type(df.product)
    
    str
    

    I.e., we never updated the actual df:

       product
    0        1      # nothing changed
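
    For contrast, bracket assignment does write to the column:

    df = pd.DataFrame({"product": [1]})
    df["product"] = "x"
    print(df)

      product
    0       x      # the column did change this time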