
pandas DataFrame slicing performance is affected by how a subset was previously assigned


In a recent post, Pandas performance while iterating a state vector, I noticed a performance issue when slicing pandas DataFrames that I do not understand.

The code presented here does not do anything useful, but it highlights the issue:

  • I create a DataFrame with two groups of columns, named extra_column and columns.
  • The part of the code that takes time to execute is the loop, where slices in columns are assigned.

What baffles me is that the way I assign values to the extra column before the loop affects the loop's performance.

Python code

import timeit

setup_stmt = """
import pandas as pd
num_cols = 500
n_iter = 100
extra_column = ["product"]
columns = [chr(i + 65) for i in range(num_cols)]
index = range(n_iter)
"""

stmt1 = """
df = pd.DataFrame(index=index, columns=extra_column + columns)
df["product"] = "x"
for i in index:
    df.loc[i,columns] = 0
"""

stmt2 = """
df = pd.DataFrame(index=index, columns=extra_column + columns)
df.product = "x"
for i in index:
    df.loc[i,columns] = 0
"""

stmt3 = """
df = pd.DataFrame(index=index, columns=extra_column + columns)
df.loc[index,"product"] = "x"
for i in index:
    df.loc[i,columns] = 0
"""

stmt4 = """
df = pd.DataFrame(index=index, columns=extra_column + columns)
for i in index:
    df.loc[i,columns] = 0
df["product"] = "x"
"""

print(f" stmt1 takes {timeit.timeit(setup=setup_stmt, stmt=stmt1, number=10):2.2f} seconds")
print(f" stmt2 takes {timeit.timeit(setup=setup_stmt, stmt=stmt2, number=10):2.2f} seconds")
print(f" stmt3 takes {timeit.timeit(setup=setup_stmt, stmt=stmt3, number=10):2.2f} seconds")
print(f" stmt4 takes {timeit.timeit(setup=setup_stmt, stmt=stmt4, number=10):2.2f} seconds")

Output

 stmt1 takes 20.60 seconds
 stmt2 takes 0.46 seconds
 stmt3 takes 0.46 seconds
 stmt4 takes 0.46 seconds

Solution

  • TL;DR

    Your original df has a single block in memory.

    With stmt1, use of df["product"] = "x" makes the BlockManager (an internal memory manager) add a new block. Having multiple blocks adds overhead, as pandas needs to check and consolidate them each time a row gets modified.

    With stmt3, you do not have this issue, as df.loc[index,"product"] = "x" is an in-place modification that keeps the original single block intact.

    stmt2 should be ignored (see note at the end). stmt4 is irrelevant, as the second block is created only after the for loop.


    Answer

    The difference in performance between your stmt1 and stmt3 has to do with the so-called BlockManager, which is an internal manager that tries to keep columns with compatible dtypes together as blocks in memory.

    • Initial situation: one block

    Useful information about the use of the BlockManager for a specific pd.DataFrame can be retrieved by accessing df._mgr. With your example:

    import pandas as pd
    
    num_cols = 3
    n_iter = 3
    extra_column = ["product"]
    columns = [chr(i+65) for i in range(num_cols)]
    index = range(n_iter)
    
    df = pd.DataFrame(index=index, columns=extra_column + columns)
    
      product    A    B    C
    0     NaN  NaN  NaN  NaN
    1     NaN  NaN  NaN  NaN
    2     NaN  NaN  NaN  NaN
    
    df._mgr
    
    BlockManager
    Items: Index(['product', 'A', 'B', 'C'], dtype='object')
    Axis 1: RangeIndex(start=0, stop=3, step=1)
    NumpyBlock: slice(0, 4, 1), 4 x 3, dtype: object    # all cols
    

    So, here we see that the BlockManager is working with a single block in memory.
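
    If you only need the block count, the manager also exposes an nblocks attribute. Note that ._mgr and .nblocks are internal API, so treat the following quick check as a sketch that may break in future pandas versions:

    print(df._mgr.nblocks)    # 1 -> the whole frame lives in a single block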

    • stmt1: adding a new column / replacing one adds a block

    If we now use bracket notation ([]) to assign "x" to the column "product", we are really re-creating that column. As a result, a second block is created:

    df["product"] = "x"
    print(df._mgr)
    
    BlockManager
    Items: Index(['product', 'A', 'B', 'C'], dtype='object')
    Axis 1: RangeIndex(start=0, stop=3, step=1)
    NumpyBlock: slice(1, 4, 1), 3 x 3, dtype: object    # cols "A", "B", "C"
    NumpyBlock: slice(0, 1, 1), 1 x 3, dtype: object    # col "product"
    

    The important thing here is that this assignment replaces the old column "product": it is effectively a new column. The same thing happens if we use df.loc to create a new column:

    df.loc[:, "new_col"] = "x"
    print(df._mgr)
    
    BlockManager
    Items: Index(['product', 'A', 'B', 'C', 'new_col'], dtype='object')
    Axis 1: RangeIndex(start=0, stop=3, step=1)
    NumpyBlock: slice(1, 4, 1), 3 x 3, dtype: object    # cols "A", "B", "C"
    NumpyBlock: slice(0, 1, 1), 1 x 3, dtype: object    # col "product"
    NumpyBlock: slice(4, 5, 1), 1 x 3, dtype: object    # col "new_col"
    
    • stmt3: in-place modification keeps block intact

    Here we see the difference with df.loc[index,"product"] = "x", because in this case we are not re-creating "product", we are simply updating its values. This does not create a new block:

    df = pd.DataFrame(index=index, columns=extra_column + columns)
    df.loc[index,"product"] = "x"
    
    print(df._mgr)
    
    BlockManager
    Items: Index(['product', 'A', 'B', 'C'], dtype='object')
    Axis 1: RangeIndex(start=0, stop=3, step=1)
    NumpyBlock: slice(0, 4, 1), 4 x 3, dtype: object    # "product" still here
    

    Key takeaways

    The upshot of all this for the different versions you use:

    • stmt1 with df["product"] = "x" internally has two blocks
    • stmt3 with df.loc[index,"product"] = "x" internally keeps one block
    • stmt4 with df["product"] = "x" after the for loop only ends up with two blocks once the loop has finished, so the loop itself still runs on a single block.

    The significant delay for stmt1 is caused by pandas needing to reconcile multiple blocks each time df.loc[i,columns] = 0 is executed in the loop. These internal checks trigger extra memory operations, as pandas must align modified rows across separate blocks. This results in a sizeable slowdown compared to the single-block df.
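
    To tie the timings back to the block structure, you can count the blocks right before the loop starts. The n_blocks helper below is only an illustrative sketch: it relies on the internal ._mgr attribute, which may change between pandas versions:

    import pandas as pd

    cols = [chr(i + 65) for i in range(5)]        # "A" .. "E"

    def n_blocks(frame: pd.DataFrame) -> int:
        # number of internal blocks backing the frame (internal API)
        return frame._mgr.nblocks

    df1 = pd.DataFrame(index=range(3), columns=["product"] + cols)
    df1["product"] = "x"                          # stmt1-style: the column is re-created
    print(n_blocks(df1))                          # 2 -> slow per-row updates in the loop

    df3 = pd.DataFrame(index=range(3), columns=["product"] + cols)
    df3.loc[range(3), "product"] = "x"            # stmt3-style: values updated in place
    print(n_blocks(df3))                          # 1 -> fast per-row updates in the loop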

    Interestingly, df.copy() leads to a reset of the blocks. Consequently, adding df = df.copy() brings the performance of stmt1 very close to that of stmt3 again:

    # adding: `df = df.copy()`
    stmt1 = """
    df = pd.DataFrame(index=index, columns=extra_column + columns)
    df["product"] = "x"
    df = df.copy()
    for i in index:
        df.loc[i,columns] = 0
    """

    print(f" stmt1 takes {timeit.timeit(setup=setup_stmt, stmt=stmt1, number=10):2.2f} seconds")
    print(f" stmt3 takes {timeit.timeit(setup=setup_stmt, stmt=stmt3, number=10):2.2f} seconds")
    

    Prints:

     stmt1 takes 1.00 seconds
     stmt3 takes 0.99 seconds
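
    The effect can also be checked directly via the internal ._mgr (an illustrative check only, since this is not public API):

    df = pd.DataFrame(index=range(3), columns=["product", "A", "B", "C"])
    df["product"] = "x"
    print(df._mgr.nblocks)    # 2 -> split into two blocks

    df = df.copy()
    print(df._mgr.nblocks)    # 1 -> the copy is consolidated into a single block again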
    

    Further reading

    Some interesting reads on this complex topic and the difficulty of establishing its influence for specific use cases:

    There are plans to replace the BlockManager: see here, cf. here.


    A note on stmt2

    stmt2 should be ignored here, because it is not doing what you think it does. Dot notation is a convenience feature that provides attribute access to a DataFrame column, but it comes with a few caveats, one of which is:

    The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed, but s['min'] is possible.

    This applies here, because product is an existing method of pd.DataFrame. I.e., when you do df.product = "x", you never touch the column at all: you merely shadow the method with an instance attribute that holds the string "x":

    df = pd.DataFrame({"product": [1]})
    type(df.product)
    
    method
    
    df.product = "x"
    type(df.product)
    
    str
    

    I.e., we never updated the actual df:

       product
    0        1      # nothing changed
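
    For contrast, bracket assignment does write to the column:

    df = pd.DataFrame({"product": [1]})
    df["product"] = "x"
    print(df)

      product
    0       x      # the column did change this time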