
Processing all cells in a 2d DataFrame is very slow - what am I missing?


I have a dataframe that comes from a csv file. I pivot this dataframe by two columns, and then attempt to process each cell.

In my actual code, which is of course much more complicated, there is some processing for each value, but I noticed it is extremely slow. So I created a minimal example below in which I do nothing to the values - I just iterate through them and assign each one back to its cell. This rules out my processing as the cause and shows that the access/assignment itself is what's slow:

import pandas as pd
from tqdm import tqdm


def test_speed():
    df = pd.read_csv('https://media.githubusercontent.com/media/datablist/sample-csv-files/main/files/customers/customers-100.csv')
    df = pd.pivot(df, index='Index', columns='Country')

    original = df.copy()

    inputs = df.columns.values

    # process each row at a time
    for row_index in tqdm(range(df.shape[0])): 
        row_data = df.iloc[row_index]  # get the row

        for input_idx in range(len(inputs)):
            # for each column label, get the value
            val = row_data[inputs[input_idx]]  

            # just assign them back to that cell 
            df.iloc[row_index, input_idx] = val

    print(f'Are they equal? {original.equals(df)}')


if __name__ == '__main__':
    test_speed()

I am getting this:

100%|██████████| 100/100 [00:04<00:00, 22.63it/s]
Are they equal? True

So, just 22 rows processed per second.

If I do the same processing iterating through the csv with a DictReader, I get about 5000 rows per second.

I am obviously missing something important on how to access cells in a DataFrame.

My objective is to traverse all cells in this 2d pivot table and assign the value (possibly altered) back in an efficient way.

How can I speed up this code?

UPDATE: if I do the processing before the pivot, it goes up to 1600 iterations/second. I think the indexes are being rebuilt whenever I assign anything? I never assign values to the row/column I used to pivot, so this is surprising - is there any way to keep the pivot and the performance?


Solution

  • Working on the NumPy array

    Extract all values from the dataframe (df.values) and work on the numpy array instead of the dataframe.

    Less readable, but much, much faster.

  • Vectorization

    As others said in the comments, using vectorized operations will make the whole thing even faster by removing the need for Python for loops. That said, you may not need to improve performance further ("premature optimization is the root of all evil").
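For illustration, here is a hedged sketch of what vectorizing the per-cell work could look like, assuming the processing can be expressed as whole-array arithmetic (the doubling and thresholding below are made-up stand-ins for your real logic):

```python
import numpy as np
import pandas as pd

# synthetic frame standing in for the pivoted CSV
df = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=list('abcd'))

# whole-frame arithmetic replaces both Python loops
df = df * 2

# conditional per-cell logic vectorizes with numpy.where:
# keep values above 10, zero out the rest
df = pd.DataFrame(np.where(df > 10, df, 0.0),
                  index=df.index, columns=df.columns)
```

Whether this is possible depends entirely on the per-cell processing: if it is arithmetic or simple conditionals, it vectorizes cleanly; if it involves arbitrary Python code per cell, the numpy-array loop or numba are the fallbacks.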

  • Numba

    Sometimes vectorization is hard or even impossible to express with the available pandas/numpy functions. In such cases, the numba package may help: if your code is numba-compatible, you should get a huge performance improvement.