I have a dataframe that comes from a csv file. I pivot this dataframe by two columns, and then attempt to process each cell.
In my actual code, which is of course much more complicated, each value gets some processing, but I noticed it is extremely slow. So I created a minimal example here to illustrate the issue, where I don't do anything to the values at all: I just iterate through them and assign the same value back to each cell. This rules out my processing as the culprit and shows that the access / assignment itself is slow:
import pandas as pd
from tqdm import tqdm

def test_speed():
    df = pd.read_csv('https://media.githubusercontent.com/media/datablist/sample-csv-files/main/files/customers/customers-100.csv')
    df = pd.pivot(df, index='Index', columns='Country')
    original = df.copy()
    inputs = df.columns.values
    # process one row at a time
    for row_index in tqdm(range(df.shape[0])):
        row_data = df.iloc[row_index]  # get the row
        for input_idx in range(len(inputs)):
            # for each column label, get the value
            val = row_data[inputs[input_idx]]
            # just assign it back to the same cell
            df.iloc[row_index, input_idx] = val
    print(f'Are they equal? {original.equals(df)}')

if __name__ == '__main__':
    test_speed()
I am getting this:
100%|██████████| 100/100 [00:04<00:00, 22.63it/s]
Are they equal? True
So, just 22 rows processed per second.
If I do the same processing iterating through the CSV with a DictReader, I get about 5000 rows per second. I am obviously missing something important about how to access cells in a DataFrame.
My objective is to traverse all cells of this 2D pivot table and assign each value (possibly altered) back in an efficient way.
How can I speed up this code?
UPDATE: if I do the processing before the pivot, it goes up to 1600 iterations / second. I suspect that every assignment rebuilds the indexes? I am never assigning values to the row / column I used to pivot, so this is surprising. Is there any way to keep the pivot and the performance?
Extract all the values from the DataFrame (df.values) and work on the NumPy array instead of the DataFrame. Less readable, but much, much faster.
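A minimal sketch of that approach, using a small synthetic frame in place of your pivoted table (the column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the pivoted table.
df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])

# Pull the whole block out as a NumPy array once...
arr = df.to_numpy()  # df.values also works

# ...loop over the plain array (cheap element access, no per-cell
# index lookups as with df.iloc)...
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        arr[i, j] = arr[i, j]  # your per-cell processing goes here

# ...and write the block back in a single assignment.
df.loc[:, :] = arr
```

The row and column labels of the pivot are untouched; only the cell values round-trip through NumPy.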
As others said in the comments, using vectorized operations will make the whole thing even faster by removing the need for Python for loops. However, I am not sure you need to improve performance further ("premature optimization is the root of all evil").
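For example, if the per-cell processing can be written as an arithmetic expression, one whole-frame operation replaces both loops (the transformation here is an arbitrary placeholder):

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [4.0, 5.0, 6.0]})

# One vectorized expression applied to every cell at once,
# instead of a Python loop over rows and columns.
df = df * 2 + 1
```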
Sometimes vectorization is hard or even impossible to express with the available pandas/numpy functions. In such cases it may be possible to use the numba package. If your code is numba-compatible, you should get a huge performance improvement.
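A sketch of the numba route: the per-cell loop stays, but it runs as compiled code. The fallback decorator below is only so the snippet runs where numba is not installed; the `scale_cells` function and the factor are made-up examples of "processing":

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):  # fallback: plain Python if numba is absent
        return func

@njit
def scale_cells(arr, factor):
    # Explicit loops are fine here: numba compiles them to machine code.
    out = np.empty_like(arr)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            out[i, j] = arr[i, j] * factor
    return out

values = np.arange(12, dtype=np.float64).reshape(3, 4)
result = scale_cells(values, 2.0)
```

You would extract the array from the DataFrame first (as above), run the jitted function on it, and assign the result back.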