Tags: python, pandas, append, concatenation

Using pandas .append within for loop


I am appending rows to a pandas DataFrame within a for loop, but at the end the DataFrame is always empty. I don't want to add the rows to an array and then call the DataFrame constructor, because my actual for loop handles lots of data. I also tried pd.concat without success. Could anyone point out what I am missing to make the append statement work? Here's a dummy example:

import pandas as pd
import numpy as np

data = pd.DataFrame([])

for i in np.arange(0, 4):
    if i % 2 == 0:
        data.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]), ignore_index=True)
    else:
        data.append(pd.DataFrame({'A': i}, index=[0]), ignore_index=True)

print data.head()

Empty DataFrame
Columns: []
Index: []
[Finished in 0.676s]

Solution

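  • append does not modify the DataFrame in place: every time you call it, Pandas returns a copy of the original dataframe plus your new row. Your loop discards that return value, which is why data is still empty at the end; assigning the result back (data = data.append(...)) makes it work. Because each call copies the whole dataframe, though, a loop of N appends is an O(N^2) operation (sometimes called quadratic copy) that will quickly become very slow, especially since you have lots of data.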

    In your case, I would recommend using lists, appending to them, and then calling the dataframe constructor.

    a_list = []
    b_list = []
    for data in my_data:
        a, b = process_data(data)
        a_list.append(a)
        b_list.append(b)
    df = pd.DataFrame({'A': a_list, 'B': b_list})
    del a_list, b_list
    

    Timings

    %%timeit
    data = pd.DataFrame([])
    for i in np.arange(0, 10000):
        if i % 2 == 0:
            data = data.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]), ignore_index=True)
        else:
            data = data.append(pd.DataFrame({'A': i}, index=[0]), ignore_index=True)
    1 loops, best of 3: 6.8 s per loop
    
    %%timeit
    a_list = []
    b_list = []
    for i in np.arange(0, 10000):
        if i % 2 == 0:
            a_list.append(i)
            b_list.append(i + 1)
        else:
            a_list.append(i)
            b_list.append(None)
    data = pd.DataFrame({'A': a_list, 'B': b_list})
    100 loops, best of 3: 8.54 ms per loop
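
For readers on current pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the timing code above only runs on older versions. The following is a minimal sketch of the same "collect first, build once" idea, using a list of row dicts instead of one list per column (that variation is my assumption, not part of the answer above). It also shows that pd.concat, like the old append, returns a new object and leaves its inputs untouched, which may be why the asker's pd.concat attempt appeared to do nothing. The names rows, extra and combined are made up for illustration.

```python
import pandas as pd

# Collect one small dict per iteration instead of growing a DataFrame.
# Keys missing from a dict (here 'B' on odd i) become NaN in the frame.
rows = []
for i in range(4):
    if i % 2 == 0:
        rows.append({'A': i, 'B': i + 1})
    else:
        rows.append({'A': i})

# Build the DataFrame once, after the loop.
data = pd.DataFrame(rows)

# pd.concat does not modify its inputs; it returns a new DataFrame.
extra = pd.DataFrame({'A': [4], 'B': [5]})
pd.concat([data, extra], ignore_index=True)             # result discarded, data unchanged
combined = pd.concat([data, extra], ignore_index=True)  # result kept

print(data)
print(len(data), len(combined))  # 4 5
```

If the per-iteration pieces are already DataFrames, the same pattern applies: append each piece to a Python list inside the loop and call pd.concat once on that list afterwards, rather than concatenating inside the loop.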