Search code examples
pythonpandasdataframe

Efficient way to add rows to dataframe


From this question and others it seems that it is not recommended to use concat or append to build a pandas dataframe because it is recopying the whole dataframe each time.

My project involves retrieving a small amount of data every 30 seconds. This might run for a 3 day weekend, so someone could easily expect over 8000 rows to be created one row at a time. What would be the most efficient way to add rows to this dataframe?


Solution

  • Editing the chosen answer here since it was completely mistaken. What follows is an explanation of why you should not use setting with enlargement. "Setting with enlargement" is actually worse than append.

    The tl;dr here is that there is no efficient way to do this with a DataFrame, so if you need speed you should use another data structure instead. See other answers for better solutions.

    More on setting with enlargement

    You can add rows to a DataFrame in-place using loc on a non-existent index, but that also performs a copy of all of the data (see this discussion). Here's how it would look, from the Pandas documentation:

    In [119]: dfi
    Out[119]: 
       A  B  C
    0  0  1  0
    1  2  3  2
    2  4  5  4
    
    In [120]: dfi.loc[3] = 5
    
    In [121]: dfi
    Out[121]: 
       A  B  C
    0  0  1  0
    1  2  3  2
    2  4  5  4
    3  5  5  5
    

    For something like the use case described, setting with enlargement actually takes 50% longer than append:

    With append(), 8000 rows took 6.59s (0.8ms per row)

    %%timeit df = pd.DataFrame(columns=["A", "B", "C"]); new_row = pd.Series({"A": 4, "B": 4, "C": 4})
    for i in range(8000):
        df = df.append(new_row, ignore_index=True)
    
    # 6.59 s ± 53.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    With .loc(), 8000 rows took 10s (1.25ms per row)

    %%timeit df = pd.DataFrame(columns=["A", "B", "C"]); new_row = pd.Series({"A": 4, "B": 4, "C": 4})
    for i in range(8000):
        df.loc[i] = new_row
    
    # 10.2 s ± 148 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    What about a longer DataFrame?

    As with all profiling in data-oriented code, YMMV and you should test this for your use case. One characteristic of the copy-on-write behavior of append and "setting with enlargement" is that it will get slower and slower with large DataFrames:

    %%timeit df = pd.DataFrame(columns=["A", "B", "C"]); new_row = pd.Series({"A": 4, "B": 4, "C": 4})
    for i in range(16000):
        df.loc[i] = new_row
    
    # 23.7 s ± 286 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Building a 16k row DataFrame with this method takes 2.3x longer than 8k rows.