python pandas numpy performance

Most efficient way to create new rows in a DataFrame


I'm implementing a data augmentation script that takes a pandas DataFrame and a list of strings (e.g. variations) as input. For each existing row, the script should generate one new row per element of variations, with that element appended to the Compliment value.

For instance, having a DataFrame:

Compliment | Sentence_ID
Hi         | 1
Hello      | 2
Hola       | 3

And variations = ["Elvis", "Monica"]

The resulting DataFrame should look like this:

Compliment   | Sentence_ID
Hi           | 1
Hi Elvis     | 1
Hi Monica    | 1
Hello        | 2
Hello Elvis  | 2
Hello Monica | 2
Hola         | 3
Hola Elvis   | 3
Hola Monica  | 3

I ran some tests with DataFrame.iterrows(), but it is very slow (~5 minutes) when the DataFrame is large. Is there a faster option?


Solution

  • With pandas.DataFrame.explode:

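    # Replace each value with a list: the original plus one entry per variation
    df['Compliment'] = df['Compliment'].apply(lambda x: [x] + [f"{x} {v}" for v in variations])
    # Expand each list element into its own row (the original index is repeated)
    df = df.explode('Compliment')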
    

         Compliment  Sentence_ID
    0            Hi            1
    0      Hi Elvis            1
    0     Hi Monica            1
    1         Hello            2
    1   Hello Elvis            2
    1  Hello Monica            2
    2          Hola            3
    2    Hola Elvis            3
    2   Hola Monica            3
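
For reference, here is a self-contained sketch of the approach above, next to one possible alternative that avoids the per-row lambda by repeating rows and tiling the suffixes. The function names `augment_explode` and `augment_repeat` are hypothetical, and the alternative is an assumption of mine rather than part of the original answer. Whether it is actually faster depends on the data, so it is worth timing both on your own DataFrame.

```python
import numpy as np
import pandas as pd

# Example input taken from the question
df = pd.DataFrame({"Compliment": ["Hi", "Hello", "Hola"], "Sentence_ID": [1, 2, 3]})
variations = ["Elvis", "Monica"]


def augment_explode(df, variations):
    """explode-based approach, applied to a copy and returned with a fresh index."""
    out = df.copy()
    # Each cell becomes a list: the original value plus one entry per variation
    out["Compliment"] = out["Compliment"].apply(
        lambda x: [x] + [f"{x} {v}" for v in variations]
    )
    # One row per list element; reset_index drops the repeated index labels
    return out.explode("Compliment").reset_index(drop=True)


def augment_repeat(df, variations):
    """Hypothetical alternative: repeat each row, then add tiled suffixes."""
    # The empty suffix keeps the original, unmodified row
    suffixes = [""] + [f" {v}" for v in variations]
    # Repeat every row len(suffixes) times, keeping the repeats adjacent
    positions = np.repeat(np.arange(len(df)), len(suffixes))
    out = df.iloc[positions].reset_index(drop=True)
    # Cycle the suffixes once per original row and concatenate element-wise
    tiled = pd.Series(suffixes * len(df), index=out.index)
    out["Compliment"] = out["Compliment"] + tiled
    return out


print(augment_explode(df, variations))
print(augment_repeat(df, variations))
```

Both functions leave the input DataFrame untouched and return rows in the order shown in the question. If the original index labels matter, drop the `reset_index(drop=True)` call in the explode version.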