I'm implementing a data augmentation script that takes a pandas DataFrame and a list of strings (e.g. variations) as input. The script should generate new rows for the DataFrame, where each new row appends one element of variations to the original text.
For instance, having a DataFrame:
Compliment | Sentence_ID
Hi | 1
Hello | 2
Hola | 3
And variations = ["Elvis", "Monica"].
The resulting DataFrame should look like this:
Compliment | Sentence_ID
Hi | 1
Hi Elvis | 1
Hi Monica | 1
Hello | 2
Hello Elvis | 2
Hello Monica | 2
Hola | 3
Hola Elvis | 3
Hola Monica | 3
I ran some tests with DataFrame.iterrows(), but it is very slow (~5 minutes) when the DataFrame is large. Is there a more efficient option?
With pandas.DataFrame.explode:
df['Compliment'] = df['Compliment'].apply(lambda x: [x] + [f"{x} {v}" for v in variations])
df = df.explode('Compliment')
Compliment Sentence_ID
0 Hi 1
0 Hi Elvis 1
0 Hi Monica 1
1 Hello 2
1 Hello Elvis 2
1 Hello Monica 2
2 Hola 3
2 Hola Elvis 3
2 Hola Monica 3
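Note that explode repeats the original index (0, 0, 0, 1, ...), so chain .reset_index(drop=True) if you need a fresh one. As a possible alternative that avoids the per-row lambda, here is a minimal sketch using a cross join. It assumes pandas 1.2 or later (for how="cross"); the helper column name suffix and the result name augmented are made up for illustration. I haven't benchmarked it against explode on your data, so treat it as something to try rather than a guaranteed speedup.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({"Compliment": ["Hi", "Hello", "Hola"],
                   "Sentence_ID": [1, 2, 3]})
variations = ["Elvis", "Monica"]

# One suffix per output row; the empty string keeps the original text.
# "suffix" is a hypothetical helper column name.
suffixes = pd.DataFrame({"suffix": [""] + [f" {v}" for v in variations]})

# Pair every row of df with every suffix (requires pandas >= 1.2).
augmented = df.merge(suffixes, how="cross")

# Vectorized string concatenation, then drop the helper column.
augmented["Compliment"] = augmented["Compliment"] + augmented["suffix"]
augmented = augmented.drop(columns="suffix")

print(augmented)
```

The merge produces a new default integer index, so no reset_index is needed here.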