I have the following code:
import pandas as pd
data = {
'Col1': ['John 1', 'John 2', 'John 3', 'Kyle 1', 'Kyle 3', 'Kyle 2'],
'Col2': ['B', 'C', 'E', 'F', 'F', 'S'],
'Col3': ['1', '1', '1', '1', '1', '2']
}
df = pd.DataFrame(data)
print(df)
Which gives this dataframe in which the numbers in Col1 (column 1) repeat only twice (e.g. John 1 and Kyle 1):
Col1 Col2 Col3
0 John 1 B 1
1 John 2 C 1
2 John 3 E 1
3 Kyle 1 F 1
4 Kyle 3 F 1
5 Kyle 2 S 2
I want to make a function that iterates through the names of Col1 and duplicates the row with same number to be beside the original row. My desired output should be something like this:
Col1 Col2 Col3 New Col1 New Col2 New Col3
0 John 1 B 1 Kyle 1 F 1
1 John 2 C 1 Kyle 2 S 2
2 John 3 E 1 Kyle 3 F 1
3 Kyle 1 F 1 John 1 B 1
4 Kyle 3 F 1 John 3 E 1
5 Kyle 2 S 2 John 2 C 1
Notice how all the values in the original Kyle 1 (row 3) have been duplicated since the '1' matches with 'John 1' in row 0. I have tried using pd.concat within nested for loops but I can't get it to work.
Make self-merge on the numeric suffix of Col1
and filter out duplicate entries.
The column names are arranged with df.add_suffix
:
res = (df.merge(df.add_suffix('_new'),
on=df['Col1'].str.replace(r'\D+', '', regex=True), how='left')
.drop('key_0', axis=1))
res = res[res['Col1'] != res['Col1_new']]
print(res)
Col1 Col2 Col3 Col1_new Col2_new Col3_new
1 John 1 B 1 Kyle 1 F 1
3 John 2 C 1 Kyle 2 S 2
5 John 3 E 1 Kyle 3 F 1
6 Kyle 1 F 1 John 1 B 1
8 Kyle 3 F 1 John 3 E 1
10 Kyle 2 S 2 John 2 C 1