I am trying to concatenate two rows based on a categorical variable when the categories of the variable in the consecutive rows are same. Here is my data below for example :
SNo user Text
0 1 Sam Hello
1 1 John Hi
2 1 Sam How are you?
3 1 John I am good
4 1 John How about you?
5 1 John How is it going?
6 1 Sam Its going good
7 1 Sam Thanks
8 2 Mary Howdy?
9 2 Jake Hey!!
10 2 Jake What a surprise
11 2 Mary Good to see you here :)
12 2 Jake Ha ha. Hectic life
13 2 Mary I know right..
14 2 Mary How's Amy doing?
15 2 Mary How are the kids?
16 2 Jake All is good! :)
Here, if my previous value of user
column is same as my current value of user
column but different from the next value in that column, then, I'd concatenate the values from the column Text
for that user. I need to do this until there are no more multiple occurrence of that particular user. A sample output is given below :
SNo user Text
1 Sam Hello
1 John Hi
1 Sam How are you?
1 John I am good-How about you?-How is it going?
1 Sam Its going good-Thanks
2 Mary Howdy?
2 Jake Hey!!-What a surprise
2 Mary Good to see you here :)
2 Jake Ha ha. Hectic life
2 Mary I know right..-How's Amy doing?-How are the kids?
2 Jake All is good! :)
I tried using df.groupby()
and then .agg()
to get the concatenation done but unable to apply the above mentioned condition over it. Hence the output is combining all occurrences of an user for a chat.
df = sample_data.groupby(["SNo","user"]).agg({'Text': '-'.join}).reset_index() # incorrect though
df
Moreover, I am trying to avoid for
loop like a plague and trying out a vectorised solution.
Sample data :
data_dict = {'S. No.': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2], 'user': ['Sam', 'John', 'Sam', 'John', 'John', 'John', 'Sam', 'Sam', 'Mary', 'Jake', 'Jake', 'Mary', 'Jake ', 'Mary', 'Mary', 'Mary', 'Jake'], 'Text': ['Hello', 'Hi', 'How are you?', 'I am good', 'How about you?', 'How is it going?', 'Its going good', 'Thanks', 'Howdy?', 'Hey!!', 'What a surprise', 'Good to see you here :)', 'Ha ha. Hectic life', 'I know right..', "How's Amy doing?", 'How are the kids?', 'All is good! :)']}
sample_data = pd.DataFrame(data_dict)
You want to compare user
with its shift
and cumsum
for changes. Then you can groupby:
blocks = df['user'].ne(df['user'].shift()).cumsum()
(df.groupby(['SNo', blocks])
.agg({'user':'first','Text': '-'.join})
.reset_index('user', drop=True)
)
Output:
user Text
SNo
1 Sam Hello
1 John Hi
1 Sam How are you?
1 John I am good-How about you?-How is it going?
1 Sam Its going good-Thanks
2 Mary Howdy?
2 Jake Hey!!-What a surprise
2 Mary Good to see you here :)
2 Jake Ha ha. Hectic life
2 Mary I know right..-How's Amy doing?-How are the kids?
2 Jake All is good! :)