How can I perform the following operations on a pandas dataframe?
Based on the following Stack Overflow questions and answers, I have made the attempted code below. The last attempt is close, but I do not know how to convert the set back into a string (i.e., remove the braces) and roll it into a lambda function that I can use applymap() for multiple columns.
Example Dataframe
id = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4]
colA = ['type12', 'type11', 'type11', 'type11', 'type21', 'type21',
'type22', 'type23', 'type23', 'type23', 'type31', 'type31',
'type31', 'type31', 'type41', 'type41', 'type42', 'type41',
'type41', 'type43'
]
colB = ['Set A', 'Set B', 'Set B', 'Set B', 'Set B', 'Set B', 'Set A',
'Set B', 'Set C', 'Set C', 'Set B', 'Set C', 'Set B', 'Set C',
'Set B', 'Set B', 'Set A', 'Set C', 'Set B', 'Set A'
]
colC = ['alpha', 'beta', 'delta', 'charlie', 'beta', 'delta', 'alpha',
'charlie', 'charlie', 'delta', 'delta', 'charlie', 'beta',
'delta', 'beta', 'charlie', 'alpha', 'charlie', 'delta', 'alpha'
]
df = pd.DataFrame(list(zip(id, colA, colB, colC)), columns =['id', 'colA', 'colB', 'colC'])
print(df)
id colA colB colC
0 1 type12 Set A alpha
1 1 type11 Set B beta
2 1 type11 Set B delta
3 1 type11 Set B charlie
4 2 type21 Set B beta
5 2 type21 Set B delta
6 2 type22 Set A alpha
7 2 type23 Set B charlie
8 2 type23 Set C charlie
9 2 type23 Set C delta
10 3 type31 Set B delta
11 3 type31 Set C charlie
12 3 type31 Set B beta
13 3 type31 Set C delta
14 4 type41 Set B beta
15 4 type41 Set B charlie
16 4 type42 Set A alpha
17 4 type41 Set C charlie
18 4 type41 Set B delta
19 4 type43 Set A alpha
Desired Output
id colA colB colC
1 type11 Set B beta, delta, charlie
1 type12 Set A alpha
2 type21 Set B beta, delta
2 type22 Set A alpha
2 type23 Set B, Set C charlie, delta
3 type31 Set B, Set C beta, delta, charlie
4 type41 Set B, Set C beta, delta, charlie
4 type42 Set A alpha
4 type43 Set A alpha
Attempt for one column only that returns a string but still has duplicates
df2 = df.groupby(['id', 'colA'])['colB'].apply(', '.join).reset_index()
print(df2)
id colA colB
0 1 type11 Set B, Set B, Set B
1 1 type12 Set A
2 2 type21 Set B, Set B
3 2 type22 Set A
4 2 type23 Set B, Set C, Set C
5 3 type31 Set B, Set C, Set B, Set C
6 4 type41 Set B, Set B, Set C, Set B
7 4 type42 Set A
8 4 type43 Set A
Attempt for one column only that removes duplicates but returns a set
df2 = df.groupby(['id', 'colA'])['colB'].apply(list).apply(set).reset_index()
print(df2)
id colA colB
0 1 type11 {Set B}
1 1 type12 {Set A}
2 2 type21 {Set B}
3 2 type22 {Set A}
4 2 type23 {Set B, Set C}
5 3 type31 {Set B, Set C}
6 4 type41 {Set B, Set C}
7 4 type42 {Set A}
8 4 type43 {Set A}
You can use a lambda
in your groupby. drop_duplicates
on the Series within the group then join the string. agg
will work on all columns that aren't your grouping columns, or specify a subset.
df.groupby(['id', 'colA']).agg(lambda x: ', '.join(x.drop_duplicates())).reset_index()
id colA colB colC
0 1 type11 Set B beta, delta, charlie
1 1 type12 Set A alpha
2 2 type21 Set B beta, delta
3 2 type22 Set A alpha
4 2 type23 Set B, Set C charlie, delta
5 3 type31 Set B, Set C delta, charlie, beta
6 4 type41 Set B, Set C beta, charlie, delta
7 4 type42 Set A alpha
8 4 type43 Set A alpha