Search code examples
pythonpandasapplypython-applymap

Pandas dataframe lambda function/applymap to combine multiple rows in a column and remove duplicates


How can I perform the following operations on a pandas dataframe?

  1. combine text from one column, multiple rows into one row
  2. remove duplicates in the "one row"
  3. repeat 1 & 2 for multiple columns

Based on the following Stack Overflow questions and answers, I have made the attempted code below. The last attempt is close, but I do not know how to convert the set back into a string (i.e., remove the braces) and roll it into a lambda function that I can use applymap() for multiple columns.

  1. How to combine multiple rows into a single row with pandas [duplicate]
  2. Concatenate strings from several rows using Pandas groupby
  3. Remove duplicates from rows and columns (cell) in a dataframe, python

Example Dataframe

id = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4]
colA = ['type12', 'type11', 'type11', 'type11', 'type21', 'type21', 
        'type22', 'type23', 'type23', 'type23', 'type31', 'type31', 
        'type31', 'type31', 'type41', 'type41', 'type42', 'type41', 
        'type41', 'type43'
        ]
colB = ['Set A', 'Set B', 'Set B', 'Set B', 'Set B', 'Set B', 'Set A', 
        'Set B', 'Set C', 'Set C', 'Set B', 'Set C', 'Set B', 'Set C', 
        'Set B', 'Set B', 'Set A', 'Set C', 'Set B', 'Set A'
        ]
colC = ['alpha', 'beta', 'delta', 'charlie', 'beta', 'delta', 'alpha', 
        'charlie', 'charlie', 'delta', 'delta', 'charlie', 'beta', 
        'delta', 'beta', 'charlie', 'alpha', 'charlie', 'delta', 'alpha'
        ]
df = pd.DataFrame(list(zip(id, colA, colB, colC)), columns =['id', 'colA', 'colB', 'colC'])
print(df)

    id    colA   colB     colC
0    1  type12  Set A    alpha
1    1  type11  Set B     beta
2    1  type11  Set B    delta
3    1  type11  Set B  charlie
4    2  type21  Set B     beta
5    2  type21  Set B    delta
6    2  type22  Set A    alpha
7    2  type23  Set B  charlie
8    2  type23  Set C  charlie
9    2  type23  Set C    delta
10   3  type31  Set B    delta
11   3  type31  Set C  charlie
12   3  type31  Set B     beta
13   3  type31  Set C    delta
14   4  type41  Set B     beta
15   4  type41  Set B  charlie
16   4  type42  Set A    alpha
17   4  type41  Set C  charlie
18   4  type41  Set B    delta
19   4  type43  Set A    alpha

Desired Output

id  colA    colB           colC
1   type11  Set B          beta, delta, charlie
1   type12  Set A          alpha
2   type21  Set B          beta, delta
2   type22  Set A          alpha
2   type23  Set B, Set C   charlie, delta
3   type31  Set B, Set C   beta, delta, charlie
4   type41  Set B, Set C   beta, delta, charlie
4   type42  Set A          alpha
4   type43  Set A          alpha

Attempt for one column only that returns a string but still has duplicates

df2 = df.groupby(['id', 'colA'])['colB'].apply(', '.join).reset_index()
print(df2)
   id    colA                        colB
0   1  type11         Set B, Set B, Set B
1   1  type12                       Set A
2   2  type21                Set B, Set B
3   2  type22                       Set A
4   2  type23         Set B, Set C, Set C
5   3  type31  Set B, Set C, Set B, Set C
6   4  type41  Set B, Set B, Set C, Set B
7   4  type42                       Set A
8   4  type43                       Set A

Attempt for one column only that removes duplicates but returns a set

df2 = df.groupby(['id', 'colA'])['colB'].apply(list).apply(set).reset_index()
print(df2)
   id    colA            colB
0   1  type11         {Set B}
1   1  type12         {Set A}
2   2  type21         {Set B}
3   2  type22         {Set A}
4   2  type23  {Set B, Set C}
5   3  type31  {Set B, Set C}
6   4  type41  {Set B, Set C}
7   4  type42         {Set A}
8   4  type43         {Set A}

Solution

  • You can use a lambda in your groupby. drop_duplicates on the Series within the group then join the string. agg will work on all columns that aren't your grouping columns, or specify a subset.

    df.groupby(['id', 'colA']).agg(lambda x: ', '.join(x.drop_duplicates())).reset_index()
    

       id    colA          colB                  colC
    0   1  type11         Set B  beta, delta, charlie
    1   1  type12         Set A                 alpha
    2   2  type21         Set B           beta, delta
    3   2  type22         Set A                 alpha
    4   2  type23  Set B, Set C        charlie, delta
    5   3  type31  Set B, Set C  delta, charlie, beta
    6   4  type41  Set B, Set C  beta, charlie, delta
    7   4  type42         Set A                 alpha
    8   4  type43         Set A                 alpha