Search code examples
pythonpandaslistdataframedata-manipulation

How to combine multiple DataFrame rows into 1 with a column containing list values


I have a DataFrame like this:

| cat0  | cat1 | cat2 | col_list  | Quantity |
| ----- | ---- | ---- | ----------| -------- |
| alpha | x    | a    | [a,b,c,d] | 4        |
| alpha | y    | a    | [e]       | 1        |
| beta  | z    | a    | [f,g]     | 2        |
| gamma | p    | b    | [h]       | 1        |
| gamma | q    | b    | [I,j,k,l] | 4        |
| phi   | r    | c    | [r,s]     | 2        |
| eita  | s    | c    | [m,n]     | 2        |

I want it to transform it based on cat2 column. If they have same value then combine those rows into 1 and merge the rows like this:

| cat0       | cat1  | cat2 | col_list        | Quantity |
| ---------- | ----- | ---- | ----------------| -------- |
| alpha,beta | x,y,z | a    | [a,b,c,d,e,f,g] | 7        |
| gamma      | p,q   | b    | [h,I,j,k,l]     | 5        |
| phi,eita   | r,s   | c    | [r,s,m,n]       | 4        |

If column have strings values (column: cat0,cat1) join and separate them by a comma. If it has list values (column" col_list) then add those elements into a single list. If it is a int value (column: Quantity), then just add them.


Solution

  • You can use groupby.agg, with list/','.join, sum, or itertools.chain:

    from itertools import chain
    
    (df
     .groupby('cat2', as_index=False)
     .agg({'cat0': ','.join, 'cat1': ','.join,
           'col_list': lambda x: list(chain.from_iterable(x)),
           'Quantity': 'sum'
           })
     [df.columns] # reindex like original
    )
    

    NB. in place or chain, you can also use: lambda x: [v for l in x for x in l]

    output:

                   cat0   cat1 cat2               col_list  Quantity
    0  alpha,alpha,beta  x,y,z    a  [a, b, c, d, e, f, g]         7
    1       gamma,gamma    p,q    b        [h, I, j, k, l]         5
    2          phi,eita    r,s    c           [r, s, m, n]         4