Search code examples
python-3.xpandasdataframegroup-by

Pandas groupby create a list of values from several columns


I have a dataframe

df = ID  C1  C2  C3  C4
     1   a   b   r    q
     1   b   e   g    h
     2   p   a   z    p
     1   r   a   n    m

I want to get list of values in C1, C2 per ID. So I will have:

out = ID  l 
      1  [a,b,r,e]
      2  [p,a]

What is the best way to do so?


Solution

  • Use lambda function with numpy.unique - output is sorted:

    df1 = (df.groupby('ID')[['C1','C2']].apply(lambda x: np.unique(x).tolist())
            .reset_index(name='l'))
    print (df1)
       ID             l
    0   1  [a, b, e, r]
    1   2        [a, p]
    

    Or with unique - output is with original order:

    df1 = (df.groupby('ID')[['C1','C2']].apply(lambda x: pd.unique(np.ravel(x)).tolist())
            .reset_index(name='l'))
    print (df1)
       ID             l
    0   1  [a, b, e, r]
    1   2        [p, a]
    

    Another idea with remove missing values and duplicates:

    df1 = (df.melt(id_vars='ID', value_vars=['C1','C2'], value_name='l')
             .drop_duplicates(['ID','l'])
             .dropna(subset=['l'])
             .groupby('ID')['l']
             .agg(list)
             .reset_index())
    print (df1)
       ID             l
    0   1  [a, b, r, e]
    1   2        [p, a]
    

    EDIT: For separate columns use GroupBy.agg:

    df2 = df.groupby('ID')[['C1','C2']].agg(lambda x: np.unique(x).tolist()).reset_index()
    print (df2)
       ID         C1         C2
    0   1  [a, b, r]  [a, b, e]
    1   2        [p]        [a]