Search code examples
pandasdataframegroup-bydaskdask-dataframe

Groupby keeping all the information of a database


let's say I have a dataframe A:

a0 a1    a2
1  1980  Y04
1  1983  Y08
1  1990  Y08
1  1980  Y08
1  1998  Y10
2  2003  Y02
3  1970  Y04
3  2009  Y10
3  1995  Y34

what I would like to do (in dask, but any suggestion, in Pandas as well, is helpful) is to make a groupby a0 retaining all the unique information contained in a1 and a2. In other words I would like to obtain the following:

a0  newa_1                        new_a2
1  [1980, 1983, 1990, 1998]  [Y04, Y08, Y10]
2  [2003]                          [Y02]
3  [1970,2009, 1995]               [Y04]

Thank you


Solution

  • Using the unique function and reset the index to get a dataframe.

    >>> df.groupby("a0").agg(newa_1=('a1',"unique"), new_a2=("a2","unique")).reset_index()
    
        a0  newa_1                      new_a2
    0   1   [1980, 1983, 1990, 1998]    [Y04, Y08, Y10]
    1   2   [2003]                      [Y02]
    2   3   [1970, 2009, 1995]          [Y04, Y10, Y34]