Search code examples
pythonpandasdataframepandas-loc

Grouping by each value in a column of a dataframe in python


I have a dataframe with 7 columns, as follows:

Bank Name | Number | Firstname | Lastname | ID | Date1    | Date2
B1        | 1      | ABC       | EFG      | 12 | Somedate | Somedate
B2        | 2      | ABC       | EFG      | 12 | Somedate | Somedate
B1        | 1      | DEF       | EFG      | 12 | Somedate | Somedate
B3        | 3      | ABC       | GHI      | 13 | Somedate | Somedate
B4        | 4      | XYZ       | FHJ      | 13 | Somedate | Somedate
B5        | 5      | XYZ       | DFI      | 13 | Somedate | Somedate

I want to create a tuple with 4 elements for each ID, such that each tuple element represents (Bank Name, Number, Firstname, Lastname) for each ID and the values of these tuple elements is the count of the distinct elements present in that respective column for that ID. For eg: For ID = 12, the tuple should be (2, 2, 2, 1) and for ID=13, the tuple should be (3, 3, 2, 3)

I'm able to get all rows for a particular ID by doing the following:

print(df.loc[df['ID'] == '12'])

But, I do not know how to do this for each value in the ID column (much like the group by clause in SQL, and also get the count instead of the actual values in the rows.

Please help.


Solution

  • Using apply you could do

    In [117]: cols = ['BankName', 'Number', 'Firstname', 'Lastname']
    
    In [126]: df.groupby('ID')[cols].nunique().apply(tuple, axis=1)
    Out[126]:
    ID
    12    (2, 2, 2, 1)
    13    (3, 3, 2, 3)
    dtype: object
    

    or,

    In [127]: df.groupby('ID').apply(lambda x: tuple(x[c].nunique() for c in cols))
    Out[127]:
    ID
    12    (2, 2, 2, 1)
    13    (3, 3, 2, 3)
    dtype: object
    

    Or, if you want as dataframe instead of tuple

    In [122]: df.groupby('ID').agg({c: 'nunique' for c in cols})
    Out[122]:
        Lastname  Number  Firstname  BankName
    ID
    12         1       2          2         2
    13         3       3          2         3
    
    or,
    
    In [123]: df.groupby('ID')[cols].nunique()
    Out[123]:
        BankName  Number  Firstname  Lastname
    ID
    12         2       2          2         1
    13         3       3          2         3