Search code examples
pythonpandasgroup-bycountunique

Count of unique records by multiple columns in a pandas df


I am trying to return a total count of unique records in a pandas df. This is determined by the values in Columns ['D','F']. So the there's specific places in column D. Column F displays how many events occur at those places. There can be multiple functions for each event but I'm only interested in how many total events there are at each place. So how many counts in Column F for each place in Column D

So the output for example below would be

ABC-Thu = 2

DEF-Thu = 1

import pandas as pd


d = ({
    'C' : ['08:00:00','XX','08:10:00','XX','08:41:42','XX','08:50:00','XX', '09:00:00', 'XX','09:15:00','XX','09:21:00','XX','09:30:00','XX','09:40:00','XX'],
    'D' : ['ABC-Thu','--','ABC-Thu','--','DEF-Thu','--','ABC-Thu','--','DEF-Thu','--','ABC-Thu','--','ABC-Thu','--','DEF-Thu','--','ABC-Thu','--'],
    'E' : ['Num:','','Num:','','Num:','','Num:','','Num:', '','Num:','','Num:','','Num:', '','Num:', ''],
    'F' : ['1','','1','','1','','1','','1', '','2','','2','','1', '','2',''],   
    'A' : ['A','','A','','A','','A','','A','','A','','A','','A','','A',''],           
    'B' : ['Stop','','Res','','Stop','','Start','','Res','','Stop','','Res','','Start','','Start','']
})

df = pd.DataFrame(data=d)

Output:

    A      B         C        D     E  F
0   A   Stop  08:00:00  ABC-Thu  Num:  1
1                   XX       --         
2   A    Res  08:10:00  ABC-Thu  Num:  1
3                   XX       --         
4   A   Stop  08:41:42  DEF-Thu  Num:  1
5                   XX       --         
6   A  Start  08:50:00  ABC-Thu  Num:  1
7                   XX       --         
8   A    Res  09:00:00  DEF-Thu  Num:  1
9                   XX       --         
10  A   Stop  09:15:00  ABC-Thu  Num:  2
11                  XX       --         
12  A    Res  09:21:00  ABC-Thu  Num:  2
13                  XX       --         
14  A  Start  09:30:00  DEF-Thu  Num:  1
15                  XX       --         
16  A  Start  09:40:00  ABC-Thu  Num:  2
17                  XX       --       

If I run this using the code below I get 4 because of the -- in Column D

df1 = df.groupby(['F', 'D']).ngroups

Output:

4

I was also wondering if there was an easier way to determine the total count for each place for column D, F. Rather then the overall count. This code is just an example. My dataset can have over 50 different places for column D with up 10 events for Column F. To add further constraints the places change for each file. So I need a quick way to determine the count for each meeting.

Is there a more efficient way than doing a count place by place. I'm not sure if I'm overthinking this or not. What's also confusing is the functions in Column B can vary. It's not always in a repeating order.


Solution

  • groupby and nunique

    df.groupby('D').F.nunique().drop('--')
    
    ABC-Thu    2
    DEF-Thu    1
    Name: F, dtype: int64