Search code examples
pythonpandaspivot

Count unique value with prioritize value in pandas


I have a simple data frame as below:

import pandas as pd
import numpy as np
df = pd.DataFrame({'CUS_NO': ['900636229', '900636229', '900636080', '900636080', '900636052', '900636052', 
                              '900636053', '900636054', '900636055', '900636056'], 
                   'indicator': ['both', 'left_only', 'both', 'left_only', 'both', 'left_only', 
                                 'both', 'left_only', 'both', 'left_only'],
                   'Nationality': ['VN', 'VN', 'KR', 'KR', 'VN', 'VN', 
                                   'KR', 'VN', 'KR', 'VN']})

        CUS_NO      indicator   Nationality
0       900636229   both        VN
1       900636229   left_only   VN
2       900636080   both        KR
3       900636080   left_only   KR
4       900636052   both        VN
5       900636052   left_only   VN
6       900636053   both        KR
7       900636054   left_only   VN
8       900636055   both        KR
9       900636056   left_only   VN

I want to count unique value of CUS_NO so I used pd.Series.nunique by below code:

df2 = pd.pivot_table(df, values='CUS_NO', 
                     index='Nationality', 
                     columns='indicator', 
                     aggfunc=pd.Series.nunique, 
                     margins=True).reset_index()
df2

And here is the result:

indicator   Nationality both    left_only   All
0           KR          3       1           3
1           VN          2       4           4
2           All         5       5           7

But I my expectation is if CUS_NO was same and indicator was different, I just need to count both indicator. So below is my expected Output:

indicator   Nationality both    left_only   All
0           KR          3       0           3
1           VN          2       2           4
2           All         5       2           7

Thank you.


Solution

  • You can sort_values to have "both" on top (if more categories, use a Categorical to define a custom order), then drop_duplicates:

    tmp = (df
       .sort_values(by='indicator')
       .drop_duplicates(subset=['CUS_NO', 'Nationality'], keep='first')
    )
    
    df2 = pd.pivot_table(tmp, values='CUS_NO', 
                         index='Nationality', 
                         columns='indicator', 
                         aggfunc=pd.Series.nunique, 
                         margins=True,
                         fill_value=0).reset_index()
    

    Output:

    indicator Nationality  both  left_only  All
    0                  KR     3          0    3
    1                  VN     2          2    4
    2                 All     5          2    7
    

    Intermediate tmp:

          CUS_NO  indicator Nationality
    0  900636229       both          VN
    2  900636080       both          KR
    4  900636052       both          VN
    6  900636053       both          KR
    8  900636055       both          KR
    7  900636054  left_only          VN
    9  900636056  left_only          VN