Search code examples
pythonpandasdataframesummarize

python pandas summarizing nominal variables (counting)


I have following data frame:

KEY PROD PARAMETER Y/N
1    AAA    PARAM1   Y
1    AAA    PARAM2   N
1    AAA    PARAM3   N
2    AAA    PARAM1   N
2    AAA    PARAM2   Y
2    AAA    PARAM3   Y
3    CCC    PARAM1   Y
3    CCC    PARAM2   Y
3    CCC    PARAM3   Y

I am interested in summarizing Y/N column values by PROD and PARAMETER columns and get the following output:

PROD  PARAM Y N
 AAA PARAM1 1 1
 AAA PARAM2 1 1
 AAA PARAM3 1 1
 CCC PARAM1 1 0
 CCC PARAM2 1 0
 CCC PARAM3 1 0

While Y and N values are counts of Y/N column values from the original data frame.


Solution

  • You could use pivot_table by creating an additional column with the value 1 as it doesn't matter either ways (You are only counting them)

    df['Y/Ncount'] = 1
    
    df = df.pivot_table(index=['PROD', 'PARAMETER'], columns=['Y/N'], values=['Y/Ncount'], 
                        aggfunc=sum, fill_value=0)
    
    df.columns = [col for col in df.columns.get_level_values(1)]
    df.reset_index()
    

    Image


    The simplest operation to use under this scenario would be crosstab which would produce the frequency counts of values present inside the Y/N column:

    pd.crosstab([df['PROD'], df['PARAMETER']], df['Y/N'])
    

    Image