Search code examples
pythonpandaspivot-tablepercentagesubtotal

How to add percentage by each category in Python pivot table?


I am working on a survey and the data looks like this:

ID    Q1    Q2    Q3    Gender    Age    Dep    Ethnicity
001   Y      N    Y      F         22     IT        W
002   N      Y    Y      M         35     HR        W
003   Y      N    N      F         20     IT        A
004   Y      N    Y      M         54     OPRE      B
005   Y      N    Y      M         42     OPRE      B

Now, I'd like to add two indexes Dep and Gender to create a table like:

Question   Dep   Response  ID    % response
     Q1     IT    Y        2     100
            IT    N        0     0
            HR    Y        0     0    
            HR    N        1     100  
            OPRE  Y        2     100 
            OPRE  N        0     0  
     Q2     IT    Y        0     0    
            IT    N        2     100
            HR    Y        1     100 
            HR    N        0     0 
            OPRE  Y        0     0
            OPRE  N        2     100 
     Q3 ......

My codes are like this:

df['% response'] = df['ID']/df['ID'].sum()

Which gives me

Question   Dep   Response  ID    % response
     Q1     IT    Y        2     20
            IT    N        0     0
            HR    Y        0     0    
            HR    N        1     10  
            OPRE  Y        2     20
            OPRE  N        0     0  
     Q2     IT    Y        0     0    
            IT    N        2     20
            HR    Y        1     10
            HR    N        0     0 
            OPRE  Y        0     0
            OPRE  N        2     20
     Q3 ......

I think the denominator is wrong. It should be grouped by the question and Dep and then do the count instead of sum all the IDs. Does anyone can help?


Solution

  • With your initial dataframe:

    import pandas as pd
    
    df = pd.DataFrame(
        {
            "ID": [1, 2, 3, 4, 5],
            "Q1": ["Y", "N", "Y", "Y", "Y"],
            "Q2": ["N", "Y", "N", "N", "N"],
            "Q3": ["Y", "Y", "N", "Y", "Y"],
            "Gender": ["F", "M", "F", "M", "M"],
            "Age": [22, 35, 20, 54, 42],
            "Dep": ["IT", "HR", "IT", "OPRE", "OPRE"],
            "Ethnicity": ["W", "W", "A", "B", "B"],
        }
    )
    

    Here is one way to do it:

    # Reshape the dataframe
    df = (
        pd.concat(
            [
                df[[col, "Dep", "ID"]]
                .rename(columns={col: "Response"})
                .assign(Question=col)
                for col in ("Q1", "Q2", "Q3")
            ]
        )
        .groupby(["Question", "Dep", "Response"])
        .count()
    )
    
    # Add 'zero' Y/N rows
    for question in ("Q1", "Q2", "Q3"):
        for dep in ("HR", "IT", "OPRE"):
            for response in ("Y", "N"):
                df = pd.concat([df, pd.DataFrame({(question, dep, response): {"ID": 0}}).T])
    
    # Remove duplicated rows and reindex
    df = df[~df.index.duplicated(keep="first")].sort_index(ascending=[True, True, False])
    
    # Add percentage column
    for question in ("Q1", "Q2", "Q3"):
        for dep in ("HR", "IT", "OPRE"):
            for response in ("Y", "N"):
                try:
                    df.loc[(question, dep, response), "% response"] = (
                        100
                        * df.loc[(question, dep, response), "ID"]
                        / df.loc[(question, dep)].sum().values[0]
                    )
                except KeyError:
                    pass
    
    # Cleanup
    df = df.astype(int)
    

    So that:

    print(df)
    # Output
                            ID  % response
    Question Dep  Response
    Q1       HR   Y          0           0
                  N          1         100
             IT   Y          2         100
                  N          0           0
             OPRE Y          2         100
                  N          0           0
    Q2       HR   Y          1         100
                  N          0           0
             IT   Y          0           0
                  N          2         100
             OPRE Y          0           0
                  N          2         100
    Q3       HR   Y          1         100
                  N          0           0
             IT   Y          1          50
                  N          1          50
             OPRE Y          2         100
                  N          0           0