Search code examples
pythonpandasdataframeduplicates

Count the number of identical values ​in a DataFrame Python


I have a dataframe with two columns: hash and results.

The hash column contains the task ID, it can be repeated several times.

The result column contains the user's answer to this task. The answers contain specific string values ​​like: "Yes", "No", "Perhaps", 404"

My task is to calculate the consistency of user answers.

For example, if 3 users answered the same way to a task with one hash, then the consistency is 100%. If 2 out of 3 answered the same way, then 66.6. If one of 3, then 33.3%. If everyone answered differently, then 0.

I tried to calculate the maximum number of repeated answer values ​​for each hash via "duplicated", but this does not work when there are no repeated values.

Please tell me what functions can be used to solve this problem.

df for example:

df = pd.DataFrame({'hash': list('AAABBBCCCCDDD'),
               'results': ['Yes', 'No', 'No',
                           'Yes', 'Yes', 'Yes',
                           'Perhaps', 'Yes', '404', 'Perhaps',
                           'Yes', 'No', 'Perhaps',
                          ]
              })

Solution

  • Since you do not provide any reference code, I create a mini python script to simulate your code. This is what I used:

    First I create a function that can take the df and make all the operation

    def calculate_consistency(df):
    

    Then, first of all, I grouped all the records at the hash, and result columns:

    df.groupby('hash')['result'].value_counts().unstack(fill_value=0)
    
    • df.groupby('hash') this function creates a GroupBy object where all rows with the same 'hash' value are grouped together.
    • [result'] This selects only the 'result' column from each group. We're now working with groups of 'result' values for each unique 'hash'.
    • .value_counts() This method counts the occurrences of each unique value in the 'result' column
    • .unstack() This transforms the Series of Series (resulting from value_counts) into a DataFrame. Each unique 'result' value becomes a column. Each 'hash' becomes a row.
    • fill_value=0: This is an argument to unstack(). It fills any missing values (NaN) with 0.

    Then, calculate total responses for each hash

    total_responses = grouped.sum(axis=1)
    

    To find the maximum count for each hash

    max_counts = grouped.max(axis=1)
    

    And finally, calculate consistency percentage and return the consistency.

    consistency = (max_counts / total_responses) * 100
    return consistency
    

    The full code, with the example looks like this:

    import pandas as pd
    
    def calculate_consistency(df):
        grouped = df.groupby('hash')['result'].value_counts().unstack(fill_value=0)
        total_responses = grouped.sum(axis=1)
        max_counts = grouped.max(axis=1)
        consistency = (max_counts / total_responses) * 100
        return consistency
    
    
    df = pd.DataFrame({'hash': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'], 'result': ['Yes', 'Yes', 'No', 'Yes', 'No', 'Perhaps', '404', 'Yes']})
    
    result = calculate_consistency(df)
    print(result)
    

    Let me know if this is useful for you, good luck!