Search code examples
pythonpandasmultiple-columns

Pandas compare values of multiple columns


I want to find out if any of the value in columns mark1, mark2, mark3, mark4 and mark5 are the same, column-wise comparison from a dataframe below, and list result as True or False

import pandas as pd

df = pd.DataFrame(data=[[7, 2, 3, 7, 7], [3, 4, 3, 2, 7], [1, 6, 5, 2, 7], [5, 5, 6, 3, 1]],
                  columns=["mark1", "mark2", 'mark3', 'mark4', 'mark5'])

Ideal output:

   mark1  mark2  mark3  mark4  mark5          result
0      7      2      3      7      7            True
1      3      4      3      2      7            True
2      1      6      5      2      7           False
3      5      5      6      3      1            True

So I came up with a func using nested forloop to compare each value in a column, does not work. AttributeError: 'Series' object has no attribute 'columns' What's the correct way? Avoid nested forloop by all means.

def compare_col(df):
    check = 0
    for i in range(len(df.columns.tolist())+1):
        for j in range(1, len(df.columns.tolist())+1):
            if df.iloc[i, i] == df.iloc[j, i]:
                check += 1
    if check >= 1:
        return True
    else:
        return False

df['result'] = df.apply(lambda x: compare_col(x[['mark1', 'mark2', 'mark3', 'mark4', 'mark5]]), axis=1)

Solution

  • No need to use apply or a loop, compare the output of nunique to the number of columns:

    df['result'] = df.nunique(axis=1).ne(df.shape[1])
    

    Output:

       mark1  mark2  mark3  mark4  mark5  result
    0      7      2      3      7      7    True
    1      3      4      3      2      7    True
    2      1      6      5      2      7   False
    3      5      5      6      3      1    True
    

    If you want a more efficient method and assuming a reasonable number of columns (less than a thousand) and numbers, you could use to sort the values, compute the diff and check whether any value is 0:

    import numpy as np
    
    df['result'] = (np.diff(np.sort(df), axis=1)==0).any(axis=1)
    

    Output:

       mark1  mark2  mark3  mark4  mark5  result
    0      7      2      3      7      7    True
    1      3      4      3      2      7    True
    2      1      6      5      2      7   False
    3      5      5      6      3      1    True