I want to find out if any of the value in columns mark1, mark2, mark3, mark4 and mark5 are the same, column-wise comparison from a dataframe below, and list result as True or False
import pandas as pd
df = pd.DataFrame(data=[[7, 2, 3, 7, 7], [3, 4, 3, 2, 7], [1, 6, 5, 2, 7], [5, 5, 6, 3, 1]],
columns=["mark1", "mark2", 'mark3', 'mark4', 'mark5'])
Ideal output:
mark1 mark2 mark3 mark4 mark5 result
0 7 2 3 7 7 True
1 3 4 3 2 7 True
2 1 6 5 2 7 False
3 5 5 6 3 1 True
So I came up with a func using nested forloop to compare each value in a column, does not work. AttributeError: 'Series' object has no attribute 'columns' What's the correct way? Avoid nested forloop by all means.
def compare_col(df):
check = 0
for i in range(len(df.columns.tolist())+1):
for j in range(1, len(df.columns.tolist())+1):
if df.iloc[i, i] == df.iloc[j, i]:
check += 1
if check >= 1:
return True
else:
return False
df['result'] = df.apply(lambda x: compare_col(x[['mark1', 'mark2', 'mark3', 'mark4', 'mark5]]), axis=1)
No need to use apply
or a loop, compare the output of nunique
to the number of columns:
df['result'] = df.nunique(axis=1).ne(df.shape[1])
Output:
mark1 mark2 mark3 mark4 mark5 result
0 7 2 3 7 7 True
1 3 4 3 2 7 True
2 1 6 5 2 7 False
3 5 5 6 3 1 True
If you want a more efficient method and assuming a reasonable number of columns (less than a thousand) and numbers, you could use numpy to sort
the values, compute the diff
and check whether any value is 0
:
import numpy as np
df['result'] = (np.diff(np.sort(df), axis=1)==0).any(axis=1)
Output:
mark1 mark2 mark3 mark4 mark5 result
0 7 2 3 7 7 True
1 3 4 3 2 7 True
2 1 6 5 2 7 False
3 5 5 6 3 1 True