I have a data frame with the same values in some of the rows. I would like to return the row number and values that has more than 1 of the same value starting from Num1 to Num7 in df.
Also would like to return the row number and same value. Either or I'm okay with.
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,1,2,4,5,6,7,7],
[2,5,6,7,22,23,34,48],
[3,3,5,6,7,45,46,48],
[4,6,7,14,29,32,6,29], # duplicate 6 and 29
[5,6,7,13,23,33,35,7], # duplicate 7 but nothing else
[6,1,6,7,8,9,10,8],
[7,0,2,5,7,19,7,5]], # duplicate 7,5
columns = ['Row_Num', 'Num1','Num2','Num3','Num4','Num5','Num6','Num7'])
df2 = pd.DataFrame([[4,6,7,14,29,32], # Return 6 and 29
[7,0,2,5,7,19]], # Return 7,5
columns = ['Row_Num', 'Num1','Num2','Num3','Num4','Num5'])
df3 = pd.DataFrame([[4,6,29], # Return 6 and 29
[7,7,5]], # Return 7,5
columns = ['Row_Num', 'Num1','Num2'])
IIUC you can use pd.duplicated with a some data manipulation to get your result:
df = df.set_index('Row_Num') # set index
df_duplicated = df.transform(lambda x: x.duplicated(), axis=1) # returns if rows have duplicates
# First result where
res1 = df[df_duplicated.sum(axis=1) >= 2][~df_duplicated[df_duplicated.sum(axis=1) >= 2]].dropna(axis=1)
# second result
res2 = df[df_duplicated.sum(axis=1) >= 2][df_duplicated[df_duplicated.sum(axis=1) >= 2]].dropna(axis=1)
Output:
result1
Num1 Num2 Num3 Num4 Num5
Row_Num
4 6 7 14 29 32
7 0 2 5 7 19
result2
:
Num6 Num7
Row_Num
4 6 29
7 7 5
To exactly match your output just reset_index
and rename column names for the second result.