Search code examples
pythondataframelistjupyter-notebookspyder

How to return more than 1 of the same values in each row


I have a data frame with the same values in some of the rows. I would like to return the row number and values that has more than 1 of the same value starting from Num1 to Num7 in df.
Also would like to return the row number and same value. Either or I'm okay with.

For example

import numpy as np
import pandas as pd

df = pd.DataFrame([[1,1,2,4,5,6,7,7],
            [2,5,6,7,22,23,34,48],
            [3,3,5,6,7,45,46,48],
            [4,6,7,14,29,32,6,29], # duplicate 6 and 29
            [5,6,7,13,23,33,35,7], # duplicate 7 but nothing else
            [6,1,6,7,8,9,10,8],
            [7,0,2,5,7,19,7,5]], # duplicate 7,5 
            columns = ['Row_Num', 'Num1','Num2','Num3','Num4','Num5','Num6','Num7'])

First result

df2 = pd.DataFrame([[4,6,7,14,29,32], # Return 6 and 29
                   [7,0,2,5,7,19]], # Return 7,5 
            columns = ['Row_Num', 'Num1','Num2','Num3','Num4','Num5'])

Second result

df3 = pd.DataFrame([[4,6,29], # Return 6 and 29
                   [7,7,5]], # Return 7,5 
            columns = ['Row_Num', 'Num1','Num2'])

Solution

  • IIUC you can use pd.duplicated with a some data manipulation to get your result:

    df = df.set_index('Row_Num') # set index
    df_duplicated = df.transform(lambda x: x.duplicated(), axis=1) # returns if rows have duplicates
    
    # First result where 
    res1 = df[df_duplicated.sum(axis=1) >= 2][~df_duplicated[df_duplicated.sum(axis=1) >= 2]].dropna(axis=1)
    
    # second result
    res2 = df[df_duplicated.sum(axis=1) >= 2][df_duplicated[df_duplicated.sum(axis=1) >= 2]].dropna(axis=1)
    
    

    Output:

    result1

             Num1  Num2  Num3  Num4  Num5
    Row_Num                              
    4           6     7    14    29    32
    7           0     2     5     7    19
    
    

    result2:

             Num6  Num7
    Row_Num            
    4           6    29
    7           7     5
    
    

    To exactly match your output just reset_index and rename column names for the second result.