
Finding the Location of the Duplicate for Duplicated Columns in Pandas


I know I can find duplicate columns using:

df.T.duplicated()

What I'd like to know is the index of the column that a duplicate column is a duplicate of. For example, both C and D are duplicates of A below:

import pandas as pd

df = pd.DataFrame([[1, 0, 1, 1], [2, 0, 2, 2]], columns=['A', 'B', 'C', 'D'])

   A  B  C  D
0  1  0  1  1
1  2  0  2  2
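
For reference, df.T.duplicated() on this frame only flags C and D as duplicates, without saying which column they duplicate:

A    False
B    False
C     True
D     True
dtype: bool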

I'd like something like:

duplicate_index = pd.Series([None, None, 'A', 'A'], ['A', 'B', 'C', 'D'])

Solution

  • I don't know whether duplicated has an option to report the first occurrence with the same data. My idea is to use groupby and transform, such as:

    # Group identical rows of df.T (i.e. identical columns of df) and take
    # the label of the first column in each group.
    arr_first = (df.T.reset_index().groupby([col for col in df.T.columns])['index']
                    .transform(lambda x: x.iloc[0]).values)
    

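    To see why this works, df.T.reset_index() for the example frame looks like this; grouping by the value columns 0 and 1 puts A, C and D in the same group:

      index  0  1
    0     A  1  2
    1     B  0  0
    2     C  1  2
    3     D  1  2
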
    With your example, arr_first is equal to array(['A', 'B', 'A', 'A'], dtype=object). Because it is in the same order as df.columns, you can get the expected output with np.where:

    import numpy as np

    duplicate_index = pd.Series(np.where(arr_first != df.columns, arr_first, None), df.columns)
    

    and the result for duplicate_index is

    A    None
    B    None
    C       A
    D       A
    dtype: object
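
    Putting the pieces together, here is a minimal self-contained sketch. The helper name duplicate_of is just illustrative (not part of pandas), and it assumes 'index' is not already one of the row labels:

    import numpy as np
    import pandas as pd

    def duplicate_of(df):
        """Map each column label to the label of the first column with
        identical values, or None if the column is the first occurrence."""
        transposed = df.T.reset_index()
        value_cols = [c for c in transposed.columns if c != 'index']
        # Within each group of identical columns, take the first column label.
        first = (transposed.groupby(value_cols)['index']
                           .transform(lambda x: x.iloc[0]).values)
        return pd.Series(np.where(first != df.columns, first, None), index=df.columns)

    df = pd.DataFrame([[1, 0, 1, 1], [2, 0, 2, 2]], columns=['A', 'B', 'C', 'D'])
    print(duplicate_of(df))
    # A    None
    # B    None
    # C       A
    # D       A
    # dtype: object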