Search code examples
pythonpandasconcatenationindicator

Add indicator to inform where the data came from Python


Many thanks for reading.

I have a pandas data frame which is the result of a concatenation of multiple smaller data frames. What I want to do is add multiple indicator columns to my final data frame, so that I can see what smaller data frame each row came from.

This would be my desired result:

Forename   Surname   Ind_1    Ind_2    Ind_3    Ind_4

jon          smith     0        0         0        1
charlie      jim       1        0         0        1
ian          james     0        1         0        0

For example, "Jon Smith" came from data frame 4, and 'Charlie Jim" came from data frames 1 and 4 (duplicate rows).

I have been able to achieve this for rows that only came from one data frame (e.g. rows 1 and 3) but not for duplicate rows that came from multiple data frames (e.g. row 2).

Many thanks for any help.


Solution

  • You can use:


    df1 = pd.DataFrame({'Forename':['charlie'], 'Surname':['jim']})
    df2 = pd.DataFrame({'Forename':['ian'], 'Surname':['james']})
    df3 = pd.DataFrame()
    df4 = pd.DataFrame({'Forename':['charlie', 'jon'], 'Surname':['jim', 'smith']})
    
    #list of DataFrames
    dfs = [df1, df2, df3, df4]
    #generate indicators
    inds = ['Ind_{}'.format(x+1) for x in range(len(dfs))]
    df = (pd.concat(dfs, keys=inds)
           .reset_index()
           .groupby(['Forename','Surname'])['level_0']
           .apply('|'.join)
           .str.get_dummies()
           .reindex(columns=inds, fill_value=0)
           .reset_index())
    
    print (df)
      Forename Surname  Ind_1  Ind_2  Ind_3  Ind_4
    0  charlie     jim      1      0      0      1
    1      ian   james      0      1      0      0
    2      jon   smith      0      0      0      1
    

    More general solution with groupby by all columns:

    df = pd.concat(dfs, keys=inds)
    print (df)
            Forename Surname
    Ind_1 0  charlie     jim
    Ind_2 0      ian   james
    Ind_4 0  charlie     jim
          1      jon   smith
    
    df1 =(df.reset_index()
           .groupby(df.columns.tolist())['level_0']
           .apply('|'.join)
           .str.get_dummies()
           .reindex(columns=inds, fill_value=0)
           .reset_index())
    
    print (df1)
      Forename Surname  Ind_1  Ind_2  Ind_3  Ind_4
    0  charlie     jim      1      0      0      1
    1      ian   james      0      1      0      0
    2      jon   smith      0      0      0      1