Add indicator to inform where the data came from Python

Many thanks for reading.

I have a pandas data frame which is the result of a concatenation of multiple smaller data frames. What I want to do is add multiple indicator columns to my final data frame, so that I can see what smaller data frame each row came from.

This would be my desired result:

Forename   Surname   Ind_1    Ind_2    Ind_3    Ind_4

jon          smith     0        0         0        1
charlie      jim       1        0         0        1
ian          james     0        1         0        0

For example, "Jon Smith" came from data frame 4, and 'Charlie Jim" came from data frames 1 and 4 (duplicate rows).

I have been able to achieve this for rows that only came from one data frame (e.g. rows 1 and 3) but not for duplicate rows that came from multiple data frames (e.g. row 2).

Many thanks for any help.

Solution

You can use:

first concat with parameter keys for identify DataFrames
reset_index for columns from MultiIndex
groupby and join indicators
create indicators by str.get_dummies
reindex if need append 0 columns for missing categories
reset_index for columns from Index

df1 = pd.DataFrame({'Forename':['charlie'], 'Surname':['jim']})
df2 = pd.DataFrame({'Forename':['ian'], 'Surname':['james']})
df3 = pd.DataFrame()
df4 = pd.DataFrame({'Forename':['charlie', 'jon'], 'Surname':['jim', 'smith']})

#list of DataFrames
dfs = [df1, df2, df3, df4]
#generate indicators
inds = ['Ind_{}'.format(x+1) for x in range(len(dfs))]
df = (pd.concat(dfs, keys=inds)
       .reset_index()
       .groupby(['Forename','Surname'])['level_0']
       .apply('|'.join)
       .str.get_dummies()
       .reindex(columns=inds, fill_value=0)
       .reset_index())

print (df)
  Forename Surname  Ind_1  Ind_2  Ind_3  Ind_4
0  charlie     jim      1      0      0      1
1      ian   james      0      1      0      0
2      jon   smith      0      0      0      1

More general solution with groupby by all columns:

df = pd.concat(dfs, keys=inds)
print (df)
        Forename Surname
Ind_1 0  charlie     jim
Ind_2 0      ian   james
Ind_4 0  charlie     jim
      1      jon   smith

df1 =(df.reset_index()
       .groupby(df.columns.tolist())['level_0']
       .apply('|'.join)
       .str.get_dummies()
       .reindex(columns=inds, fill_value=0)
       .reset_index())

print (df1)
  Forename Surname  Ind_1  Ind_2  Ind_3  Ind_4
0  charlie     jim      1      0      0      1
1      ian   james      0      1      0      0
2      jon   smith      0      0      0      1