Many thanks for reading.
I have a pandas data frame which is the result of a concatenation of multiple smaller data frames. What I want to do is add multiple indicator columns to my final data frame, so that I can see what smaller data frame each row came from.
This would be my desired result:
Forename Surname Ind_1 Ind_2 Ind_3 Ind_4
jon smith 0 0 0 1
charlie jim 1 0 0 1
ian james 0 1 0 0
For example, "Jon Smith" came from data frame 4, and 'Charlie Jim" came from data frames 1 and 4 (duplicate rows).
I have been able to achieve this for rows that only came from one data frame (e.g. rows 1 and 3) but not for duplicate rows that came from multiple data frames (e.g. row 2).
Many thanks for any help.
You can use:
concat
with parameter key
s for identify DataFramesreset_index
for columns from MultiIndex
groupby
and join indicatorsstr.get_dummies
reindex
if need append 0 columns for missing categoriesreset_index
for columns from Index
df1 = pd.DataFrame({'Forename':['charlie'], 'Surname':['jim']})
df2 = pd.DataFrame({'Forename':['ian'], 'Surname':['james']})
df3 = pd.DataFrame()
df4 = pd.DataFrame({'Forename':['charlie', 'jon'], 'Surname':['jim', 'smith']})
#list of DataFrames
dfs = [df1, df2, df3, df4]
#generate indicators
inds = ['Ind_{}'.format(x+1) for x in range(len(dfs))]
df = (pd.concat(dfs, keys=inds)
.reset_index()
.groupby(['Forename','Surname'])['level_0']
.apply('|'.join)
.str.get_dummies()
.reindex(columns=inds, fill_value=0)
.reset_index())
print (df)
Forename Surname Ind_1 Ind_2 Ind_3 Ind_4
0 charlie jim 1 0 0 1
1 ian james 0 1 0 0
2 jon smith 0 0 0 1
More general solution with groupby
by all columns:
df = pd.concat(dfs, keys=inds)
print (df)
Forename Surname
Ind_1 0 charlie jim
Ind_2 0 ian james
Ind_4 0 charlie jim
1 jon smith
df1 =(df.reset_index()
.groupby(df.columns.tolist())['level_0']
.apply('|'.join)
.str.get_dummies()
.reindex(columns=inds, fill_value=0)
.reset_index())
print (df1)
Forename Surname Ind_1 Ind_2 Ind_3 Ind_4
0 charlie jim 1 0 0 1
1 ian james 0 1 0 0
2 jon smith 0 0 0 1