I have dataframe with unique identifier assigned based on three columns i.e., [col2,col3,col3]
Dataframe1:
col1 col2 col3 col4 col5 unique_id
1 abc bcv zxc www.com 8
2 bcd qwe rty www.@com 12
3 klp oiu ytr www.io 15
4 zxc qwe rty www.com 6
After data preprocessing, will import Dataframe_2 with same column values as shown above but without unique_id. Dataframe_2 rows must be assigned with unique identifier based on col2,col3,col4 and by referring to the Dataframe1.
If Dataframe_2 has new row which is not present in Dataframe1, then assign new identifier.
Dataframe_2:
col1 col2 col3 col4 col5
1 bcd qwe rty www.@com
2 zxc qwe rty www.com
3 abc bcv zxc www.com
4 kph hir mat www.com
Expected Dataframe_2:
col1 col2 col3 col4 col5 unique_id
1 bcd qwe rty www.@com 12
2 zxc qwe rty www.com 6
3 abc bcv zxc www.com 8
4 kph hir mat www.com 35
Since Row4 is not present in Dataframe1, a new unique identifier is assigned.
# assign the old unique_id
df2n = df2.join(df1.set_index(['col2', 'col3', 'col4', 'col5'])[['unique_id']],
on=['col2', 'col3', 'col4', 'col5'], how='left')
# assign new unique_id with max df1.unique_id + 1
id_max = df1.unique_id.max() + 1
null_num = df2n['unique_id'].isnull().sum()
cond = df2n['unique_id'].isnull()
df2n.loc[cond,'unique_id'] = range(id_max, id_max + null_num)
df2n['unique_id'] = df2n['unique_id'].astype(int)
print(df2n)
col1 col2 col3 col4 col5 unique_id
0 1 bcd qwe rty www.@com 12
1 2 zxc qwe rty www.com 6
2 3 abc bcv zxc www.com 8
3 4 kph hir mat www.com 16