I have two dataframes. Both have the same index and same column labels. I want to compare them by substring and cell location.
data_1 = {'column A': ['AA (1)', 'nan', 'nan', 'DD (6)', 'nan'],
'column B': ['AA (8)', 'nan', 'CC (3)', 'nan', 'nan'],
'column C': ['AA (9)', 'nan', 'CC (7)', 'DD (5)', 'nan']}
df_1 = pd.DataFrame(data_1)
data_2 = {'column A': ['AA', 'BB', 'nan', 'nan', 'nan'],
'column B': ['nan', 'nan', 'CC', 'nan', 'EE'],
'column C': ['nan', 'BB', 'CC', 'DD', 'nan']}
df_2 = pd.DataFrame(data_2)
The first dataframe contains the data I want to keep, given that the first two characters are the same and in the same location in the second dataframe. For example, compare 'column A' between both dataframes. Keep only 'AA (1)' in column one since the 'AA' substring is in both dataframes and in the same location, but drop 'DD (6)'.
Here is the desired output:
data_3 = {'column A': ['AA (1)', 'nan', 'nan', 'nan', 'nan'],
'column B': ['nan', 'nan', 'CC (3)', 'nan', 'nan'],
'column C': ['nan', 'nan', 'CC (7)', 'DD (5)', 'nan']}
df_3 = pd.DataFrame(data_3)
def match(df_11, df_2):
new_df = pd.DataFrame([])
concat_df = pd.concat([df_1, df_2], axis=1)
for i in df_1.columns:
new_df[i] = concat_df[i].apply(
lambda x: ['nan', x[0]][x[1].startswith(x[0][:2])], axis=1)
return new_df
df_3 = match(df_1, df_2)
print(df_3)