I'm trying to code in Python after a hiatus of 4 years and facing difficulties in implementing my ideas. I have 2 large excel sheets consisting of more than 12k x 70 values, both have the same number of columns and column names. I've created a primary key for both the dataframes and now I want to compare values from the primary key of one sheet to primary key of another and when there's a match I want to replace certain values(not all) from the old sheet to the new.
eg
df1:
A B C D E F
1 2 3 4 5 6
7 8 9 10 11 12
13 14 15 16 17 18
df2:
A B C D E F
19 20 21 22 23 6
25 26 27 28 29 12
31 32 33 34 35 19
When the operation is done, I need it to look something like this df1:
A B C D E F
1 2 21 22 23 6
7 8 27 28 29 12
13 14 15 16 17 18
The column F is the primary key
I tried the following but it's giving a very strange output
for i in df2['F']:
for j in df1['F']:
if i==j:
df1['D'] = df2['D'].astype(str)
df1['E'] = df2['E'].astype(str)
df1['C'] = df2['C'].astype(str)
Merge can be used for that:
new_df = df1.merge(df2, on='F')
Note that if both dataframes have the same column names, the new_df will contain new column names with '_x' and '_y' suffix.
After merging you can decide which columns to keep
To copy values only if they exist you could:
import numpy as np
new_df['A_x'] = np.where(new_df['A_y'].isnull(), new_df['A_x'], new_df['A_y'])
It will update the cell only if the new value is not null.
Explaining np.where: np.where(, , ) Basically I am checking if the value in the "new column" just merged is null. If it is null then "update" the old column with the current value in the column. If it is not null, then update the old column with the new value from the merged column