I am trying to perform a change data capture (CDC) operation in Python. I want to union the unchanged data from the master file (base table) with the data from the new file (delta file).
Below is the function I have written:
def processInputdata():
    df1 = pd.read_csv('master.csv')
    df2 = pd.read_csv('delta.csv')
    df = pd.merge(df1, df2, on=['cust_id','cust_id'], how="outer", indicator=True)
    dfo = df[df['_merge']=='left_only']
    dfT = pd.merge(dfo, df2, on=['cust_id','cust_id'], how="right", indicator=True)
This is not working. Below is the error message:
ValueError: Cannot use name of an existing column for indicator column
I am not sure if there is any simpler or better approach to perform CDC.
Sample data:
Master file:
   cust_id  cust_name  cust_income  cust_phone
0      111          a        78000        sony
1      222          b         8000         jio
2      333          c       108000      iphone
3      444          d       200000     iphoneX
4      555          e        20000     samsung
Delta file:
   cust_id  cust_name  cust_income  cust_phone
0      222          b        20000         jio
1      333          c       120000     iphoneX
2      666          f        76000     oneplus
Expected output:
   cust_id  cust_name  cust_income  cust_phone
0      111          a        78000        sony
1      222          b        20000         jio
2      333          c       120000     iphoneX
3      444          d       200000     iphoneX
4      555          e        20000     samsung
5      666          f        76000     oneplus
Concatenate the two frames with pd.concat, then call drop_duplicates on cust_id with keep='last' so that the delta row wins over the master row for every customer that appears in both files:
df = pd.concat([master, delta])\
       .drop_duplicates(subset=['cust_id'], keep='last')\
       .sort_values('cust_name').reset_index(drop=True)
   cust_id  cust_name  cust_income  cust_phone
0      111          a        78000        sony
1      222          b        20000         jio
2      333          c       120000     iphoneX
3      444          d       200000     iphoneX
4      555          e        20000     samsung
5      666          f        76000     oneplus
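As for the ValueError in the question: the first merge with indicator=True adds a `_merge` column, the filtered frame dfo still carries it, and the second merge with indicator=True then refuses to create another column with the same name. The merge-based route can be kept by dropping `_merge` before reusing the frame. Below is a minimal sketch of that idea, assuming cust_id is unique within each file; the helper name apply_delta and the inline sample frames are illustrative and not part of the original post.

```python
import pandas as pd

# Sample data copied from the question, built inline instead of read from CSV.
master = pd.DataFrame({
    "cust_id": [111, 222, 333, 444, 555],
    "cust_name": ["a", "b", "c", "d", "e"],
    "cust_income": [78000, 8000, 108000, 200000, 20000],
    "cust_phone": ["sony", "jio", "iphone", "iphoneX", "samsung"],
})
delta = pd.DataFrame({
    "cust_id": [222, 333, 666],
    "cust_name": ["b", "c", "f"],
    "cust_income": [20000, 120000, 76000],
    "cust_phone": ["jio", "iphoneX", "oneplus"],
})


def apply_delta(master, delta, key="cust_id"):
    """Hypothetical helper: return master with the delta rows applied."""
    # Merge against the key column only, so no _x/_y suffixed columns appear.
    flagged = master.merge(delta[[key]], on=key, how="left", indicator=True)
    # Rows found only in master are unchanged. Drop the helper column so it
    # cannot clash with any later merge that also uses indicator=True.
    unchanged = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")
    # Union of the unchanged rows and all delta rows (updates and inserts).
    return (pd.concat([unchanged, delta], ignore_index=True)
              .sort_values(key)
              .reset_index(drop=True))


result = apply_delta(master, delta)
print(result)
```

For this sample data the result has the same six rows as the expected output in the question. The concat plus drop_duplicates approach is shorter; the merge-based version may be preferable when the unchanged, updated, and inserted rows also need to be reported separately.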