I am working in Microsoft Azure Databricks with two DataFrames.
I already have a DataFrame that contains my master data. Every day I also receive a full data package with all records. Compared with the master, records in this daily DataFrame may have been changed, deleted, or added.
What is the best, and ideally the easiest, way to get the delta or changeset between the two DataFrames?
UPDATE: DataFrame 1 -> the one I receive every day
customer score
MERCEDES 1.1
CHRYSLER 3.0
DataFrame 2 -> my master
customer score
BMW 1.1
MERCEDES 1.3
So what I need to get is:
customer score
BMW 1.1 -> because it was deleted from the received data
MERCEDES 1.3 -> because its value changed
CHRYSLER 3.0 -> because it was newly added
Here is an approach using the pandas merge function. See if it works for you.
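import pandas as pd
# df1: daily full load; df2: master data
df1 = pd.DataFrame({'customer': ['MERCEDES', 'CHRYSLER'], 'score': [1.1, 3.0]})
df2 = pd.DataFrame({'customer': ['BMW', 'MERCEDES'], 'score': [1.1, 1.3]})
# Outer merge on customer; indicator=True adds a _merge column
# (left_only, right_only, or both) showing which frame each row came from
df = pd.merge(df1, df2, on=['customer'], how='outer', indicator=True)
df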
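See the result: one row per customer, with score_x (from df1), score_y (from df2), and a _merge column whose value is left_only, right_only, or both.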
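To turn the merged frame into the changeset asked for in the question, the _merge indicator can be combined with a comparison of the two score columns. The following is only a minimal sketch under some assumptions: it relies on the default pandas suffixes (score_x for the daily frame, score_y for the master), the column name change and its labels (added, deleted, changed, unchanged) are hypothetical names chosen for illustration, and it works on pandas DataFrames, so Spark DataFrames in Databricks would first need converting or an equivalent Spark join.

```python
import pandas as pd

# Same sample data as in the answer: df1 is the daily full load, df2 is the master.
df1 = pd.DataFrame({'customer': ['MERCEDES', 'CHRYSLER'], 'score': [1.1, 3.0]})
df2 = pd.DataFrame({'customer': ['BMW', 'MERCEDES'], 'score': [1.1, 1.3]})

# Outer merge keeps every customer from both frames.
# Default suffixes: score_x comes from df1 (daily), score_y from df2 (master).
merged = pd.merge(df1, df2, on='customer', how='outer', indicator=True)

# Rows present in both frames whose score differs count as "changed".
is_both = merged['_merge'] == 'both'
differs = merged['score_x'] != merged['score_y']

# 'change' is a hypothetical label column, not something pandas produces itself.
# astype(str) avoids assigning a new label into the categorical _merge dtype.
merged['change'] = merged['_merge'].astype(str).map({
    'left_only': 'added',     # only in the daily load
    'right_only': 'deleted',  # only in the master
    'both': 'unchanged',      # refined on the next line
})
merged.loc[is_both & differs, 'change'] = 'changed'

# The delta is everything that is not unchanged.
delta = merged[merged['change'] != 'unchanged']
print(delta[['customer', 'score_x', 'score_y', 'change']])
```

Comparing floats with != is fine for this sample, but for computed scores a tolerance-based comparison may be the safer choice.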