python, dataframe, pyspark, databricks, delta

Easiest way to get delta between two DataFrames


I am working in Microsoft Azure Databricks with two DataFrames.

I already have a DataFrame which contains my "master data". I also receive a full data package daily with "all" records. But within this daily DataFrame, records can be changed, and records can also be deleted or added.

What is the best and maybe easiest way to get this delta or changeset of data between the two DataFrames?

UPDATE: DataFrame 1 -> the one I receive every day

customer  score
MERCEDES  1.1
CHRYSLER  3.0

DataFrame 2 -> my master

customer  score
BMW       1.1
MERCEDES  1.3

So what I need to get is:

customer  score
BMW       1.1    -> because it was deleted in the received data
MERCEDES  1.3    -> because its value was changed
CHRYSLER  3.0    -> because it was newly added

Solution

  • Here is pandas' merge function with indicator=True; see if it works for you. The indicator column records which frame each row came from, which is exactly what you need for the delta (see the filtering sketch after the result).

    import pandas as pd
    
    # Daily feed (df1) and master data (df2) from the example above
    df1 = pd.DataFrame({'customer': ['MERCEDES', 'CHRYSLER'], 'score': [1.1, 3.0]})
    df2 = pd.DataFrame({'customer': ['BMW', 'MERCEDES'], 'score': [1.1, 1.3]})
    
    # Full outer merge keeps rows from both frames; indicator=True adds
    # a _merge column marking each row as 'left_only', 'right_only' or 'both'
    df = pd.merge(df1, df2, on=['customer'], how='outer', indicator=True)
    df
    

    see the result:

       customer  score_x  score_y      _merge
    0  MERCEDES      1.1      1.3        both
    1  CHRYSLER      3.0      NaN   left_only
    2       BMW      NaN      1.1  right_only
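
    The _merge column is what gives you the changeset. As a minimal follow-up sketch (the score_x/score_y names come from pandas' default merge suffixes; "left" is the daily feed df1 and "right" is the master df2), you can filter the merged frame down to exactly the delta from the question:

    # Keep rows missing on one side, or present on both sides with a changed score
    delta = df[(df['_merge'] != 'both') | (df['score_x'] != df['score_y'])].copy()
    
    # Prefer the master score (score_y) and fall back to the incoming
    # score (score_x) for newly added customers
    delta['score'] = delta['score_y'].fillna(delta['score_x'])
    delta = delta[['customer', 'score']]

    This returns MERCEDES 1.3 (value changed), CHRYSLER 3.0 (newly added) and BMW 1.1 (deleted from the feed), matching the expected output above.

    Since you are working in Databricks, here is a minimal PySpark sketch of the same idea, assuming sdf1 (daily feed) and sdf2 (master) are Spark DataFrames with the same columns as the pandas example; the names sdf1, sdf2, score_new and score_master are illustrative, not from the original post:

    from pyspark.sql import functions as F
    
    # Full outer join on the key, keeping the two score columns apart
    joined = (
        sdf1.withColumnRenamed('score', 'score_new')
            .join(sdf2.withColumnRenamed('score', 'score_master'),
                  on='customer', how='full_outer')
    )
    
    # Rows missing on either side, or with a changed score, form the delta;
    # coalesce prefers the master value, falling back to the incoming one
    delta = (
        joined.where(
            F.col('score_new').isNull()
            | F.col('score_master').isNull()
            | (F.col('score_new') != F.col('score_master'))
        )
        .select('customer', F.coalesce('score_master', 'score_new').alias('score'))
    )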