Search code examples
pythonpandasdataframecomparison

Pandas comparing a DataFrame against a ground truth DataFrame


I have a DataFrame that contains a list of unordered project IDs and their respective leaders. I want to compare a new dataframe's values with the ground truth to verify that all the project IDs in the new data frame are both; in the ground truth and have the same leader. Sometimes, the new DataFrame may have different column names but they are always two and always mean the same thing.

For example, the ground truth DataFrame (df_gt):

Project ID          Leader
123                Owen Wilson
122                Samuel Jackson
145                Jack Black
134                Natalie Portman

While the new DataFrame might be (df_new):

Project             Leader
123                Owen Wilson
122                Henry Cavil
144                Natalie Portman
146                Jack Black

My desired output would be a list of all the deviations from the ground truth that are found in the new DataFrame. In this case:

errors == [[122, Henry Cavil], [144, Natalie Portman], [146, Jack Black]]

what is the easiest way of doing this to accommodate multiple potential new dataframes each with their own naming conventions for the columns but the same intended values.


Solution

  • Another solution is using .merge with indicator=:

    x = df1.merge(
        df2,
        left_on=["Project ID", "Leader"],
        right_on=["Project", "Leader"],
        indicator=True,
        how="right",
    )
    
    errors = (
        x.loc[x._merge.eq("right_only")]
        .apply(lambda x: [x["Project"], x["Leader"]], axis=1)
        .to_list()
    )
    
    print(errors)
    

    Prints:

    [[122, 'Henry Cavil'], [144, 'Natalie Portman'], [146, 'Jack Black']]