I have a DataFrame that contains a list of unordered project IDs and their respective leaders. I want to compare a new dataframe's values with the ground truth to verify that all the project IDs in the new data frame are both; in the ground truth and have the same leader. Sometimes, the new DataFrame may have different column names but they are always two and always mean the same thing.
For example, the ground truth DataFrame (df_gt):
Project ID Leader
123 Owen Wilson
122 Samuel Jackson
145 Jack Black
134 Natalie Portman
While the new DataFrame might be (df_new):
Project Leader
123 Owen Wilson
122 Henry Cavil
144 Natalie Portman
146 Jack Black
My desired output would be a list of all the deviations from the ground truth that are found in the new DataFrame. In this case:
errors == [[122, Henry Cavil], [144, Natalie Portman], [146, Jack Black]]
what is the easiest way of doing this to accommodate multiple potential new dataframes each with their own naming conventions for the columns but the same intended values.
Another solution is using .merge
with indicator=
:
x = df1.merge(
df2,
left_on=["Project ID", "Leader"],
right_on=["Project", "Leader"],
indicator=True,
how="right",
)
errors = (
x.loc[x._merge.eq("right_only")]
.apply(lambda x: [x["Project"], x["Leader"]], axis=1)
.to_list()
)
print(errors)
Prints:
[[122, 'Henry Cavil'], [144, 'Natalie Portman'], [146, 'Jack Black']]