I have two pandas dataframes which represent a directory structure with file hashes like
import pandas as pd
dir_old = pd.DataFrame([
{"Filepath": "dir1/file1", "Hash": "hash1"},
{"Filepath": "dir1/file2", "Hash": "hash2"},
{"Filepath": "dir2/file3", "Hash": "hash3"},
])
dir_new = pd.DataFrame([
# {"Filepath": "dir1/file1", "Hash": "hash1"}, # deleted file
{"Filepath": "dir1/file2", "Hash": "hash2"},
{"Filepath": "dir2/file3", "Hash": "hash5"}, # changed file
{"Filepath": "dir1/file4", "Hash": "hash4"}, # new file
])
The dir_new
shows the content of the directory structure after some changes. To compare these two dataframes I use
df_merged = pd.merge(dir_new, dir_old, how='outer', indicator=True)
print(df_merged)
This will return
Filepath Hash _merge
0 dir1/file1 hash1 right_only
1 dir1/file2 hash2 both
2 dir1/file4 hash4 left_only
3 dir2/file3 hash3 right_only
4 dir2/file3 hash5 left_only
It is easy to identify the right_only
rows as deleted
, both
as unchanged and left_only
as new files. However what to do about the modified file dir/file3
which appears twice as right_only
and left_only
? I did the following:
# The indicator columns _merge has categorical values.
# We need to convert it to string to be able to add later a new value `modified`
df_merged["State"] = df_merged["_merge"].astype(str)
df_merged = df_merged.drop(columns=["_merge"])
# Identify the rows with duplicated filepath and only keep the new (left_only) ones
modified = df_merged[df_merged.duplicated(subset=["Filepath"], keep=False)]
keep = modified[modified["State"] == "left_only"]
drop = modified[modified["State"] == "right_only"]
# Rename the state of the new modified files to `changed` and drop the old duplicated row
df_merged.iloc[keep.index, df_merged.columns.get_loc("State")] = "changed"
df_dropped = df_merged.drop(drop.index)
# Finally rename the State for all the remaining rows
df_final = df_dropped.replace(to_replace=["right_only", "left_only", "both"],
value=["deleted", "created", "equal"]).reset_index(drop=True)
print(df_final)
The output is
Filepath Hash State
0 dir1/file1 hash1 deleted
1 dir1/file2 hash2 equal
2 dir1/file4 hash4 created
3 dir2/file3 hash5 changed
So it works. But is strikes me as a very complicated solution. Is there maybe a smarter way create a diff between these two dataframes and especially to identify the modified rows between dir_old
and dir_new
?
I'd do it first by merging only by Filepath
and then compare Hash_x
/Hash_y
and indicator accordingly (seems straightforward to me):
df = dir_new.merge(dir_old, on="Filepath", how="outer", indicator=True)
hash_changed = df["Hash_x"] != df["Hash_y"]
deleted = df["_merge"] == "right_only"
created = df["_merge"] == "left_only"
changed = (df["_merge"] == "both") & hash_changed
unchanged = (df["_merge"] == "both") & ~hash_changed
df.loc[deleted, "Status"] = "deleted"
df.loc[created, "Status"] = "created"
df.loc[changed, "Status"] = "changed"
df.loc[unchanged, "Status"] = "unchanged"
df["Hash"] = df[["Hash_x", "Hash_y"]].bfill(axis=1)["Hash_x"]
print(df[["Filepath", "Hash", "Status"]])
Prints:
Filepath Hash Status
0 dir1/file1 hash1 deleted
1 dir1/file2 hash2 unchanged
2 dir1/file4 hash4 created
3 dir2/file3 hash5 changed
EDIT: Example of dynamic column names:
df = dir_new.merge(dir_old, on="Filepath", how="outer", indicator=True)
column_names = dir_new.columns.difference(["Filepath"])
hash_changed = df["Hash_x"] != df["Hash_y"]
deleted = df["_merge"] == "right_only"
created = df["_merge"] == "left_only"
changed = (df["_merge"] == "both") & hash_changed
unchanged = (df["_merge"] == "both") & ~hash_changed
df.loc[deleted, "Status"] = "deleted"
df.loc[created, "Status"] = "created"
df.loc[changed, "Status"] = "changed"
df.loc[unchanged, "Status"] = "unchanged"
for c in column_names:
df[c] = df[[f"{c}_x", f"{c}_y"]].bfill(axis=1)[f"{c}_x"]
print(df[["Filepath", *column_names, "Status"]])