I have two data frames with three columns, with identical column names. I want to subtract the value of the third column where the values of the first, and second columns match. I've tried the following:
# Common column names
columns = ["month", "category", "sum"]
# First data frame
data1 = [("jan", "j", 10), ("feb", "f", 20)]
df1 = pd.DataFrame.from_records(data1, columns=columns)
# Second data frame
data2 = [("jan", "j", 9.5), ("mar", "m", 30)]
df2 = pd.DataFrame.from_records(data2, columns=columns)
print(df1) # Observe order of `month`s: jan, feb
print(df2) # Observe order of `month`s: jan, mar
# Subtract `sum` where `month`, and `category` match:
df1.set_index(["month", "category"]).subtract(df2.set_index(["month", "category"])).reset_index()
This yields the following output.
Observe that rows are sorted alphabetically on month
.
month category sum
0 feb f NaN
1 jan j 0.5
2 mar m NaN
How can I maintain the row order of the left hand operand? I.e. how can get the following output (or similar):
month category sum
1 jan j 0.5
0 feb f NaN
2 mar m NaN
pd.merge
will preserve the order of the left operand, then you can calculate the difference between two columns. You can do this for example:
df3 = pd.merge(df1, df2, on=["month", "category"], how="outer")
df3.loc[:, "difference"] = df3["sum_x"] - df3["sum_y"]
Which yields on your data:
month category sum_x sum_y difference
0 jan j 10.0 9.5 0.5
1 feb f 20.0 NaN NaN
2 mar m NaN 30.0 NaN