I have two pandas dataframes to merge
with interpolate()
on the missing numeric data and with ffill()
on boolean data and dates. The source data looks like this:
df1
Date Unemployment Rate Federal Debt
0 1969-01-01 3.4 36.19577
1 1969-04-01 3.4 34.97403
2 1969-07-01 3.5 35.01946
df2
President Start End Republican Democrat
1969-01-20 Nixon 1969-01-20 1974-08-09 True False
1974-08-09 Ford 1974-08-09 1977-01-20 True False
1977-01-20 Carter 1977-01-20 1981-01-20 False True
The desired output is something like:
df3
Date Unemployment Rate Federal Debt President Start End Republican Democrat
1969-01-01 3.4 36.19577 NaN NaN NaN NaN NaN
1969-01-20 3.4 35.93785 Nixon 1969-01-20 1974-08-09 True False
1969-04-01 3.4 34.97403 Nixon 1969-01-20 1974-08-09 True False
1969-07-01 3.5 35.01946 Nixon 1969-01-20 1974-08-09 True False
...
1977-01-01 7.5 33.65136 Ford 1974-08-09 1977-01-20 True False
1977-01-20 7.4 33.47252 Carter 1977-01-20 1981-01-20 False True
1977-04-01 7.2 32.80422 Carter 1977-01-20 1981-01-20 False True
In words, I have economic data that comes either at daily, monthly, quarterly, or annual rate and political data on the presidential term dates. The purpose of merging the dataframes is to compute statistics for each president's term and to get the (interpolated) data for the start/end of their terms, and eventually to graph this.
My difficulties seem to be mostly about merging two datasets. I show below how far I got.
import pandas as pd
from pandas import Timestamp
import numpy as np # np.nan
df1 = pd.DataFrame.from_dict({'Date': {0: '1969-01-01', 1: '1969-04-01', 2: '1969-07-01', 3: '1969-10-01', 4: '1970-01-01', 5: '1970-04-01', 6: '1970-07-01', 7: '1970-10-01', 8: '1971-01-01', 9: '1971-04-01', 10: '1971-07-01', 11: '1971-10-01', 12: '1972-01-01', 13: '1972-04-01', 14: '1972-07-01', 15: '1972-10-01', 16: '1973-01-01', 17: '1973-04-01', 18: '1973-07-01', 19: '1973-10-01', 20: '1974-01-01', 21: '1974-04-01', 22: '1974-07-01', 23: '1974-10-01', 24: '1975-01-01', 25: '1975-04-01', 26: '1975-07-01', 27: '1975-10-01', 28: '1976-01-01', 29: '1976-04-01', 30: '1976-07-01', 31: '1976-10-01', 32: '1977-01-01', 33: '1977-04-01', 34: '1977-07-01', 35: '1977-10-01', 36: '1978-01-01', 37: '1978-04-01', 38: '1978-07-01', 39: '1978-10-01', 40: '1979-01-01', 41: '1979-04-01', 42: '1979-07-01', 43: '1979-10-01'}, 'Unemployment Rate': {0: 3.4, 1: 3.4, 2: 3.5, 3: 3.7, 4: 3.9, 5: 4.6, 6: 5.0, 7: 5.5, 8: 5.9, 9: 5.9, 10: 6.0, 11: 5.8, 12: 5.8, 13: 5.7, 14: 5.6, 15: 5.6, 16: 4.9, 17: 5.0, 18: 4.8, 19: 4.6, 20: 5.1, 21: 5.1, 22: 5.5, 23: 6.0, 24: 8.1, 25: 8.8, 26: 8.6, 27: 8.4, 28: 7.9, 29: 7.7, 30: 7.8, 31: 7.7, 32: 7.5, 33: 7.2, 34: 6.9, 35: 6.8, 36: 6.4, 37: 6.1, 38: 6.2, 39: 5.8, 40: 5.9, 41: 5.8, 42: 5.7, 43: 6.0}, 'Federal Debt': {0: 36.19577, 1: 34.97403, 2: 35.01946, 3: 35.46954, 4: 35.38879, 5: 34.67329, 6: 34.86717, 7: 35.74822, 8: 34.50345, 9: 34.36089, 10: 35.00694, 11: 35.63237, 12: 34.72622, 13: 33.67383, 14: 33.62447, 15: 33.74758, 16: 33.29287, 17: 32.34466, 18: 32.12455, 19: 31.77379, 20: 31.76449, 21: 30.99462, 22: 30.86269, 23: 30.79768, 24: 31.53604, 25: 32.27817, 26: 32.38043, 27: 32.7301, 28: 32.98513, 29: 33.49464, 30: 33.64333, 31: 33.78753, 32: 33.65136, 33: 32.80422, 34: 32.98791, 35: 33.21873, 36: 33.5012, 37: 32.12444, 38: 32.21407, 39: 31.86206, 40: 31.53601, 41: 31.06277, 42: 30.98402, 43: 31.02615}})
df2 = pd.DataFrame.from_dict({'President': {'1969-01-20': 'Nixon', '1974-08-09': 'Ford', '1977-01-20': 'Carter'}, 'Start': {'1969-01-20': '1969-01-20', '1974-08-09': '1974-08-09', '1977-01-20': '1977-01-20'}, 'End': {'1969-01-20': '1974-08-09', '1974-08-09': '1977-01-20', '1977-01-20': '1981-01-20'}, 'Republican': {'1969-01-20': True, '1974-08-09': True, '1977-01-20': False}, 'Democrat': {'1969-01-20': False, '1974-08-09': False, '1977-01-20': True}})
# make dates | utc=True appeared to be needed for resampling below
df1["Date"] = pd.to_datetime(df1["Date"], utc=True)
df2["Start"] = pd.to_datetime(df2["Start"], utc=True)
df2["End"] = pd.to_datetime(df2["End"], utc=True)
# create a common date column for merging
df2["Date"] = df2["Start"]
df3 = pd.merge(df1, df2, how="outer", on="Date")
df3.sort_values("Date", inplace=True)
df3.set_index("Date", drop=False, inplace=True)
# select the numeric data columns
cols = df3.select_dtypes(include="number").columns.to_list()
# cols = ['Unemployment Rate', 'Federal Debt']
# The dataframe has missing values because the dates for the presidential terms do not coincide with the dates for which the data is published.
# I interpolated the data, but couldn't merge it back in.
"""
df3[cols]
Unemployment Rate Federal Debt
Date
1969-01-01 00:00:00+00:00 3.4 36.19577
1969-01-20 00:00:00+00:00 NaN NaN
1969-04-01 00:00:00+00:00 3.4 34.97403
"""
# resample and interpolate to replace na values
df4 = df3[cols].resample("D").interpolate(method="linear").reset_index()
# merge the numeric columns back into the dataframe | merge back
df5 = pd.merge(df3, df4, how="inner", on=cols)# tried "left" too
"""
df4.head(30)
Date Unemployment Rate Federal Debt
0 1969-01-01 00:00:00+00:00 3.4 36.195770
1 1969-01-02 00:00:00+00:00 3.4 36.182195
2 1969-01-03 00:00:00+00:00 3.4 36.168620
"""
# desired outcome for df5:
"""
df5[cols]
Unemployment Rate Federal Debt
Date
1969-01-01 00:00:00+00:00 3.4 36.19577
1969-01-20 00:00:00+00:00 3.4 35.937847
1969-04-01 00:00:00+00:00 3.4 34.97403
"""
# fill missing rows forward for selected columns | merge back
cols = ["Start", "End", "President"]
df3[cols] = df3[cols].ffill(axis=0)
# This matches the desired output
"""
df3[cols]
Start End President
Date
1969-01-01 00:00:00+00:00 NaT NaT NaN
1969-01-20 00:00:00+00:00 1969-01-20 00:00:00+00:00 1974-08-09 00:00:00+00:00 Nixon
1969-04-01 00:00:00+00:00 1969-01-20 00:00:00+00:00 1974-08-09 00:00:00+00:00 Nixon
"""
# fill boolean Republican/Democrat with appropriate True/False | merge back
cols = ["Republican", "Democrat"]
df3[cols] = df3[cols].ffill(axis=0)
# also matches the desired output
"""
df3[cols]
Republican Democrat
Date
1969-01-01 00:00:00+00:00 NaN NaN
1969-01-20 00:00:00+00:00 True False
1969-04-01 00:00:00+00:00 True False
To sum up, I have managed to compute just about what I needed, but couldn't merge it back nicely into a clean dataframe for further processing. Either I dropped too many rows, too few, or I still had missing data where it should have been interpolated/filled.
IIUC, right after the merge, just interpolate
on time
and ffill
the columns of interest :
df1["Date"] = pd.to_datetime(df1["Date"])
df2[["Start", "End"]] = df2[["Start", "End"]].apply(pd.to_datetime)
tmp = (
pd.merge(df1, df2, how="outer", left_on=df1["Date"], right_on=df2["Start"])
.sort_values("key_0", ignore_index=True).drop("Date", axis=1)
.rename({"key_0": "Date"}, axis=1).set_index("Date")
)
ffill_cols = ["Republican", "Democrat"] + ["Start", "End"] + ["President"]
inter_cols = ["Unemployment Rate", "Federal Debt"]
out1 = (tmp.assign(
**{col: tmp[col].ffill() for col in ffill_cols},
**{col: tmp[col].interpolate("time") for col in inter_cols})
.reset_index())
Note that if you want to ffill all the columns, you can use merge_ordered
:
out2 = (
pd.merge_ordered(df1, df2, how="outer",
left_on=df1["Date"], right_on=df2["Start"], fill_method="ffill")
.sort_values("key_0").drop("Date", axis=1)
.rename({"key_0": "Date"}, axis=1)
)
Output :
print(out1)
Date Unemployment Rate Federal Debt President Start End Republican Democrat
0 1969-01-01 3.40 36.20 NaN NaT NaT NaN NaN
1 1969-01-20 3.40 35.94 Nixon 1969-01-20 1974-08-09 True False
2 1969-04-01 3.40 34.97 Nixon 1969-01-20 1974-08-09 True False
3 1969-07-01 3.50 35.02 Nixon 1969-01-20 1974-08-09 True False
4 1969-10-01 3.70 35.47 Nixon 1969-01-20 1974-08-09 True False
.. ... ... ... ... ... ... ... ...
42 1978-10-01 5.80 31.86 Carter 1977-01-20 1981-01-20 False True
43 1979-01-01 5.90 31.54 Carter 1977-01-20 1981-01-20 False True
44 1979-04-01 5.80 31.06 Carter 1977-01-20 1981-01-20 False True
45 1979-07-01 5.70 30.98 Carter 1977-01-20 1981-01-20 False True
46 1979-10-01 6.00 31.03 Carter 1977-01-20 1981-01-20 False True
[47 rows x 8 columns]