Search code examples
pythonpandasmergepython-datetime

Pandas: Merge two dataframes with different dates, using interpolate for numeric values and ffill for dates and boolean data


I have two pandas dataframes to merge with interpolate() on the missing numeric data and with ffill() on boolean data and dates. The source data looks like this:

df1
         Date  Unemployment Rate  Federal Debt
0  1969-01-01                3.4      36.19577
1  1969-04-01                3.4      34.97403
2  1969-07-01                3.5      35.01946

df2
           President      Start        End  Republican  Democrat
1969-01-20     Nixon 1969-01-20 1974-08-09        True     False
1974-08-09      Ford 1974-08-09 1977-01-20        True     False
1977-01-20    Carter 1977-01-20 1981-01-20       False      True

The desired output is something like:

df3
      Date  Unemployment Rate  Federal Debt  President       Start         End  Republican  Democrat
1969-01-01                3.4      36.19577        NaN         NaN         NaN         NaN       NaN
1969-01-20                3.4      35.93785      Nixon  1969-01-20  1974-08-09        True     False
1969-04-01                3.4      34.97403      Nixon  1969-01-20  1974-08-09        True     False
1969-07-01                3.5      35.01946      Nixon  1969-01-20  1974-08-09        True     False
...
1977-01-01                7.5      33.65136      Ford   1974-08-09  1977-01-20        True     False
1977-01-20                7.4      33.47252      Carter 1977-01-20  1981-01-20       False      True
1977-04-01                7.2      32.80422      Carter 1977-01-20  1981-01-20       False      True

In words, I have economic data that comes either at daily, monthly, quarterly, or annual rate and political data on the presidential term dates. The purpose of merging the dataframes is to compute statistics for each president's term and to get the (interpolated) data for the start/end of their terms, and eventually to graph this.

My difficulties seem to be mostly about merging two datasets. I show below how far I got.

import pandas as pd
from pandas import Timestamp
import numpy as np  # np.nan

df1 = pd.DataFrame.from_dict({'Date': {0: '1969-01-01', 1: '1969-04-01', 2: '1969-07-01', 3: '1969-10-01', 4: '1970-01-01', 5: '1970-04-01', 6: '1970-07-01', 7: '1970-10-01', 8: '1971-01-01', 9: '1971-04-01', 10: '1971-07-01', 11: '1971-10-01', 12: '1972-01-01', 13: '1972-04-01', 14: '1972-07-01', 15: '1972-10-01', 16: '1973-01-01', 17: '1973-04-01', 18: '1973-07-01', 19: '1973-10-01', 20: '1974-01-01', 21: '1974-04-01', 22: '1974-07-01', 23: '1974-10-01', 24: '1975-01-01', 25: '1975-04-01', 26: '1975-07-01', 27: '1975-10-01', 28: '1976-01-01', 29: '1976-04-01', 30: '1976-07-01', 31: '1976-10-01', 32: '1977-01-01', 33: '1977-04-01', 34: '1977-07-01', 35: '1977-10-01', 36: '1978-01-01', 37: '1978-04-01', 38: '1978-07-01', 39: '1978-10-01', 40: '1979-01-01', 41: '1979-04-01', 42: '1979-07-01', 43: '1979-10-01'}, 'Unemployment Rate': {0: 3.4, 1: 3.4, 2: 3.5, 3: 3.7, 4: 3.9, 5: 4.6, 6: 5.0, 7: 5.5, 8: 5.9, 9: 5.9, 10: 6.0, 11: 5.8, 12: 5.8, 13: 5.7, 14: 5.6, 15: 5.6, 16: 4.9, 17: 5.0, 18: 4.8, 19: 4.6, 20: 5.1, 21: 5.1, 22: 5.5, 23: 6.0, 24: 8.1, 25: 8.8, 26: 8.6, 27: 8.4, 28: 7.9, 29: 7.7, 30: 7.8, 31: 7.7, 32: 7.5, 33: 7.2, 34: 6.9, 35: 6.8, 36: 6.4, 37: 6.1, 38: 6.2, 39: 5.8, 40: 5.9, 41: 5.8, 42: 5.7, 43: 6.0}, 'Federal Debt': {0: 36.19577, 1: 34.97403, 2: 35.01946, 3: 35.46954, 4: 35.38879, 5: 34.67329, 6: 34.86717, 7: 35.74822, 8: 34.50345, 9: 34.36089, 10: 35.00694, 11: 35.63237, 12: 34.72622, 13: 33.67383, 14: 33.62447, 15: 33.74758, 16: 33.29287, 17: 32.34466, 18: 32.12455, 19: 31.77379, 20: 31.76449, 21: 30.99462, 22: 30.86269, 23: 30.79768, 24: 31.53604, 25: 32.27817, 26: 32.38043, 27: 32.7301, 28: 32.98513, 29: 33.49464, 30: 33.64333, 31: 33.78753, 32: 33.65136, 33: 32.80422, 34: 32.98791, 35: 33.21873, 36: 33.5012, 37: 32.12444, 38: 32.21407, 39: 31.86206, 40: 31.53601, 41: 31.06277, 42: 30.98402, 43: 31.02615}})
df2 = pd.DataFrame.from_dict({'President': {'1969-01-20': 'Nixon', '1974-08-09': 'Ford', '1977-01-20': 'Carter'}, 'Start': {'1969-01-20': '1969-01-20', '1974-08-09': '1974-08-09', '1977-01-20': '1977-01-20'}, 'End': {'1969-01-20': '1974-08-09', '1974-08-09': '1977-01-20', '1977-01-20': '1981-01-20'}, 'Republican': {'1969-01-20': True, '1974-08-09': True, '1977-01-20': False}, 'Democrat': {'1969-01-20': False, '1974-08-09': False, '1977-01-20': True}})


# make dates | utc=True appeared to be needed for resampling below
df1["Date"] = pd.to_datetime(df1["Date"], utc=True)
df2["Start"] = pd.to_datetime(df2["Start"], utc=True)
df2["End"] = pd.to_datetime(df2["End"], utc=True)

# create a common date column for merging
df2["Date"] = df2["Start"]
df3 = pd.merge(df1, df2, how="outer", on="Date")
df3.sort_values("Date", inplace=True)
df3.set_index("Date", drop=False, inplace=True)

# select the numeric data columns
cols = df3.select_dtypes(include="number").columns.to_list()
# cols = ['Unemployment Rate', 'Federal Debt']


# The dataframe has missing values because the dates for the presidential terms do not coincide with the dates for which the data is published.
# I interpolated the data, but couldn't merge it back in.
"""
df3[cols]
                           Unemployment Rate  Federal Debt
Date                                                      
1969-01-01 00:00:00+00:00                3.4      36.19577
1969-01-20 00:00:00+00:00                NaN           NaN
1969-04-01 00:00:00+00:00                3.4      34.97403
"""

# resample and interpolate to replace na values
df4 = df3[cols].resample("D").interpolate(method="linear").reset_index()

# merge the numeric columns back into the dataframe | merge back
df5 = pd.merge(df3, df4, how="inner", on=cols)# tried "left" too

"""
df4.head(30)
                        Date  Unemployment Rate  Federal Debt
0  1969-01-01 00:00:00+00:00                3.4     36.195770
1  1969-01-02 00:00:00+00:00                3.4     36.182195
2  1969-01-03 00:00:00+00:00                3.4     36.168620
"""

# desired outcome for df5:
"""
df5[cols]
                           Unemployment Rate  Federal Debt
Date                                                      
1969-01-01 00:00:00+00:00                3.4      36.19577
1969-01-20 00:00:00+00:00                3.4      35.937847
1969-04-01 00:00:00+00:00                3.4      34.97403
"""

# fill missing rows forward for selected columns | merge back
cols = ["Start", "End", "President"]
df3[cols] = df3[cols].ffill(axis=0)

# This matches the desired output
"""
df3[cols]
                                              Start                       End President
Date                                                                                   
1969-01-01 00:00:00+00:00                       NaT                       NaT       NaN
1969-01-20 00:00:00+00:00 1969-01-20 00:00:00+00:00 1974-08-09 00:00:00+00:00     Nixon
1969-04-01 00:00:00+00:00 1969-01-20 00:00:00+00:00 1974-08-09 00:00:00+00:00     Nixon
"""

# fill boolean Republican/Democrat with appropriate True/False | merge back
cols = ["Republican", "Democrat"]
df3[cols] = df3[cols].ffill(axis=0)

# also matches the desired output
"""
df3[cols]
                          Republican Democrat
Date                                         
1969-01-01 00:00:00+00:00        NaN      NaN
1969-01-20 00:00:00+00:00       True    False
1969-04-01 00:00:00+00:00       True    False

To sum up, I have managed to compute just about what I needed, but couldn't merge it back nicely into a clean dataframe for further processing. Either I dropped too many rows, too few, or I still had missing data where it should have been interpolated/filled.


Solution

  • IIUC, right after the merge, just interpolate on time and ffill the columns of interest :

    df1["Date"] = pd.to_datetime(df1["Date"])
    df2[["Start", "End"]] = df2[["Start", "End"]].apply(pd.to_datetime)
    
    tmp = (
        pd.merge(df1, df2, how="outer", left_on=df1["Date"], right_on=df2["Start"])
            .sort_values("key_0", ignore_index=True).drop("Date", axis=1)
            .rename({"key_0": "Date"}, axis=1).set_index("Date")
    )
    
    ffill_cols = ["Republican", "Democrat"] + ["Start", "End"] + ["President"]
    inter_cols = ["Unemployment Rate", "Federal Debt"]
    
    out1 = (tmp.assign(
             **{col: tmp[col].ffill() for col in ffill_cols},
             **{col: tmp[col].interpolate("time") for col in inter_cols})
               .reset_index())
    

    Note that if you want to ffill all the columns, you can use merge_ordered :

    out2 = (
        pd.merge_ordered(df1, df2, how="outer",
            left_on=df1["Date"], right_on=df2["Start"], fill_method="ffill")
            .sort_values("key_0").drop("Date", axis=1)
            .rename({"key_0": "Date"}, axis=1)
    )
    

    Output :

    print(out1)
    
             Date  Unemployment Rate  Federal Debt President      Start        End Republican Democrat
    0  1969-01-01               3.40         36.20       NaN        NaT        NaT        NaN      NaN
    1  1969-01-20               3.40         35.94     Nixon 1969-01-20 1974-08-09       True    False
    2  1969-04-01               3.40         34.97     Nixon 1969-01-20 1974-08-09       True    False
    3  1969-07-01               3.50         35.02     Nixon 1969-01-20 1974-08-09       True    False
    4  1969-10-01               3.70         35.47     Nixon 1969-01-20 1974-08-09       True    False
    ..        ...                ...           ...       ...        ...        ...        ...      ...
    42 1978-10-01               5.80         31.86    Carter 1977-01-20 1981-01-20      False     True
    43 1979-01-01               5.90         31.54    Carter 1977-01-20 1981-01-20      False     True
    44 1979-04-01               5.80         31.06    Carter 1977-01-20 1981-01-20      False     True
    45 1979-07-01               5.70         30.98    Carter 1977-01-20 1981-01-20      False     True
    46 1979-10-01               6.00         31.03    Carter 1977-01-20 1981-01-20      False     True
    
    [47 rows x 8 columns]