Search code examples
pythonpandasdataframeohlc

Find Yesterday's High Price by Merging Two DF's on Datetime and Date Columns


I'm trying to merge two df's, one df has a datetime column, and the other has just a date column. My application for this is to find yesterday's high price using an OHLC dataset. I've attached some starter code below, but I'll describe what I'm looking for.

Given this intraday dataset:

                  time  current_intraday_high
0  2022-02-11 09:00:00                      1
1  2022-02-11 10:00:00                      2
2  2022-02-11 11:00:00                      3
3  2022-02-11 12:00:00                      4
4  2022-02-11 13:00:00                      5
5  2022-02-14 09:00:00                      6
6  2022-02-14 10:00:00                      7
7  2022-02-14 11:00:00                      8
8  2022-02-14 12:00:00                      9
9  2022-02-14 13:00:00                     10
10 2022-02-15 09:00:00                     11
11 2022-02-15 10:00:00                     12
12 2022-02-15 11:00:00                     13
13 2022-02-15 12:00:00                     14
14 2022-02-15 13:00:00                     15
15 2022-02-16 09:00:00                     16
16 2022-02-16 10:00:00                     17
17 2022-02-16 11:00:00                     18
18 2022-02-16 12:00:00                     19
19 2022-02-16 13:00:00                     20

...and this daily dataframe:

        time  daily_high
0 2022-02-11           5
1 2022-02-14          10
2 2022-02-15          15
3 2022-02-16          20

...how can I merge them together, and have each row of the intraday dataframe contain the previous (business) day's high price, like so:

                  time  current_intraday_high  yesterdays_high
0  2022-02-11 09:00:00                      1              NaN
1  2022-02-11 10:00:00                      2              NaN
2  2022-02-11 11:00:00                      3              NaN
3  2022-02-11 12:00:00                      4              NaN
4  2022-02-11 13:00:00                      5              NaN
5  2022-02-14 09:00:00                      6              5.0
6  2022-02-14 10:00:00                      7              5.0
7  2022-02-14 11:00:00                      8              5.0
8  2022-02-14 12:00:00                      9              5.0
9  2022-02-14 13:00:00                     10              5.0
10 2022-02-15 09:00:00                     11             10.0
11 2022-02-15 10:00:00                     12             10.0
12 2022-02-15 11:00:00                     13             10.0
13 2022-02-15 12:00:00                     14             10.0
14 2022-02-15 13:00:00                     15             10.0
15 2022-02-16 09:00:00                     16             15.0
16 2022-02-16 10:00:00                     17             15.0
17 2022-02-16 11:00:00                     18             15.0
18 2022-02-16 12:00:00                     19             15.0
19 2022-02-16 13:00:00                     20             15.0

(Note the NaN's at the top because we don't have any data for Feb 10, 2022 from the intraday dataset, and see how each row contains the intraday data, plus the PREVIOUS day's max "high" price.)

Minimal reproducible example code below:

import pandas as pd


###################################################
# CREATE MOCK INTRADAY DATAFRAME
###################################################
intraday_date_time = [
    "2022-02-11 09:00:00",
                "2022-02-11 10:00:00",
                "2022-02-11 11:00:00",
                "2022-02-11 12:00:00",
                "2022-02-11 13:00:00",
    "2022-02-14 09:00:00",
                "2022-02-14 10:00:00",
                "2022-02-14 11:00:00",
                "2022-02-14 12:00:00",
                "2022-02-14 13:00:00",
    "2022-02-15 09:00:00",
                "2022-02-15 10:00:00",
                "2022-02-15 11:00:00",
                "2022-02-15 12:00:00",
                "2022-02-15 13:00:00",
    "2022-02-16 09:00:00",
                "2022-02-16 10:00:00",
                "2022-02-16 11:00:00",
                "2022-02-16 12:00:00",
                "2022-02-16 13:00:00",
]
intraday_date_time = pd.to_datetime(intraday_date_time)
intraday_df = pd.DataFrame(
    {
        "time": intraday_date_time,
        "current_intraday_high": [x for x in range(1, 21)],
    },
)
print(intraday_df)
# intraday_df.to_csv('intradayTEST.csv', index=True)


###################################################
# AGGREGATE/UPSAMPLE TO DAILY DATAFRAME
###################################################
# Aggregate to business days using intraday_df
agg_dict = {'current_intraday_high': 'max'}
daily_df = intraday_df.set_index('time').resample('B').agg(agg_dict).reset_index()
daily_df.rename(columns={"current_intraday_high": "daily_high"}, inplace=True)
print(daily_df)
# daily_df.to_csv('dailyTEST.csv', index=True)


###################################################
# MERGE THE TWO DATAFRAMES
###################################################
# Need to merge the daily dataset to the intraday dataset, such that,
# any row on the newly merged/joined/concat'd dataset will have:
# 1. The current intraday datetime in the 'time' column
# 2. The current 'intraday_high' value
# 3. The PREVIOUS DAY's 'daily_high' value
# This doesn't work as the daily_df just gets appended to the bottom
# of the intraday_df due to the datetimes/dates merging
merged_df = pd.merge(intraday_df, daily_df, how='outer', on='time')
print(merged_df)

Solution

  • pd.merge_asof allows you to easily do a merge like this.

    yesterdays_high = (intraday_df.resample('B', on='time')['current_intraday_high'].max()
                                  .shift()
                                  .rename('yesterdays_high')
                                  .reset_index())
    
    merged_df = pd.merge_asof(intraday_df, yesterdays_high)
    print(merged_df)
    

    Output:

                      time  current_intraday_high  yesterdays_high
    0  2022-02-11 09:00:00                      1              NaN
    1  2022-02-11 10:00:00                      2              NaN
    2  2022-02-11 11:00:00                      3              NaN
    3  2022-02-11 12:00:00                      4              NaN
    4  2022-02-11 13:00:00                      5              NaN
    5  2022-02-14 09:00:00                      6              5.0
    6  2022-02-14 10:00:00                      7              5.0
    7  2022-02-14 11:00:00                      8              5.0
    8  2022-02-14 12:00:00                      9              5.0
    9  2022-02-14 13:00:00                     10              5.0
    10 2022-02-15 09:00:00                     11             10.0
    11 2022-02-15 10:00:00                     12             10.0
    12 2022-02-15 11:00:00                     13             10.0
    13 2022-02-15 12:00:00                     14             10.0
    14 2022-02-15 13:00:00                     15             10.0
    15 2022-02-16 09:00:00                     16             15.0
    16 2022-02-16 10:00:00                     17             15.0
    17 2022-02-16 11:00:00                     18             15.0
    18 2022-02-16 12:00:00                     19             15.0
    19 2022-02-16 13:00:00                     20             15.0