I have a dataframe containing daily data
import pandas as pd
import numpy as np
# Set the random seed for reproducibility
np.random.seed(42)
# Generate random data
dates = pd.date_range(start='2020-06-25', periods=1679, freq='D')
open_prices = np.random.uniform(low=100, high=200, size=len(dates))
high_prices = open_prices + np.random.uniform(low=0, high=10, size=len(dates))
low_prices = open_prices - np.random.uniform(low=0, high=10, size=len(dates))
close_prices = np.random.uniform(low=low_prices, high=high_prices)
# Create the DataFrame
ohlc_data = pd.DataFrame({
'Open': open_prices,
'High': high_prices,
'Low': low_prices,
'Close': close_prices
}, index=dates)
Output:
Open High Low Close
Date
2020-06-25 137.454012 144.403523 129.235702 143.945741
2020-06-26 195.071431 198.532428 188.476323 195.793458
2020-06-27 173.199394 182.955496 165.236980 181.584150
2020-06-28 159.865848 166.275569 157.146388 164.104760
2020-06-29 115.601864 123.826670 108.678275 117.641837
... ... ... ... ...
2025-01-24 179.003044 184.073640 173.358878 175.845545
2025-01-25 130.467914 132.347118 124.462082 130.051784
2025-01-26 108.091928 108.861645 106.429375 108.432944
2025-01-27 140.298018 147.259579 136.500644 145.721364
2025-01-28 117.352451 121.180439 111.180509 115.331552
I need to resample data to 3 days starting from the first day for each year in DataFrame
agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
resampled = ohlc_data.resample('3D').agg(agg)
resampled
Output:
Open High Low Close
Date
2020-06-25 137.454012 198.532428 129.235702 181.584150
2020-06-28 159.865848 166.275569 108.678275 113.720371
2020-07-01 105.808361 195.845186 96.417676 161.755865
2020-07-04 170.807258 198.739371 100.149184 192.952485
2020-07-07 183.244264 188.269925 115.280878 118.845486
... ... ... ... ...
2025-01-15 185.142496 195.768948 111.549399 191.451661
2025-01-18 111.636640 184.883239 109.047597 136.075629
2025-01-21 187.797432 191.054177 175.267831 176.687211
2025-01-24 179.003044 184.073640 106.429375 108.432944
2025-01-27 140.298018 147.259579 111.180509 115.331552
first year:
resampled.loc['2020-01-01': '2020-06-26']
Output:
Open High Low Close
Date
2020-06-25 137.454012 198.532428 129.235702 181.584150
2020-06-28 159.865848 166.275569 108.678275 113.720371
This is okay for now because I don't have data before 2020-06-25
second year:
resampled.loc['2021-01-01': '2021-01-06']
Output:
Open High Low Close
Date
2021-01-03 190.041806 192.176919 128.095988 134.056420
2021-01-06 134.920957 195.614503 129.865870 189.599085
resampling here start from 2021-01-03 , I need it to start from 2021-01-01
third year:
resampled.loc['2022-01-01': '2022-01-06']
Output:
Open High Low Close
Date
2022-01-01 140.348287 147.026333 98.926715 103.562334
2022-01-04 175.513726 177.367027 158.572894 169.912020
resampling in this year work as what I need, starting from 2022-01-01
I tried using origin parameter:
agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
ori = str(ohlc_data.index[0].date().replace(month=1, day=1))
resampled = ohlc_data.resample('3D', origin=ori).agg(agg)
but this works only with first year in DataFrame
Here's one approach:
agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
out = (
ohlc_data.groupby(ohlc_data.index.year, group_keys=False)
.apply(
lambda g: g.resample('3D', origin=pd.Timestamp(g.name, 1, 1))
.agg(agg)
)
)
Output:
out[~out.index.year.duplicated()]
Open High Low Close
2020-06-23 137.454012 144.403523 129.235702 143.945741
2021-01-01 109.310277 197.899429 106.286082 188.085355
2022-01-01 140.348287 147.026333 98.926715 103.562334
2023-01-01 186.846798 189.998543 144.825695 189.588286
2024-01-01 151.771164 160.336537 101.736241 128.839795
2025-01-01 130.312836 137.712451 124.656259 135.872983
Explanation
df.groupby
to group by year (DatetimeIndex.year
).groupby.apply
+ df.resample
. This way we can access name
for each group to create the appropriate origin
.