Search code examples
pythonpandasresampling

How do I make Pandas resample starting first day of each year in DataFrame


I have a dataframe containing daily data

import pandas as pd
import numpy as np

# Set the random seed for reproducibility
np.random.seed(42)

# Generate random data
dates = pd.date_range(start='2020-06-25', periods=1679, freq='D')
open_prices = np.random.uniform(low=100, high=200, size=len(dates))
high_prices = open_prices + np.random.uniform(low=0, high=10, size=len(dates))
low_prices = open_prices - np.random.uniform(low=0, high=10, size=len(dates))
close_prices = np.random.uniform(low=low_prices, high=high_prices)

# Create the DataFrame
ohlc_data = pd.DataFrame({
    'Open': open_prices,
    'High': high_prices,
    'Low': low_prices,
    'Close': close_prices
}, index=dates)

Output:

            Open        High        Low         Close
Date        
2020-06-25  137.454012  144.403523  129.235702  143.945741
2020-06-26  195.071431  198.532428  188.476323  195.793458
2020-06-27  173.199394  182.955496  165.236980  181.584150
2020-06-28  159.865848  166.275569  157.146388  164.104760
2020-06-29  115.601864  123.826670  108.678275  117.641837
... ... ... ... ...
2025-01-24  179.003044  184.073640  173.358878  175.845545
2025-01-25  130.467914  132.347118  124.462082  130.051784
2025-01-26  108.091928  108.861645  106.429375  108.432944
2025-01-27  140.298018  147.259579  136.500644  145.721364
2025-01-28  117.352451  121.180439  111.180509  115.331552

I need to resample data to 3 days starting from the first day for each year in DataFrame

agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
resampled = ohlc_data.resample('3D').agg(agg)

resampled

Output:

            Open        High        Low         Close
Date
2020-06-25  137.454012  198.532428  129.235702  181.584150
2020-06-28  159.865848  166.275569  108.678275  113.720371
2020-07-01  105.808361  195.845186  96.417676   161.755865
2020-07-04  170.807258  198.739371  100.149184  192.952485
2020-07-07  183.244264  188.269925  115.280878  118.845486
... ... ... ... ...
2025-01-15  185.142496  195.768948  111.549399  191.451661
2025-01-18  111.636640  184.883239  109.047597  136.075629
2025-01-21  187.797432  191.054177  175.267831  176.687211
2025-01-24  179.003044  184.073640  106.429375  108.432944
2025-01-27  140.298018  147.259579  111.180509  115.331552

first year:

resampled.loc['2020-01-01': '2020-06-26']

Output:

            Open        High        Low         Close
Date
2020-06-25  137.454012  198.532428  129.235702  181.584150
2020-06-28  159.865848  166.275569  108.678275  113.720371

This is okay for now because I don't have data before 2020-06-25

second year:

resampled.loc['2021-01-01': '2021-01-06']

Output:

            Open        High        Low         Close
Date
2021-01-03  190.041806  192.176919  128.095988  134.056420
2021-01-06  134.920957  195.614503  129.865870  189.599085

resampling here start from 2021-01-03 , I need it to start from 2021-01-01

third year:

resampled.loc['2022-01-01': '2022-01-06']

Output:

            Open        High        Low         Close
Date
2022-01-01  140.348287  147.026333  98.926715   103.562334
2022-01-04  175.513726  177.367027  158.572894  169.912020

resampling in this year work as what I need, starting from 2022-01-01

I tried using origin parameter:

agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
ori = str(ohlc_data.index[0].date().replace(month=1, day=1))
resampled = ohlc_data.resample('3D', origin=ori).agg(agg)

but this works only with first year in DataFrame


Solution

  • Here's one approach:

    agg = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}
    
    out = (
        ohlc_data.groupby(ohlc_data.index.year, group_keys=False)
        .apply(
            lambda g: g.resample('3D', origin=pd.Timestamp(g.name, 1, 1))
            .agg(agg)
        )
    )
    

    Output:

    out[~out.index.year.duplicated()]
    
                      Open        High         Low       Close
    2020-06-23  137.454012  144.403523  129.235702  143.945741
    2021-01-01  109.310277  197.899429  106.286082  188.085355
    2022-01-01  140.348287  147.026333   98.926715  103.562334
    2023-01-01  186.846798  189.998543  144.825695  189.588286
    2024-01-01  151.771164  160.336537  101.736241  128.839795
    2025-01-01  130.312836  137.712451  124.656259  135.872983
    

    Explanation