python pandas for-loop zip sliding-window

Sliding windows - measuring length of observations on each looped window


Consider this sample code, where zip() creates sliding windows over a dataset and returns them one at a time in a loop.

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']

for x, y in zip(months, months[1:]):
    print(x, y)

# Output of each window will be:
Jan Feb 
Feb Mar
Mar Apr
Apr May

Now suppose I want to calculate, for each window, the share of the window's total length that the first month accounts for.

Example in steps:

  1. For the first window (Jan Feb), calculate the length of Jan as a percentage of the full window (Jan + Feb) and store it in a new variable
  2. For the second window (Feb Mar), calculate the length of Feb as a percentage of the full window (Feb + Mar) and store it in a new variable
  3. Continue this process until the last window

Any suggestions on how I might implement this idea in the for loop are welcome!

Thank you!

EDIT

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']

for x, y, z in zip(months, months[1:], months[2:]):
    print(x, y, z)

# Output of each window will be:
Jan Feb Mar
Feb Mar Apr
Mar Apr May

The goal is to calculate the length of the first two months in each window as a share of the full window length:

  • 1st window: (Jan + Feb) / (Jan + Feb + Mar)
  • 2nd window: (Feb + Mar) / (Feb + Mar + Apr)
  • and so on until the last window

We can now calculate the share of one month in each window (the first month, start). However, how do we adapt this to include more than one month?

Also, instead of using days_in_month, is there a way to use the number of datapoints (rows) in each month?

By number of datapoints (rows) I mean that each month contains many datapoints recorded at a fixed time interval (e.g., every 60 minutes), so one day in a month has 24 datapoints (rows). Example:

                         column
rows             
01-Jan-2010 T00:00        value
01-Jan-2010 T01:00        value
01-Jan-2010 T02:00        value
...                       ...
01-Jan-2010 T23:00        value
02-Jan-2010 T00:00        value
...                       ...

Thank you!


Solution

  • Here is one way. (In my case, months is a PeriodIndex created with pd.period_range.)

    import pandas as pd
    months = pd.period_range(start='2020-01', periods=5, freq='M')
    

    Now iterate over consecutive pairs of months. Each iteration is a two-month window.

    # print header labels
    print('{:10s} {:10s} {:>10s} {:>10s} {:>10s} {:>10s} '.format(
        'start', 'end', 'month', 'front (d)', 'total (d)', 'frac'))
    
    for start, end in zip(months, months[1:]):
        front_month = start.month
    
        # number of days in first month (e.g., Jan)
        front_month_days = start.days_in_month
    
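        # number of days in current sliding window (e.g., Jan + Feb);
        # measure up to the start of the following month, because end.end_time
        # is the last nanosecond of the month and .days would come out one short
        days_in_curr_window = ((end + 1).start_time - start.start_time).days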
    
        frac = front_month_days / days_in_curr_window
    
        print('{:10s} {:10s} {:10d} {:10d} {:10d} {:10.3f}'.format(
            str(start), str(end), front_month,
            front_month_days, days_in_curr_window, frac))
    
    
    start      end             month  front (d)  total (d)       frac 
    2020-01    2020-02             1         31         60      0.517
    2020-02    2020-03             2         29         60      0.483
    2020-03    2020-04             3         31         61      0.508
    2020-04    2020-05             4         30         61      0.492
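
  • For the EDIT (more than one month in the numerator, and rows instead of calendar days), one possible approach is to count the rows per month first and then slide a window over those counts. The following is only a sketch under assumptions: the DataFrame df, its hourly index for Jan to May 2010, and the helper name front_fractions are all made up for illustration and stand in for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real data: one row every 60 minutes,
# from 01-Jan-2010 00:00 through 31-May-2010 23:00.
idx = pd.date_range("2010-01-01", "2010-05-31 23:00", freq="60min")
df = pd.DataFrame({"column": 0.0}, index=idx)

# Number of rows in each calendar month, indexed by monthly Period.
rows_per_month = df.groupby(df.index.to_period("M")).size()


def front_fractions(counts, window=3, front=2):
    """For each sliding window of `window` consecutive months, return the
    share of rows that fall in the first `front` months of that window.

    Illustrative helper (not a pandas function).
    """
    out = {}
    for i in range(len(counts) - window + 1):
        win = counts.iloc[i:i + window]
        label = f"{win.index[0]}..{win.index[-1]}"
        out[label] = win.iloc[:front].sum() / win.sum()
    return pd.Series(out)


fracs = front_fractions(rows_per_month, window=3, front=2)
print(fracs)
```

    With window=2 and front=1 this reduces to the two-month case above, measured in rows rather than days. If calendar days are preferred, the same helper should work on a Series built from days_in_month (for example pd.Series(months.days_in_month, index=months)) instead of the row counts.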