Search code examples
pythonpython-3.xpandasdataframeresampling

Unexpected number of bins in Pandas DataFrame resample


Problem

I need to reduce the length of a DataFrame to some externally defined integer (could be two rows, 10,000 rows, etc., but will always be a reduction in overall length), but I also want to keep the resulting DataFrame representative of the original. The original DataFrame (we'll call it df) has a datetime column (utc_time) and a data value column (data_value). The datetimes are always sequential, non-repeating, though not evenly spaced (i.e., data might be "missing"). For the DataFrame in this example, the timestamps are at ten minute intervals (when data is present).

Attempts

To accomplish this, my mind immediately went to resampling with the following logic: find the difference in seconds between the first and last timestamps, divide that by the desired final length, and that's the resampling factor. I set this up here:

# Define the desired final length.
final_length = 2
# Define the first timestamp.
first_timestamp = df['utc_time'].min().timestamp()
# Define the last timestamp.
last_timestamp = df['utc_time'].max().timestamp()
# Define the difference in seconds between the first and last timestamps.
delta_t = last_timestamp - first_timestamp
# Define the resampling factor.
resampling_factor = np.ceil(delta_t / final_length)

# Set the index from the `utc_time` column so that we can resample nicely.
df.set_index('utc_time', drop=True, inplace=True)
# Do the resampling.
resamp = df.resample(f'{resampling_factor}S')

To look at resamp, I simply looped and printed:

for i in resamp:
    print(i)

This yielded (with some cleanup on my part) the following:

 utc_time                  data_value
 2016-09-28 21:10:00       140.0
 2016-09-28 21:20:00       250.0
 2016-09-28 21:30:00       250.0
 2016-09-28 21:40:00       240.0
 2016-09-28 21:50:00       240.0
 ...                         ...
 2018-08-06 13:00:00       240.0
 2018-08-06 13:10:00       240.0
 2018-08-06 13:20:00       240.0
 2018-08-06 13:30:00       240.0
 2018-08-06 13:40:00       230.0
 
 [69889 rows x 1 columns])

 utc_time                  data_value
 2018-08-06 13:50:00       230.0
 2018-08-06 14:00:00       230.0
 2018-08-06 14:10:00       230.0
 2018-08-06 14:20:00       230.0
 2018-08-06 14:30:00       230.0
 ...                         ...
 2020-06-14 02:50:00       280.0
 2020-06-14 03:00:00       280.0
 2020-06-14 03:10:00       280.0
 2020-06-14 03:20:00       280.0
 2020-06-14 03:30:00       280.0
 
 [97571 rows x 1 columns])

 utc_time                  data_value
 2020-06-14 03:40:00       280.0
 2020-06-14 03:50:00       280.0
 2020-06-14 04:00:00       280.0
 2020-06-14 04:10:00       280.0
 2020-06-14 04:20:00       280.0
 ...                         ...
 2020-06-15 00:10:00       280.0
 2020-06-15 00:20:00       270.0
 2020-06-15 00:30:00       270.0
 2020-06-15 00:40:00       270.0
 2020-06-15 00:50:00       280.0

 [128 rows x 1 columns])

As one can see, this produced three bins rather than the two I expected.

I could do something different, like changing the way I choose the resampling factor (e.g., finding the average time between timestamps, and multiplying that by (length of DataFrame / final_length) should yield a more conservative resampling factor), but that would, to my mind, be a mask to the underlying issue. Mainly, I'd love to understand why what's happening is happening. Which leads to...

Question

Does anyone know why this is happening, and what steps I might take to ensure we get the desired number of bins? I wonder if it's an offsetting issue - that is, although we see the first timestamp in the first bin as the first timestamp from the DataFrame, perhaps pandas is actually starting the bin before then?

For anyone who'd like to play along at home, the test DataFrame can be found here as a .csv. To get it in as a DataFrame:

df = pd.read_csv('test.csv', parse_dates=[0])


Solution

  • Summary

    • Problem 1 & fix: The way you form the bins will make one extra bin since the bins created with df.resample() will be closed only on one end (left or right). Fix this with one of options listed in "1.".

    • Problem 2 & fix: The first bin left edge is at the start of that day ('2016-09-28 00:00:00') (See "2."). You can fix it by using kind='period' as argument to resample(). (See "3.")

    1. Having a glance at the input data (& what kind of bins we need)

    The input data is from 2016-09-28 21:10:00 to 2020-06-15 00:50:00, and using the resampling_factor you have, we get:

    In [63]: df.index.min()
    Out[63]: Timestamp('2016-09-28 21:10:00')
    
    In [64]: df.index.min() + pd.Timedelta(f'{resampling_factor}S')
    Out[64]: Timestamp('2018-08-07 11:00:00')
    
    In [65]: _ + pd.Timedelta(f'{resampling_factor}S')
    Out[65]: Timestamp('2020-06-15 00:50:00')
    

    To partition data into two pieces with these timestamps, we would need bins to be

    • ['2016-09-28 21:10:00', '2018-08-07 11:00:00')
    • ['2018-08-07 11:00:00', '2020-06-15 00:50:00']

    (The [ means closed end and ( means open end)

    • Here is one problem: You can not form bins that are closed from both ends. You will have to decide if you want to close the bins from left or right (argument closed='left'|'right',). With closed='left' you would have
      • ['2016-09-28 21:10:00', '2018-08-07 11:00:00')
      • ['2018-08-07 11:00:00', '2020-06-15 00:50:00')
      • ['2020-06-15 00:50:00', '2022-04-23 14:40:00') (only one entry here)

    Possible fixes:

    1. Adjust your last timestamp by adding some time into it:
        last_timestamp = (df['utc_time'].max() +
                          pd.Timedelta('10 minutes')).timestamp()
    
    1. Make the resampling_factor a bit larger than you first calculated.
    2. Just use the first two dataframes from the df.resample and disregard the third which has only one or few entries

    Choose which makes most sense in your application.

    2. Looking at what we have now

    • From the df.resample docs, we know that the labels returned are the left bin edges
    • If we look the data, we see the what kind of labels there are now.
    
    In [67]: resamp = df.resample(f'{resampling_factor}S')
    
    In [68]: itr = iter(resamp)
    
    In [69]: next(itr)
    Out[69]:
    (Timestamp('2016-09-28 00:00:00', freq='58542600S'),
                          data_value
     utc_time
     2016-09-28 21:10:00       140.0
     ...                         ...
     2018-08-06 13:40:00       230.0
    
     [69889 rows x 1 columns])
    
    In [70]: next(itr)
    Out[70]:
    (Timestamp('2018-08-06 13:50:00', freq='58542600S'),
                          data_value
     utc_time
     2018-08-06 13:50:00       230.0
     ...                         ...
     2020-06-14 03:30:00       280.0
    
     [97571 rows x 1 columns])
    
    In [71]: next(itr)
    Out[71]:
    (Timestamp('2020-06-14 03:40:00', freq='58542600S'),
                          data_value
     utc_time
     2020-06-14 03:40:00       280.0
     ...                         ...
     2020-06-15 00:50:00       280.0
    
     [128 rows x 1 columns])
    
    
    • The bins are therefore
      • ['2016-09-28 00:00:00', '2018-08-06 13:50:00')
      • ['2018-08-06 13:50:00', '2020-06-14 03:40:00')
      • ['2020-06-14 03:40:00', '2022-04-22 17:30:00') (Endpoint calculated by adding resampling_factor to the beginning of the bin.)
    • We see that the first bin does not start from the df['utc_time'].min (2016-09-28 21:10:00), but it starts from the beginning of that day (as you guessed)
    • Since the first bin starts before intended, we have data outside two bins, in a third bin.

    3. Fixing the starting bin left edge

    The kind argument can be either 'timestamp' or 'period'. If you change it into 'period', you will have following bins (with closed='left'):

    • ['2016-09-28 21:10:00', '2018-08-07 11:00:00') <-- fixed
    • ['2018-08-07 11:00:00', '2020-06-15 00:50:00')
    • ['2020-06-15 00:50:00', '2022-04-23 14:40:00') (Remove with options given in "1.")