Python: Why using asfreq() is dropping my last row of data?

I can not figure it out why when I am using pandas asfreq() method it is droping my last row of data.

data = rental_2012_09_to_2020_06.copy()

data.drop(columns='postcode', inplace=True)

data.columns = ['Quarter_End', 'Rental_Median']

data.index = pd.to_datetime(data['Quarter_End'])

data = data.asfreq(freq='Q', method='ffill')

data.drop(columns='Quarter_End', inplace=True)

data.info()
data.head()

Result running with above code:

result with above code

Result running commenting the asfreq() line: result withouth freq()

Any ideas? I read the documentation but it does not give details about this behaviour. Thanks!

Solution

You resampled your time-series dataframe with quarterly frequency:

data = data.asfreq(freq='Q', method='ffill')

According to pandas documentation, the Q frequency alias stands for "quarter end frequency". The first entry in the original dataframe is from 1 September 2012, and the last one is from 1 June 2020.

I'm guessing that asfreq resamples the time range such that the resampled range falls within the original range (but as far as I can see, this isn't documented). Since the first quarter end after 01/09/2012 is 30/09/2012, and the last quarter end before 01/06/2020 is 31/03/2020, the resulting date range is 30/09/2012 to 31/03/2020 (both inclusive), with quarterly frequency, resulting in 31 samples. The fact that this is exactly one less sample than the original dataframe is merely a coincidence (in the sense that it depends on the original datetime range).

Edit: By digging around in pandas source code, I've found where exactly this behaviour is defined/documented. Buried deep in the call graph of asfreq, there is the generator generate_range (specifically, pandas.core.arrays.datetimes.generate_range, as of v1.1.2), which is the core functionality defining the datetime values in a range [start, end] with a certain time offset between values (i.e. frequency).

def generate_range(start=None, end=None, periods=None, offset=BDay()): ...

Its docstring specifies:

Notes
-----
* [...]
* If both start and end are specified, the returned dates will
satisfy start <= date <= end.

Response to comments

I think you might be misunderstanding what asfreq does. It does not simply shift around values to the nearest time that falls on the boundary of the frequency; instead it creates a whole new series, using the data from the original as if it had been sampled at certain timestamps (as specified by the frequency). That is, you're resampling your data.

Try removing the method='ffil' argument from the call: you'll see that the new series will be all NaNs (except in those cases where the timestamp coincides with a timestamp from the original series)—this is because it doesn't know what the value of a sample should be at an unknown timestamp.

import numpy as np
import pandas as pd
import datetime as dtm

index = pd.DatetimeIndex(data=[
    dtm.datetime(2019, 12, 1),
    dtm.datetime(2020, 2, 17),
    dtm.datetime(2020, 3, 31),
    dtm.datetime(2020, 6, 1),
])

series = pd.Series([1, 2, 3, 4], index=index)

>>> series
2019-12-01    1
2020-02-17    2
2020-03-31    3
2020-06-01    4
dtype: int64

>>> series.asfreq('Q')
2019-12-31    NaN
2020-03-31    3.0
Freq: Q-DEC, dtype: float64

This makes perfect sense if you think about it: how is pandas supposed to know what the value was on 2019-12-31 if there is no recorded value for that date in the original data?

Of course, having a series full of NaNs isn't very useful, so we need to find a way to infer the missing data from what's available. That's where filling methods like method='ffil' (copy the last available value) or method='bfil' (copy the next available value) come into play.

Now, going back to your question: the only reason there isn't a record for 2020-06-30 is because of what the docstring of generate_range above specifies:

[...] the returned dates will satisfy start <= date <= end

i.e., the transformed datetime range will always fall within the original range, and 2020-06-30 falls outside the range (since the end of the original range is 2020-06-01).

Again, this also makes sense taking into account what we have discussed just now: usually you want to infer missing values in your resampled series (using values from your original series), and it's always easier (and safer) to interpolate (i.e. guess what the value would be at a timestep between two other values), than to extrapolate (i.e. guess values outside of the original range, either before the start or after the end), and the former is only guaranteed if you keep the new, resampled range within the original range—not to mention that 'ffil' and 'bfil' both require this condition in order to work properly (you can't forward-fill the first value if it comes before start and you can't backwards-fill the last value if it comes after end).

If this is not the behaviour you want, and instead you want to simply shift the timesteps, you have to do something else entirely. You can add an offset to the index, for example:

index = pd.date_range(
    dtm.datetime(2019, 9, 1), 
    dtm.datetime(2020, 6, 1), 
    freq=pd.DateOffset(months=3)
)

series = pd.Series([1, 2, 3, 4], index=index)

>>> index
DatetimeIndex(['2019-09-01', '2019-12-01', '2020-03-01', '2020-06-01'], dtype='datetime64[ns]', freq='<DateOffset: months=3>')

>>> index + pd.tseries.offsets.MonthEnd()
DatetimeIndex(['2019-09-30', '2019-12-31', '2020-03-31', '2020-06-30'], dtype='datetime64[ns]', freq='<DateOffset: months=3>')