python pandas categorical-data pandas-resample

Pandas: Is it possible to down-sample a categorical column?

Consider a DataFrame log like this one:

>>> log
                           state
date_time                       
2020-01-01 00:00:00            0
2020-01-01 00:01:00            0
2020-01-01 00:02:00            0
2020-01-01 00:03:00            1
2020-01-01 00:04:00            1
2020-01-01 00:05:00            1

where the state column can be either 0 or 1 (or missing). If it is stored as UInt8 (the smallest numeric dtype that supports <NA>), the data can be down-sampled like this:

>>> log.resample(dt.timedelta(minutes=2)).mean()
                           state
date_time                       
2020-01-01 00:00:00          0.0
2020-01-01 00:02:00          0.5
2020-01-01 00:04:00          1.0
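
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame shown above. The construction itself (date_range with a one-minute frequency, the variable name means) is my assumption about how such a log could be created, not part of the real data pipeline:

```python
import datetime as dt

import pandas as pd

# Rebuild the six one-minute samples shown above (construction is an assumption).
index = pd.date_range("2020-01-01 00:00:00", periods=6, freq="min", name="date_time")
log = pd.DataFrame({"state": pd.array([0, 0, 0, 1, 1, 1], dtype="UInt8")}, index=index)

# Down-sample to two-minute bins and average each bin.
means = log.resample(dt.timedelta(minutes=2)).mean()
print(means)
```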

The resampling works fine, but the value 0.5 makes no sense, since state can only be 0 or 1. For the same reason, it makes sense to use category as the dtype for this column. In that case, however, the resampling does not work, because mean() is only applicable to numerical data.

That makes perfect sense. However, I can imagine a down-sampling and averaging procedure for categorical data in which the result is the group's value as long as all values in the group are identical, and <NA> otherwise, like:

categorical_average(['apple', 'apple']) -> 'apple'
categorical_average(['pear', 'pear']) -> 'pear'
categorical_average(['apple', 'pear']) -> <NA>

For the DataFrame log above, with state as a category column, this would result in:

>>> log.resample(dt.timedelta(minutes=2)).probably_some_other_method()
                         state
date_time                       
2020-01-01 00:00:00          0
2020-01-01 00:02:00       <NA>
2020-01-01 00:04:00          1

BTW, I am using resample().mean() because there are many other (numerical) columns where it makes perfect sense; I just did not mention them explicitly here for simplicity.

Solution

Use a custom function with if-else that tests whether all values in the group are identical:

# Return the group's value if it contains exactly one distinct value, otherwise <NA>
f = lambda x: x.iat[0] if x.nunique(dropna=False) == 1 else pd.NA
a = log.resample(dt.timedelta(minutes=2)).agg({'state': f})
print(a)
                    state
date_time                
2020-01-01 00:00:00     0
2020-01-01 00:02:00  <NA>
2020-01-01 00:04:00     1
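
Since the question mentions other numerical columns that should still be averaged, the same idea can be combined with mean() in a single agg call. The following is only a sketch under assumptions: the helper name same_or_na and the temperature column are hypothetical, and the frame is rebuilt here so the snippet runs on its own. A group containing a missing value next to a real one is treated as mixed and yields a missing result.

```python
import datetime as dt

import pandas as pd


def same_or_na(values: pd.Series):
    """Return the value shared by the whole group, or pd.NA if the group is empty or mixed."""
    # dropna=False counts a missing value as its own distinct value,
    # so a group such as [0, <NA>] is treated as mixed.
    if values.nunique(dropna=False) == 1:
        return values.iat[0]
    return pd.NA


index = pd.date_range("2020-01-01 00:00:00", periods=6, freq="min", name="date_time")
log = pd.DataFrame(
    {
        "state": pd.Categorical([0, 0, 0, 1, 1, 1]),
        # Hypothetical numerical column standing in for the "other" columns.
        "temperature": [20.0, 21.0, 22.0, 23.0, 24.0, 25.0],
    },
    index=index,
)

# Per-column aggregation: custom rule for the categorical column, mean for the numeric one.
resampled = log.resample(dt.timedelta(minutes=2)).agg(
    {"state": same_or_na, "temperature": "mean"}
)
print(resampled)
```

How the missing entry is displayed (<NA> or NaN) and the resulting dtype of the state column can depend on the pandas version and on the column's original dtype, so it is safer to check for it with pd.isna than to compare against a particular marker.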