Search code examples
pythonpandasnumpytime-seriesseries

Calculating an average, when only some "checkpoints" are known


I'm trying to wrap my head around pandas/numpy to do something that should probably be simple (but my lack of knowledge...)

The series looks something like this:

    2022-01-01 | 100
    2022-01-02 | nan
    2022-01-03 | nan
    2022-01-04 | 200
    2022-01-05 | nan
    2022-01-06 | nan
    2022-01-07 | nan
    2022-01-08 | 250
    2022-01-09 | nan
    2022-01-10 | 400
    2022-01-11 | nan

This represents a consumption. So, on the example above, on the 8 of January, we see that it became 250 and the last known value was 200 on the 4th of January. So, we know that on that period, the average was 12.5 per day. What I want to obtain is exactly this: the daily average (retrospectively). So, from the example above, that would be the result I'm looking for:

    2022-01-01 | 33.3   | 100
    2022-01-02 | 33.3   | 133.3
    2022-01-03 | 33.3   | 166.6
    2022-01-04 | 12.5   | 200
    2022-01-05 | 12.5   | 212.5
    2022-01-06 | 12.5   | 225
    2022-01-07 | 12.5   | 237.5
    2022-01-08 | 75     | 250
    2022-01-09 | 75     | 325
    2022-01-10 | 0      | 400
    2022-01-11 | 0      | 400

The last column is just for checking that everything is cumulated as expected. But it's not needed, I'm fine with having a Series as output.

I've already tried twisting the Series in many ways, also doing it with the debugger. But alas, I think it would be better to ask someone who knows, along with some explanation so that I can understand better how to reach that result.


Solution

  • You're looking for pd.Series.interpolate

    >>> series
    date
    2022-01-01    100.0
    2022-01-02      NaN
    2022-01-03      NaN
    2022-01-04    200.0
    2022-01-05      NaN
    2022-01-06      NaN
    2022-01-07      NaN
    2022-01-08    250.0
    2022-01-09      NaN
    2022-01-10    400.0
    2022-01-11      NaN
    Name: value, dtype: float64
    
    >>> series.interpolate()
    date
    2022-01-01    100.000000
    2022-01-02    133.333333
    2022-01-03    166.666667
    2022-01-04    200.000000
    2022-01-05    212.500000
    2022-01-06    225.000000
    2022-01-07    237.500000
    2022-01-08    250.000000
    2022-01-09    325.000000
    2022-01-10    400.000000
    2022-01-11    400.000000
    Name: value, dtype: float64
    
    >>> series.interpolate().diff().shift(-1)
    date
    2022-01-01    33.333333
    2022-01-02    33.333333
    2022-01-03    33.333333
    2022-01-04    12.500000
    2022-01-05    12.500000
    2022-01-06    12.500000
    2022-01-07    12.500000
    2022-01-08    75.000000
    2022-01-09    75.000000
    2022-01-10     0.000000
    2022-01-11          NaN
    Name: value, dtype: float64