Search code examples
pythonpandasdate-range

Pandas how to get the first and last dates of all non-zero value ranges in a data frame?


I have a data frame that looks as follows:

df = pd.DataFrame({"A":[10,0,30,40,0,60,70,80,90]}, index = pd.date_range(start='1/1/2020', end='1/09/2020'))
df


            A
2020-01-01  10
2020-01-02  0
2020-01-03  30
2020-01-04  40
2020-01-05  0
2020-01-06  60
2020-01-07  70
2020-01-08  80
2020-01-09  90

I want to loop over the data frame to get t0 and t1 which represent the first and last dates, respectively (i.e. index value) for each range of non-zero values of A.

In the table above, I want to get the following values for t0 and t1:

t0 = 2020-01-01 , t1 = 2020-01-01

t0 = 2020-01-03 , t1 = 2020-01-04

t0 = 2020-01-06 , t1 = 2020-01-09

Is there a simple way to do this within Pandas?


Solution

  • One option would be to create a Boolean Index based on where A=0. Then groupby the cumsum of that Index and aggregate the first and last dates (assumes index is sorted as it is here):

    new_df = df.reset_index()
    m = new_df['A'].eq(0)
    new_df = (
        new_df.groupby(m.cumsum()[~m])
            .agg(t0=('index', 'first'), t1=('index', 'last'))
            .reset_index(drop=True)
    )
    
              t0         t1
    0 2020-01-01 2020-01-01
    1 2020-01-03 2020-01-04
    2 2020-01-06 2020-01-09
    

    An alternative to grab min and max index values instead of first and last:

    new_df = df.reset_index()
    m = new_df['A'].eq(0)
    new_df = (
        new_df.groupby(m.cumsum()[~m])
            .agg(t0=('index', 'min'), t1=('index', 'max'))
            .reset_index(drop=True)
    )
    
              t0         t1
    0 2020-01-01 2020-01-01
    1 2020-01-03 2020-01-04
    2 2020-01-06 2020-01-09
    

    Indexing:

    m:

    0    False
    1     True
    2    False
    3    False
    4     True
    5    False
    6    False
    7    False
    8    False
    Name: A, dtype: bool
    

    Create Groups Separated by where 0s are:

    m.cumsum()
    
    0    0
    1    1
    2    1
    3    1
    4    2
    5    2
    6    2
    7    2
    8    2
    Name: A, dtype: int32
    

    Self filtered to exclude 0 rows:

    m.cumsum()[~m]
    
    0    0
    2    1
    3    1
    5    2
    6    2
    7    2
    8    2
    Name: A, dtype: int32
    

    Then group these rows together to determine associated dates.