Search code examples
pythonpandasdataframe

How to identify consecutive repeating values in data frame column?


so I am trying to figure out how I can identify consecutive repeating values in a data frame column in python, and then be able to set a number for how many consecutive repeating values I am looking for. I will explain further here.

I have the following data frame:

DateTime                 Value         
-------------------------------
2015-03-11 06:00:00          1               
2015-03-11 07:00:00          1               
2015-03-11 08:00:00          1               
2015-03-11 09:00:00          1               
2015-03-11 10:00:00          0               
2015-03-11 11:00:00          0               
2015-03-11 12:00:00          0               
2015-03-11 13:00:00          0               
2015-03-11 14:00:00          0               
2015-03-11 15:00:00          0               
...

Now I have the following question: In the "Value" column, is there ever an instance where there are "2" or more consecutive "0" values? Yes! Now I want to return a "True".

Now I have this data frame:

DateTime                 Value         
-------------------------------
2015-03-11 06:00:00          1               
2015-03-11 07:00:00          1               
2015-03-11 08:00:00          0               
2015-03-11 09:00:00          0               
2015-03-11 10:00:00          1               
2015-03-11 11:00:00          0               
2015-03-11 12:00:00          0               
2015-03-11 13:00:00          0               
2015-03-11 14:00:00          1               
2015-03-11 15:00:00          1               
...

Now I have the following question: In the "Value" column, is there ever an instance where there are "3" or more consecutive "0" values? Yes! Now I want to return a "True".

And of course, if the answer is "No", then I would want to return a "False"

How can this be done in python? What is this process even called? How can you set this so that you can change the number of consecutive values being looked for?


Solution

  • To detect consecutive runs in the series, we first detect the turning points by looking at the locations where difference with previous entry isn't 0. Then cumulative sum of this marks the groups:

    # for the second frame
    >>> consecutives = df["Value"].diff().ne(0).cumsum()
    >>> consecutives
    
    0    1
    1    1
    2    2
    3    2
    4    3
    5    4
    6    4
    7    4
    8    5
    9    5
    

    But since you're interested in a particular value's consecutive runs (e.g., 0), we can mask the above to put NaNs wherever we don't have 0 in the original series:

    >>> masked_consecs = consecutives.mask(df["Value"].ne(0))
    >>> masked_consecs
    
    0    NaN
    1    NaN
    2    2.0
    3    2.0
    4    NaN
    5    4.0
    6    4.0
    7    4.0
    8    NaN
    9    NaN
    

    Now we can group by this series and look at the groups' sizes:

    >>> consec_sizes = df["Value"].groupby(masked_consecs).size().to_numpy()
    >>> consec_sizes
    
    array([2, 3])
    

    The final decision can be made with the threshold given (e.g., 2) to see if any of the sizes satisfy that:

    >>> is_okay = (consec_sizes >= 2).any()
    >>> is_okay
    True
    

    Now we can wrap this procedure in a function for reusability:

    def is_consec_found(series, value=0, threshold=2):
        # mark consecutive groups
        consecs = series.diff().ne(0).cumsum()
    
        # disregard those groups that are not of `value`
        masked_consecs = consecs.mask(series.ne(value))
    
        # get size of each
        consec_sizes = series.groupby(masked_consecs).size().to_numpy()
    
        # check sizes agains the threshold
        is_okay = (consec_sizes >= threshold).any()
    
        # whether a suitable sequence is found or not
        return is_okay
    

    and we can run it as:

    # these are all for the second dataframe you posted
    >>> is_consec_found(df["Value"], value=0, threshold=2)
    True
    
    >>> is_consec_found(df["Value"], value=0, threshold=5)
    False
    
    >>> is_consec_found(df["Value"], value=1, threshold=2)
    True
    
    >>> is_consec_found(df["Value"], value=1, threshold=3)
    False