I am trying to figure out how to identify consecutive repeating values in a data frame column in Python, and I want to be able to set the number of consecutive repeating values I am looking for. I will explain further here.
I have the following data frame:
DateTime Value
-------------------------------
2015-03-11 06:00:00 1
2015-03-11 07:00:00 1
2015-03-11 08:00:00 1
2015-03-11 09:00:00 1
2015-03-11 10:00:00 0
2015-03-11 11:00:00 0
2015-03-11 12:00:00 0
2015-03-11 13:00:00 0
2015-03-11 14:00:00 0
2015-03-11 15:00:00 0
...
Now I have the following question: In the "Value" column, is there ever an instance where there are "2" or more consecutive "0" values? Yes! Now I want to return a "True".
Now I have this data frame:
DateTime Value
-------------------------------
2015-03-11 06:00:00 1
2015-03-11 07:00:00 1
2015-03-11 08:00:00 0
2015-03-11 09:00:00 0
2015-03-11 10:00:00 1
2015-03-11 11:00:00 0
2015-03-11 12:00:00 0
2015-03-11 13:00:00 0
2015-03-11 14:00:00 1
2015-03-11 15:00:00 1
...
Now I have the following question: In the "Value" column, is there ever an instance where there are "3" or more consecutive "0" values? Yes! Now I want to return a "True".
And of course, if the answer is "No", then I would want to return a "False".
How can this be done in Python? What is this process even called? How can I make the number of consecutive values being looked for configurable?
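For reference, the snippets below use the second data frame you posted; it can be reconstructed like this so everything is reproducible (the values are copied from the post):

```python
import pandas as pd

# rebuild the second example frame (values copied from the post)
df = pd.DataFrame(
    {
        "DateTime": pd.date_range("2015-03-11 06:00:00", periods=10, freq="h"),
        "Value": [1, 1, 0, 0, 1, 0, 0, 0, 1, 1],
    }
)
print(df["Value"].tolist())
```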
To detect consecutive runs in the series (this kind of grouping of repeated values is closely related to run-length encoding), we first detect the turning points by looking at the locations where the difference from the previous entry isn't 0. The cumulative sum of this then marks the groups:
# for the second frame
>>> consecutives = df["Value"].diff().ne(0).cumsum()
>>> consecutives
0 1
1 1
2 2
3 2
4 3
5 4
6 4
7 4
8 5
9 5
But since you're interested in the consecutive runs of one particular value (e.g., 0), we can mask the above to put NaNs wherever we don't have 0 in the original series:
>>> masked_consecs = consecutives.mask(df["Value"].ne(0))
>>> masked_consecs
0 NaN
1 NaN
2 2.0
3 2.0
4 NaN
5 4.0
6 4.0
7 4.0
8 NaN
9 NaN
Now we can group by this series and look at the groups' sizes:
>>> consec_sizes = df["Value"].groupby(masked_consecs).size().to_numpy()
>>> consec_sizes
array([2, 3])
The final decision can be made by comparing the sizes against the given threshold (e.g., 2) and checking whether any of them satisfies it:
>>> is_okay = (consec_sizes >= 2).any()
>>> is_okay
True
Now we can wrap this procedure in a function for reusability:
def is_consec_found(series, value=0, threshold=2):
    # mark consecutive groups
    consecs = series.diff().ne(0).cumsum()
    # disregard the groups that are not runs of `value`
    masked_consecs = consecs.mask(series.ne(value))
    # get the size of each group
    consec_sizes = series.groupby(masked_consecs).size().to_numpy()
    # check the sizes against the threshold
    is_okay = (consec_sizes >= threshold).any()
    # whether a suitable run was found or not
    return is_okay
and we can run it as:
# these are all for the second dataframe you posted
>>> is_consec_found(df["Value"], value=0, threshold=2)
True
>>> is_consec_found(df["Value"], value=0, threshold=5)
False
>>> is_consec_found(df["Value"], value=1, threshold=2)
True
>>> is_consec_found(df["Value"], value=1, threshold=3)
False