In my dataset I need to find when there are number of 0's repeated more than 280 times and get back the first row number of the start of the repetition in an array.I'm using python 3.11.
Sample data:
differences
0
0
0
0
0
0
0
.
.
.
5
5
.
.
0
0
0
0
or to create a sample dataset:
ACD=[0,5]
df2 = pd.DataFrame(np.repeat(ACD, 100, axis=0))
df3=df2.sample(frac=1,axis=1).sample(frac=1).reset_index(drop=True)
So far my code:
c=[]
for values,row in df.loc[:, ['differences']].iterrows():
i=0
while row['differences'] == 0:
count = sum(1 for i in row)
i +=1
if count > 280:
continue
c.append(np.where(row['differences']))
else:
values+=1
Expected output:
row_number_rep= [5,90,120] #showing the specific row numbers where the repetition stars.
With this code I get the error
<stdin>:8: DeprecationWarning: Calling nonzero on 0d arrays is deprecated, as it behaves surprisingly. Use
atleast_1d(arr).nonzero()
if the old behavior was intended.
I need help to improve this code. I think the problem is that I do not have first 280 items as 0's and I need to keep iterating over the whole column to find all the row number of beginning of 0's repeated 280 times.
Assuming this example and a threshold of 4
(instead of 280
):
df = pd.DataFrame({'differences': [0,0,0,0,0,1,2,0,3,0,0,0,0,0,0,4,0,5]})
differences
0 0 # 0: first stretch of >4
1 0
2 0
3 0
4 0
5 1
6 2
7 0
8 3
9 0 # 9: second stretch of >4
10 0
11 0
12 0
13 0
14 0
15 4
16 0
17 5
You could perform a groupby.size
to filter the groupby.first
:
thresh = 4
m = df['differences'].eq(0)
group = (~m).cumsum().to_numpy()
g = df.reset_index()[m].groupby(group[m])
g.size()
out = g['index'].first()[g.size()>thresh].to_numpy()
Output: array([ 0, 9])
lst = [0, 0, 0, 0, 0, 1, 2, 0, 3, 0, 0, 0, 0, 0, 0, 4, 0, 5, 0, 0, 0, 0, 0]
thresh = 4
start = -1
zeros = False
count = 0
out = []
for i, v in enumerate(lst+[-1]):
if v==0:
if not zeros:
count = 0
start = i
zeros = True
count += 1
continue
if count > thresh:
if zeros:
out.append(start)
zeros = False
out
# [0, 9, 18]
itertools.groupby
approach:from itertools import groupby
lst = [0, 0, 0, 0, 0, 1, 2, 0, 3, 0, 0, 0, 0, 0, 0, 4, 0, 5, 0, 0, 0, 0, 0]
thesh = 4
out = [x[0][0] for k,g in groupby(enumerate(lst), key=lambda x: x[1]==0)
if k and len(x:=list(g))>thresh]
# [0, 9, 18]