python pandas pandas-groupby cumulative-sum

Group streetnames depends on hops/steps in the numeration

i have a df that has 4 values. they are the name of street and the numeration(start,end) and if it is odd or even. what i need is to group the numerations of a specific streetname between the min column and the end column, but if there is a gap greater than 100(that means the numbers arent use in that street) it has to group the next range in a separate series with the same street. this df is like this because each row represents an entire block in my city, but i want to leave only the ranges of numbers that street has, and to know the extend of the street.

What i managed to do so far is a column to count the changes of step called "group_ids", and then group the dataframe depends of "street", "description", "end", and "group_ids". but i cant make the code to work properly .

street	descripcion	start	end	group_ids
1 DE MAYO	odd	2201	2299	0
1 DE MAYO	odd	2301	2399	0
1 DE MAYO	odd	2401	2499	0
1 DE MAYO	odd	2501	2599	1
1 DE MAYO	odd	2801	2899	1
1 DE MAYO	odd	2901	2949	1

in this table it should change the value in "group_ids" when the value "end" is 2899 but it changes before in 2499

this the code that i have so far and added the line where i do the groupby

df = df.sort_values(["street","descripcion","start","end"],ascending=[True,True,True,True])
df["group_ids"] = (df.end+100 < df.end.shift(periods=-1)).cumsum()
df = df.groupby(["street","descripcion","group_ids"],as_index=False).agg({"start":"min","end":"max"})

this the link of the excel https://docs.google.com/spreadsheets/d/1leNMvDVwU9BHyOClxzDq07RWlS63zwbKiuoIPI5Tnv8/edit?usp=sharing

this is the output that i get

street	descripcion	group_ids	start	stop
1 DE MAYO	odd	0	801	2499
1 DE MAYO	odd	1	2501	4399
1 DE MAYO	odd	2	4401	4799
1 DE MAYO	even	2	800	2498
1 DE MAYO	even	3	2500	4398
1 DE MAYO	even	4	4400	4798

but it should be

street	descripcion	group_ids	start	end
1 DE MAYO	odd	0	801	2599
1 DE MAYO	odd	1	2801	4499
1 DE MAYO	odd	2	4601	4799
1 DE MAYO	even	2	800	2598
1 DE MAYO	even	3	2800	4498
1 DE MAYO	even	4	4600	4798

so this the solution that i found, maybe its not the most efficient, but here it is:

df["diff"]= df["start"] - df["end"].shift()

df["step"]=( df["diff"] != 2).cumsum()

df=df.groupby(["street","descripcion","step"],as_index=False).agg({"start":"min","end":"max"})

Solution

The problem is the grouping boundaries are not just based on end.

df = df.sort_values(["street","descripcion","start","end"],ascending=[True,True,True,True])
df

       street descripcion  start   end
43  1 DE MAYO        even    800   898
44  1 DE MAYO        even    900   998
45  1 DE MAYO        even   1000  1098
46  1 DE MAYO        even   1100  1198
47  1 DE MAYO        even   1200  1298
..        ...         ...    ...   ...
38  1 DE MAYO         odd   4301  4399
39  1 DE MAYO         odd   4401  4499
40  1 DE MAYO         odd   4601  4649
41  1 DE MAYO         odd   4651  4699
42  1 DE MAYO         odd   4701  4799

The boundaries are subject to three conditions.

c1 = df['end'].diff() > 100
c2 = df['street'] == df['street'].shift(-1)
c3 = df['descripcion'] == df['descripcion'].shift(-1)

chain these conditions and cumsum():

df['group_ids'] = (c1*c2*c3).cumsum()

Assign to another dataframe for better observation:

df2 = df.groupby(["street","descripcion","group_ids"],as_index=False).agg({"start":"min","end":"max"})
df2

According to your sorting from df at the beginning, the first group_ids=0 corresponds to 1 DE MAYO,even

If we wanna get the same result from your post, the sorting part should become:

df = df.sort_values(["street","descripcion","start","end"],ascending=[True,False,True,True])

Sort the result df2 as well.

df2 = df2.sort_values(["street","descripcion","start","end"],ascending=[True,False,True,True]).reset_index(drop=True)
df2