i have a df that has 4 values. they are the name of street and the numeration(start,end) and if it is odd or even. what i need is to group the numerations of a specific streetname between the min column and the end column, but if there is a gap greater than 100(that means the numbers arent use in that street) it has to group the next range in a separate series with the same street. this df is like this because each row represents an entire block in my city, but i want to leave only the ranges of numbers that street has, and to know the extend of the street.
What i managed to do so far is a column to count the changes of step called "group_ids", and then group the dataframe depends of "street", "description", "end", and "group_ids". but i cant make the code to work properly .
street | descripcion | start | end | group_ids |
---|---|---|---|---|
1 DE MAYO | odd | 2201 | 2299 | 0 |
1 DE MAYO | odd | 2301 | 2399 | 0 |
1 DE MAYO | odd | 2401 | 2499 | 0 |
1 DE MAYO | odd | 2501 | 2599 | 1 |
1 DE MAYO | odd | 2801 | 2899 | 1 |
1 DE MAYO | odd | 2901 | 2949 | 1 |
in this table it should change the value in "group_ids" when the value "end" is 2899 but it changes before in 2499
this the code that i have so far and added the line where i do the groupby
df = df.sort_values(["street","descripcion","start","end"],ascending=[True,True,True,True])
df["group_ids"] = (df.end+100 < df.end.shift(periods=-1)).cumsum()
df = df.groupby(["street","descripcion","group_ids"],as_index=False).agg({"start":"min","end":"max"})
this the link of the excel https://docs.google.com/spreadsheets/d/1leNMvDVwU9BHyOClxzDq07RWlS63zwbKiuoIPI5Tnv8/edit?usp=sharing
this is the output that i get
street | descripcion | group_ids | start | stop |
---|---|---|---|---|
1 DE MAYO | odd | 0 | 801 | 2499 |
1 DE MAYO | odd | 1 | 2501 | 4399 |
1 DE MAYO | odd | 2 | 4401 | 4799 |
1 DE MAYO | even | 2 | 800 | 2498 |
1 DE MAYO | even | 3 | 2500 | 4398 |
1 DE MAYO | even | 4 | 4400 | 4798 |
but it should be
street | descripcion | group_ids | start | end |
---|---|---|---|---|
1 DE MAYO | odd | 0 | 801 | 2599 |
1 DE MAYO | odd | 1 | 2801 | 4499 |
1 DE MAYO | odd | 2 | 4601 | 4799 |
1 DE MAYO | even | 2 | 800 | 2598 |
1 DE MAYO | even | 3 | 2800 | 4498 |
1 DE MAYO | even | 4 | 4600 | 4798 |
so this the solution that i found, maybe its not the most efficient, but here it is:
df["diff"]= df["start"] - df["end"].shift()
df["step"]=( df["diff"] != 2).cumsum()
df=df.groupby(["street","descripcion","step"],as_index=False).agg({"start":"min","end":"max"})
The problem is the grouping boundaries are not just based on end
.
df = df.sort_values(["street","descripcion","start","end"],ascending=[True,True,True,True])
df
street descripcion start end
43 1 DE MAYO even 800 898
44 1 DE MAYO even 900 998
45 1 DE MAYO even 1000 1098
46 1 DE MAYO even 1100 1198
47 1 DE MAYO even 1200 1298
.. ... ... ... ...
38 1 DE MAYO odd 4301 4399
39 1 DE MAYO odd 4401 4499
40 1 DE MAYO odd 4601 4649
41 1 DE MAYO odd 4651 4699
42 1 DE MAYO odd 4701 4799
The boundaries are subject to three conditions.
c1 = df['end'].diff() > 100
c2 = df['street'] == df['street'].shift(-1)
c3 = df['descripcion'] == df['descripcion'].shift(-1)
chain these conditions and cumsum()
:
df['group_ids'] = (c1*c2*c3).cumsum()
Assign to another dataframe for better observation:
df2 = df.groupby(["street","descripcion","group_ids"],as_index=False).agg({"start":"min","end":"max"})
df2
According to your sorting from
df
at the beginning, the first group_ids
=0 corresponds to
1 DE MAYO
,even
If we wanna get the same result from your post, the sorting part should become:
df = df.sort_values(["street","descripcion","start","end"],ascending=[True,False,True,True])
Sort the result df2
as well.
df2 = df2.sort_values(["street","descripcion","start","end"],ascending=[True,False,True,True]).reset_index(drop=True)
df2