Search code examples
pythonpandaspandas-groupbycumulative-sum

Group streetnames depends on hops/steps in the numeration


i have a df that has 4 values. they are the name of street and the numeration(start,end) and if it is odd or even. what i need is to group the numerations of a specific streetname between the min column and the end column, but if there is a gap greater than 100(that means the numbers arent use in that street) it has to group the next range in a separate series with the same street. this df is like this because each row represents an entire block in my city, but i want to leave only the ranges of numbers that street has, and to know the extend of the street.

What i managed to do so far is a column to count the changes of step called "group_ids", and then group the dataframe depends of "street", "description", "end", and "group_ids". but i cant make the code to work properly .

street descripcion start end group_ids
1 DE MAYO odd 2201 2299 0
1 DE MAYO odd 2301 2399 0
1 DE MAYO odd 2401 2499 0
1 DE MAYO odd 2501 2599 1
1 DE MAYO odd 2801 2899 1
1 DE MAYO odd 2901 2949 1

in this table it should change the value in "group_ids" when the value "end" is 2899 but it changes before in 2499

this the code that i have so far and added the line where i do the groupby

df = df.sort_values(["street","descripcion","start","end"],ascending=[True,True,True,True])
df["group_ids"] = (df.end+100 < df.end.shift(periods=-1)).cumsum()
df = df.groupby(["street","descripcion","group_ids"],as_index=False).agg({"start":"min","end":"max"})

this the link of the excel https://docs.google.com/spreadsheets/d/1leNMvDVwU9BHyOClxzDq07RWlS63zwbKiuoIPI5Tnv8/edit?usp=sharing

this is the output that i get

street descripcion group_ids start stop
1 DE MAYO odd 0 801 2499
1 DE MAYO odd 1 2501 4399
1 DE MAYO odd 2 4401 4799
1 DE MAYO even 2 800 2498
1 DE MAYO even 3 2500 4398
1 DE MAYO even 4 4400 4798

but it should be

street descripcion group_ids start end
1 DE MAYO odd 0 801 2599
1 DE MAYO odd 1 2801 4499
1 DE MAYO odd 2 4601 4799
1 DE MAYO even 2 800 2598
1 DE MAYO even 3 2800 4498
1 DE MAYO even 4 4600 4798

so this the solution that i found, maybe its not the most efficient, but here it is:

df["diff"]= df["start"] - df["end"].shift()

df["step"]=( df["diff"] != 2).cumsum()

df=df.groupby(["street","descripcion","step"],as_index=False).agg({"start":"min","end":"max"})


Solution

  • The problem is the grouping boundaries are not just based on end.

    df = df.sort_values(["street","descripcion","start","end"],ascending=[True,True,True,True])
    df
    
           street descripcion  start   end
    43  1 DE MAYO        even    800   898
    44  1 DE MAYO        even    900   998
    45  1 DE MAYO        even   1000  1098
    46  1 DE MAYO        even   1100  1198
    47  1 DE MAYO        even   1200  1298
    ..        ...         ...    ...   ...
    38  1 DE MAYO         odd   4301  4399
    39  1 DE MAYO         odd   4401  4499
    40  1 DE MAYO         odd   4601  4649
    41  1 DE MAYO         odd   4651  4699
    42  1 DE MAYO         odd   4701  4799
    

    The boundaries are subject to three conditions.

    c1 = df['end'].diff() > 100
    c2 = df['street'] == df['street'].shift(-1)
    c3 = df['descripcion'] == df['descripcion'].shift(-1)
    

    chain these conditions and cumsum():

    df['group_ids'] = (c1*c2*c3).cumsum()
    

    Assign to another dataframe for better observation:

    df2 = df.groupby(["street","descripcion","group_ids"],as_index=False).agg({"start":"min","end":"max"})
    df2
    

    enter image description here According to your sorting from df at the beginning, the first group_ids=0 corresponds to 1 DE MAYO,even





    If we wanna get the same result from your post, the sorting part should become:

    df = df.sort_values(["street","descripcion","start","end"],ascending=[True,False,True,True])
    

    Sort the result df2 as well.

    df2 = df2.sort_values(["street","descripcion","start","end"],ascending=[True,False,True,True]).reset_index(drop=True)
    df2
    

    enter image description here