
Calculate statistics on subset of a dataframe based on values in dataframe (latitude and longitude)


I want to calculate summary statistics on subsets of a dataframe, where each subset is defined relative to the values in a given row.

For example, I have a dataframe that has latitude and longitude and number of people.

import pandas as pd

df = pd.DataFrame({'latitude': [40.991919 , 40.992001 , 40.991602, 40.989903, 40.987759],
                   'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                   'people': [1,2,3,4,5]})

I want to know the total number of people within 0.05 miles of each row. This is easy to compute with a loop, but it becomes unusably slow as the dataset grows.

Current/Sample:

from geopy.distance import distance

def distance_calc(row, focus_lat, focus_long):
    start = (row['latitude'], row['longitude'])
    stop = (focus_lat, focus_long)
    return distance(start, stop).miles

df['total_people_within_05'] = 0
df['total_rows_within_05'] = 0

for index, row in df.iterrows():
    focus_lat = row['latitude']
    focus_long = row['longitude']
    # distance in miles from every row to the current focus row
    dist = df.apply(lambda r: distance_calc(r, focus_lat, focus_long), axis=1)
    near = df.loc[dist <= .05]
    df.at[index, 'total_people_within_05'] = near['people'].sum()
    df.at[index, 'total_rows_within_05'] = near.shape[0]

Is there a more Pythonic way to do this?


Solution

    • Take the Cartesian product of the dataframe with itself to get all pairs of rows. This generates N^2 rows (25 in this case), so it will be expensive on larger datasets
    • calculate the distance for each of these pairs
    • filter with query() to the distances required
    • groupby() to get the total number of people. Also generate a list of the indexes included in each total, to help with transparency
    • finally join() the result back onto the original dataframe and you have what you want
    import geopy.distance as gd
    import pandas as pd
    
    df = pd.DataFrame({'latitude': [40.991919 , 40.992001 , 40.991602, 40.989903, 40.987759],
                       'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                       'people': [1,2,3,4,5]})
    
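    # cross join every row with every row (N^2 pairs), keeping both original indexes
    df = df.join((df.reset_index().assign(foo=1).merge(df.reset_index().assign(foo=1), on="foo")
     # distance in miles for each pair
     .assign(distance=lambda dfa: dfa.apply(lambda r: gd.distance((r.latitude_x,r.longitude_x),
                                                                     (r.latitude_y,r.longitude_y)).miles, axis=1))
     # keep close pairs; each row pairs with itself at distance 0, so its own people are counted
     .query("distance<=0.05")
     .rename(columns={"people_y":"nearby"})
     # total people per focus row, plus the list of rows that contributed
     .groupby("index_x").agg({"nearby":"sum","index_y":list})
    ))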
    
    print(df.to_markdown())
    
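    |    |   latitude |   longitude |   people |   nearby | index_y   |
    |---:|-----------:|------------:|---------:|---------:|:----------|
    |  0 |    40.9919 |    -106.049 |        1 |        6 | [0, 1, 2] |
    |  1 |    40.992  |    -106.049 |        2 |        6 | [0, 1, 2] |
    |  2 |    40.9916 |    -106.049 |        3 |        6 | [0, 1, 2] |
    |  3 |    40.9899 |    -106.05  |        4 |        4 | [3]       |
    |  4 |    40.9878 |    -106.049 |        5 |        5 | [4]       |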
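
If geodesic precision is not essential, the same totals can be sketched without building a merged dataframe, by computing all pairwise distances at once with NumPy broadcasting. This is a sketch under stated assumptions, not a drop-in replacement: it swaps geopy's distance for the haversine great-circle formula with a mean Earth radius of 3958.8 miles, and the names `totals_within` and `EARTH_RADIUS_MILES` are made up for illustration. It still builds an N x N matrix, so memory grows quadratically with the number of rows.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # assumption: mean Earth radius, spherical model


def totals_within(df, radius_miles=0.05):
    """Add people/row totals within radius_miles of each row (haversine)."""
    lat = np.radians(df["latitude"].to_numpy())
    lon = np.radians(df["longitude"].to_numpy())

    # broadcast to N x N matrices of pairwise differences
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]

    # haversine formula, evaluated for every pair at once
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    dist = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

    # boolean mask of close pairs; the diagonal is True, so each row counts itself
    within = dist <= radius_miles

    out = df.copy()
    out["total_people_within_05"] = within @ df["people"].to_numpy()
    out["total_rows_within_05"] = within.sum(axis=1)
    return out


df = pd.DataFrame({'latitude': [40.991919, 40.992001, 40.991602, 40.989903, 40.987759],
                   'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                   'people': [1, 2, 3, 4, 5]})
result = totals_within(df)
print(result)
```

On the sample data the pairwise distances are either well under or well over 0.05 miles, so the small difference between haversine and a geodesic distance should not change which rows are counted. Near the threshold it could.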

    Update - use combinations instead of Cartesian product

    It's been bugging me that a Cartesian product is a huge overhead, when all that is required is the distance between each unordered pair of rows: N(N-1)/2 pairs instead of N^2.

    • make use of itertools.combinations() to make a list of valid combinations of indexes
    • calculate distances between this minimum set
    • filter down to only distances we're interested in
    • now build permutations of this smaller set to provide a simple join to actual data
    • join and aggregate
    import itertools

    # get distances between all valid combinations
    dfd = (pd.DataFrame(list(itertools.combinations(df.index, 2)))
     .merge(df, left_on=0, right_index=True)
     .merge(df, left_on=1, right_index=True, suffixes=("_0","_1"))
     .assign(distance=lambda dfa: dfa.apply(lambda r: gd.distance((r.latitude_0,r.longitude_0),
                                                                     (r.latitude_1,r.longitude_1)).miles, axis=1))
     .loc[:,[0,1,"distance"]]
     # filter down to close proximities
     .query("distance <= 0.05")
    )
    
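    # build all valid permutations of close-by indexes
    # note: this pairs every index that appears in any close pair with every other such index,
    # which is only correct when all close rows are mutually close (true for this sample data)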
    dfnppl = (pd.DataFrame(itertools.permutations(pd.concat([dfd[0],dfd[1]]).unique(), 2))
     .merge(df.loc[:,"people"], left_on=1, right_index=True)
    )
    
    # bring it all together
    df = (df.reset_index().rename(columns={"index":0}).merge(dfnppl, on=0, suffixes=("","_near"), how="left")
     .groupby(0).agg({**{c:"first" for c in df.columns}, **{"people_near":"sum"}})
    )
    
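    |   0 |   latitude |   longitude |   people |   people_near |
    |----:|-----------:|------------:|---------:|--------------:|
    |   0 |    40.9919 |    -106.049 |        1 |             5 |
    |   1 |    40.992  |    -106.049 |        2 |             4 |
    |   2 |    40.9916 |    -106.049 |        3 |             3 |
    |   3 |    40.9899 |    -106.05  |        4 |             0 |
    |   4 |    40.9878 |    -106.049 |        5 |             0 |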
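
Unlike `nearby` in the first version, `people_near` here excludes the row's own people. The permutations step also treats every index found in any close pair as close to every other such index, which happens to hold for the sample data. A possible variant, sketched below, mirrors each close pair instead, so a row only counts the neighbours it was actually measured against. The function names, the haversine formula (used in place of geopy so the block runs on its own) and the `chain` test data are all assumptions for illustration.

```python
import itertools
import math

import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # assumption: mean Earth radius, spherical model


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def people_near(df, radius_miles=0.05):
    """Sum `people` over the other rows within radius_miles of each row."""
    # one distance per unordered pair of rows
    close = [
        (i, j)
        for i, j in itertools.combinations(df.index, 2)
        if haversine_miles(df.at[i, "latitude"], df.at[i, "longitude"],
                           df.at[j, "latitude"], df.at[j, "longitude"]) <= radius_miles
    ]
    # mirror each close pair so both rows count each other
    pairs = pd.DataFrame(close + [(j, i) for i, j in close],
                         columns=["row", "neighbour"])
    totals = (pairs.merge(df["people"], left_on="neighbour", right_index=True)
                   .groupby("row")["people"].sum())
    # rows with no close neighbour get 0
    return totals.reindex(df.index, fill_value=0)


df = pd.DataFrame({'latitude': [40.991919, 40.992001, 40.991602, 40.989903, 40.987759],
                   'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                   'people': [1, 2, 3, 4, 5]})

# hypothetical chain: the middle row is close to both ends, but the ends are not close to each other
chain = pd.DataFrame({'latitude': [40.0000, 40.0006, 40.0012],
                      'longitude': [-106.0, -106.0, -106.0],
                      'people': [1, 2, 3]})

print(people_near(df))
print(people_near(chain))
```

For the `chain` data, building permutations over all three indexes would credit each end row with the other end's people, while the mirrored pairs only credit the middle row.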