Search code examples
pythonpandasdataframegroup-bysplit-apply-combine

Pandas apply function to groups, and filter the original dataframe


I have a DataFrame containing objects and their coordinates:

      id        lat         lng
0   3816  18.384001  -66.114799
1   5922  20.766100 -156.434998
2   1527  21.291394 -157.843085
3   1419  21.291394 -157.843085
4   1651  21.291394 -157.843085

Multiple objects can have same coordinates. The dataframe is big (millions of records). I have a target point with coordinates (target_lat, target_lng). My goal is to find objects in the dataframe which are within X miles of the target point, as efficiently as possible.

I'm using haversine_np function adapted from this question. It takes parameters (lat_series, lng_series, lat, lng) and efficiently computes all distances between lat_series, lng_series (two Series) and (lat, lng) (two numbers).

Now my question is how to use it to filter the distances and select objects in the original dataframe.

This is my current solution:

grouper = df.groupby(['lat', 'lng'], sort=False).grouper
lat_series = grouper.result_index.get_level_values(0)  # lats of unique (lat, lng) pairs
lng_series = grouper.result_index.get_level_values(1)  # lngs of unique (lat, lng) pairs
df['location_index'] = grouper.group_info[0]  # assign index of group back to df
distances = haversine_np(lat_series, lng_series, target_lat, target_lng)
mask = distances <= 50  # let's say 50 miles; boolean mask of size = ngroups
loc_indexes = pd.Series(range(grouper.ngroups))[mask]  # select group indexes by mask
df[df.location_index.isin(loc_indexes)]  # select original records by group indexes

It seems to work, although doesn't look reliable, because when I select the relevant group indexes by using pd.Series(range(grouper.ngroups))[mask], I assume that the level values of the grouped are naturally indexed (from 0 to ngroups-1). In other words, I'm relying on the fact that the i-th element in grouper.result_index.get_level_values() corresponds to the group with label i in grouper.group_info[0]. I couldn't find a more explicit way to get that mapping.

Questions:

  1. Is the method I'm using reliable?
  2. Is there a better (safer / more concise / more efficient) method?

Solution

  • UPDATE: @DennisGolomazov has found out that this "prefiltering" is not going to work properly for longitudes and make a very good example - here is a small demo:

    In [115]: df
    Out[115]:
         id   lat    lng
    5  4444  40.0 -121.0
    0  1111  40.0 -120.0
    
    In [116]: %paste
    threshold = 60
    max_lng_factor = 69.17
    max_lat_factor = 69.41
    target_lat, target_lng = 40, -120
    mask = df.lat.sub(target_lat).abs().le(threshold/max_lat_factor) \
           & \
           df.lng.sub(target_lng).abs().le(threshold/max_lng_factor)
    x = df.loc[mask, ['lat','lng']].drop_duplicates()
    ## -- End pasted text --
    
    In [117]: x
    Out[117]:
        lat    lng
    0  40.0 -120.0
    

    where the distance between these two coordinates is less than our threshold (60 miles):

    In [119]: haversine_np(-120, 40, -121, 40)
    Out[119]: 52.895043596886239
    

    Conclusion: we can prefilter latitudes, but not the longitudes:

    In [131]: df
    Out[131]:
         id   lat    lng
    5  4444  40.0 -121.0
    0  1111  40.0 -120.0
    1  2222  42.0 -121.0
    

    Correct prefiltering:

    In [132]: mask = df.lat.sub(target_lat).abs().le(threshold/max_lat_factor)
         ...: x = df.loc[mask, ['lat','lng']].drop_duplicates()
         ...:
    
    In [133]: x
    Out[133]:
        lat    lng
    5  40.0 -121.0
    0  40.0 -120.0
    

    Check:

    In [135]: df.reset_index() \
         ...:   .merge(x.assign(distance=haversine_np(x.lng, x.lat, target_lng, target_lat))
         ...:           .query("distance <= @threshold"),
         ...:          on=['lat','lng'])
         ...:
    Out[135]:
       index    id   lat    lng   distance
    0      5  4444  40.0 -121.0  52.895044
    1      0  1111  40.0 -120.0   0.000000
    

    Old, partially incorrect answer:

    I would try to do prefiltering in order to optimize the calculations. For example you can easily filter out the points that are definitely outside of your "rectangle of interest".

    Demo:

    threshold = 100
    
    # http://gis.stackexchange.com/questions/142326/calculating-longitude-length-in-miles/142327#142327
    max_lng_factor = 69.17
    max_lat_factor = 69.41
    
    target_lat, target_lng = 21.29, -157.84
    
    mask = df.lat.sub(target_lat).abs().le(threshold/max_lat_factor) \
           & \
           df.lng.sub(target_lng).abs().le(threshold/max_lng_factor)
    
    x = df.loc[mask, ['lat','lng']].drop_duplicates()
    
    df.reset_index() \
      .merge(x.assign(distance=haversine_np(x.lng, x.lat, target_lng, target_lat))
              .query("distance <= @threshold"),
             on=['lat','lng']) \
      .drop('distance',1) \
      .set_index('index')
    

    Result:

    In [142]: df.reset_index() \
         ...:   .merge(x.assign(distance=haversine_np(x.lng, x.lat, target_lng, target_lat))
         ...:           .query("distance <= @threshold"),
         ...:          on=['lat','lng']) \
         ...:   .drop('distance',1) \
         ...:   .set_index('index')
         ...:
    Out[142]:
             id        lat         lng
    index
    1      5922  20.766100 -156.434998
    2      1527  21.291394 -157.843085
    3      1419  21.291394 -157.843085
    4      1651  21.291394 -157.843085