Search code examples
pythonpandasgeopy

Multiple Condition Apply Function that iterates over itself


So I have a Dataframe that is the same thing 348 times, but with a different date as a static column. What I would like to do is add a column that checks against that date and then counts the number of rows that are within 20 miles using a lat/lon column and geopy.

My frame is like this:

enter image description here

What I am looking to do is something like an apply function that takes all of the identifying dates that are equal to the column and then run this:

geopy.distance.vincenty(x, y).miles

X would be the location's lat/lon and y would be the iterative lat/lon. I'd want the count of locations in which the above is < 20. I'd then like to store this count as a column in the initial Dataframe.


I'm ok with Pandas, but this is just outside my comfort zone. Thanks.


Solution

  • I started with this DataFrame (because I did not want to type that much by hand and you did not provide any code for the data):

    df
       Index    Number        la                 ID 
    0   0          1    [43.3948, -23.9483]     1/1/90
    1   1          2    [22.8483, -34.3948]     1/1/90
    2   2          3    [44.9584, -14.4938]     1/1/90
    3   3          4    [22.39458, -55.34924]   1/1/90 
    4   4          5    [33.9383, -23.4938]     1/1/90
    5   5          6    [22.849, -34.397]       1/1/90
    

    Now I introduced an artificial column which is only there to help us get the cartesian product of the distances

    df['join'] = 1
    df_c = pd.merge(df, df[['la', 'join','Index']], on='join')
    

    The next step is to apply the vincenty function via .apply and store the result in an extra column

    df_c['distance'] = df_c.apply(lambda x: distance.vincenty(x.la_x, x.la_y).miles, 1)
    

    Now we have the cartesian product of the original matrix, which means we have the comparison of each city with itself, too. But we will take that into account in the next step by performing -1. We groupby the Index_x and sum all the distances smaller the 20 miles.

    df['num_close_cities'] = df_c.groupby('Index_x').apply(lambda x: sum((x.distance < 20))) -1
    
    df.drop('join', 1)
    
        Index   Number  la                      ID      num_close_cities
    0   0         1     [43.3948, -23.9483]     1/1/90  0
    1   1         2     [22.8483, -34.3948]     1/1/90  1
    2   2         3     [44.9584, -14.4938]     1/1/90  0
    3   3         4     [22.39458, -55.34924]   1/1/90  0
    4   4         5     [33.9383, -23.4938]     1/1/90  0
    5   5         6     [22.849, -34.397]       1/1/90  1