python pandas dataframe sklearn-pandas

Nearest member in 2 similarly gridded dataframes with sklearn


I have 2 dataframes:

df1:

                    x             y        c0
2       468958.147443  4.633810e+06  1.253041
43      475516.484948  4.634928e+06  1.423767
72      475802.708042  4.635308e+06  1.294299
106     476658.696529  4.635686e+06  1.338760
133     472671.587615  4.636082e+06  1.325560
              ...           ...       ...
707923  394329.199687  5.006761e+06  1.155477
707980  409697.377813  5.006524e+06  1.223895
708570  411859.618686  5.006875e+06  1.093296
708576  413477.224756  5.006853e+06  1.161713
708695  445559.757010  5.006496e+06  1.149282

[12880 rows x 3 columns]

df2:

         kat    z0     kr             xx            yy
0        1.0  0.01  0.169  468526.696610  4.633654e+06
1        3.0  0.30  0.214  468757.270633  4.633653e+06
2        1.0  0.01  0.169  468066.930344  4.633965e+06
3        1.0  0.01  0.169  468297.494406  4.633964e+06
4        1.0  0.01  0.169  468528.058460  4.633963e+06
     ...   ...    ...            ...           ...
1287962  3.0  0.30  0.214  399566.653186  5.115395e+06
1287963  3.0  0.30  0.214  399781.023856  5.115391e+06
1287964  1.0  0.01  0.169  396570.675453  5.115753e+06
1287965  1.0  0.01  0.169  396785.035186  5.115750e+06
1287966  1.0  0.01  0.169  399571.712593  5.115703e+06

[1287967 rows x 5 columns]

For each point in df2, I want to find the nearest member of df1 within a certain radius, let's say radius=500. Then I want to put that nearest point's c0 value into df2. If there is no df1 point within radius=500, I want to set c0 to 1.0 in df2. (x,y) and (xx,yy) are the plane coordinates of df1 and df2, respectively.

Desired output (sample for the first 5 rows only):

         kat    z0     kr             xx            yy  c0
0        1.0  0.01  0.169  468526.696610  4.633654e+06  1.253041
1        3.0  0.30  0.214  468757.270633  4.633653e+06  1.253041
2        1.0  0.01  0.169  468066.930344  4.633965e+06  1.0
3        1.0  0.01  0.169  468297.494406  4.633964e+06  1.0
4        1.0  0.01  0.169  468528.058460  4.633963e+06  1.0
     ...   ...    ...            ...           ...
1287962  3.0  0.30  0.214  399566.653186  5.115395e+06  ...
1287963  3.0  0.30  0.214  399781.023856  5.115391e+06  ...
1287964  1.0  0.01  0.169  396570.675453  5.115753e+06  ...
1287965  1.0  0.01  0.169  396785.035186  5.115750e+06  ...
1287966  1.0  0.01  0.169  399571.712593  5.115703e+06  ...

I was thinking about converting these into shapefiles and working in some spatial query software, but I believe an efficient solution can be found with sklearn. Thanks in advance!


Solution

  • If I understand your requirement correctly, you may use scipy's cKDTree. It has a reputation for being quite fast thanks to its C/Cython implementation. Give it a try to see if it helps.

    I use only the first 5 rows of your df2 as my df2. My df1 is the same as your sample df1. I also assume that column c0 is the last column in df1 and that the distance is Euclidean.

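    from scipy.spatial import cKDTree
    
    # Build a k-d tree on the df1 coordinates
    df1_cTree = cKDTree(df1[['x','y']])
    # Index of the nearest df1 point within 500 for each df2 point;
    # a df2 point with no neighbour in range gets the index len(df1)
    ix_arr = df1_cTree.query(df2[['xx','yy']], k=1, distance_upper_bound=500)[1]
    
    # Take c0 from the matched df1 row, or fall back to 1 when there is no match
    df2['c0'] = [df1.iloc[x, -1] if x < len(df1) else 1 for x in ix_arr]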
    
    Out[438]:
       kat    z0     kr             xx         yy        c0
    0  1.0  0.01  0.169  468526.696610  4633654.0  1.253041
    1  3.0  0.30  0.214  468757.270633  4633653.0  1.253041
    2  1.0  0.01  0.169  468066.930344  4633965.0  1.000000
    3  1.0  0.01  0.169  468297.494406  4633964.0  1.000000
    4  1.0  0.01  0.169  468528.058460  4633963.0  1.253041
    

    Note: the distance from row index 4 of df2, [468528.058460, 4633963.0], to row 0 of df1, [468958.147443, 4633810], is 456.4926432, so it satisfies the within-500 condition. Therefore its c0 should not be 1 as in your desired output.
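
With about 1.3 million rows in df2, the list comprehension above may become the slow part. A minimal vectorized sketch of the same lookup follows. It assumes the first few rows of df1 and df2 as printed in the question (y and yy are rounded there); the names `tree`, `found`, and the extra `dist` column are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# First rows of df1 and df2 as printed in the question (y/yy are rounded there)
df1 = pd.DataFrame({
    "x": [468958.147443, 475516.484948, 475802.708042],
    "y": [4633810.0, 4634928.0, 4635308.0],
    "c0": [1.253041, 1.423767, 1.294299],
})
df2 = pd.DataFrame({
    "xx": [468526.696610, 468757.270633, 468066.930344,
           468297.494406, 468528.058460],
    "yy": [4633654.0, 4633653.0, 4633965.0, 4633964.0, 4633963.0],
})

# Build the tree on df1 and query it once for all df2 points
tree = cKDTree(df1[["x", "y"]].to_numpy())
dist, idx = tree.query(df2[["xx", "yy"]].to_numpy(), k=1,
                       distance_upper_bound=500)

# Points with no neighbour in range come back with dist == inf
# and idx == len(df1), so a boolean mask separates hits from misses
found = idx < len(df1)

c0 = np.ones(len(df2))                        # default value 1.0
c0[found] = df1["c0"].to_numpy()[idx[found]]  # copy c0 from the matched rows

df2["c0"] = c0
df2["dist"] = dist  # kept only to inspect how far each match is
print(df2)
```

Keeping the distances returned by `query` makes it easy to confirm cases like row 4, whose nearest df1 point is about 456.49 away and therefore inside the radius.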