python pandas scikit-learn sklearn-pandas

Nearest member additional attribute analysis

I have the following dataframe df (sample):

         lat        lon  crs   Band1              x             y
0  41.855584  20.619156  b''  1568.0  468388.198606  4.633812e+06
1  41.855584  20.622590  b''  1562.0  468673.173031  4.633811e+06
2  41.855584  20.626023  b''  1605.0  468958.147443  4.633810e+06
3  41.859017  20.612290  b''  1598.0  467819.970900  4.634196e+06
4  41.859017  20.615723  b''  1593.0  468104.930108  4.634195e+06
5  41.859017  20.619156  b''  1600.0  468389.889303  4.634193e+06
6  41.859017  20.622590  b''  1586.0  468674.848486  4.634192e+06
7  41.859017  20.626023  b''  1577.0  468959.807656  4.634191e+06
8  41.859017  20.629456  b''  1584.0  469244.766814  4.634190e+06
9  41.859017  20.632889  b''  1598.0  469529.725959  4.634188e+06

The fields x and y are coordinates in the xy plane, and Band1 is the point elevation (in essence, the z coordinate). The dataframe is a rectangular grid, with x and y as the cell-center coordinates and Band1 as the cell elevation.

How can I detect which grid cells have the highest Band1 relative to their neighboring cells?

The expected output is an additional boolean column in the dataframe indicating whether the cell has a higher Band1 elevation than its 4 neighboring cells.

I can easily get the neighboring grid distances and indices with:

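from sklearn.neighbors import NearestNeighbors

# n_neighbors=5: each point comes back as its own nearest neighbor, plus 4 others
X = df[['x', 'y']].to_numpy()
nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)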

The indices output is:

array([[0, 1, 5, 6, 4],
       [1, 2, 0, 6, 7],
       [2, 1, 7, 8, 6],
       [3, 4, 5, 0, 6],
       [4, 5, 3, 0, 6],
       [5, 6, 4, 0, 1],
       [6, 7, 5, 1, 2],
       [7, 8, 6, 2, 1],
       [8, 9, 7, 2, 6],
       [9, 8, 7, 2, 6]], dtype=int64)

I can loop through the dataframe and compare all members, but that is resource-consuming since I have 1M rows. Any help is appreciated.

Solution

IIUC, you can use indices to look up the corresponding Band1 values, then use np.argmax with axis=1 to get the position of the highest value in each row. Because the first column of indices is the row itself, a position of 0 means that the row's Band1 is higher than those of its neighbors:

import numpy as np
df['local_high'] = np.argmax(df['Band1'].to_numpy()[indices], axis=1) == 0

This gives:

         lat        lon  crs   Band1              x          y  local_high
0  41.855584  20.619156  b''  1568.0  468388.198606  4633812.0       False
1  41.855584  20.622590  b''  1562.0  468673.173031  4633811.0       False
2  41.855584  20.626023  b''  1605.0  468958.147443  4633810.0        True
3  41.859017  20.612290  b''  1598.0  467819.970900  4634196.0       False
4  41.859017  20.615723  b''  1593.0  468104.930108  4634195.0       False
5  41.859017  20.619156  b''  1600.0  468389.889303  4634193.0        True
6  41.859017  20.622590  b''  1586.0  468674.848486  4634192.0       False
7  41.859017  20.626023  b''  1577.0  468959.807656  4634191.0       False
8  41.859017  20.629456  b''  1584.0  469244.766814  4634190.0       False
9  41.859017  20.632889  b''  1598.0  469529.725959  4634188.0       False
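
To unpack the one-liner, here is a minimal sketch that uses only NumPy, with the Band1 values and the indices array copied from the question. The names neighbor_vals and strict_high are illustrative and not part of the original answer. Two assumptions are worth stating: column 0 of indices must be the cell itself (which holds as long as no two cells share the same coordinates), and np.argmax returns the first maximum, so a cell that ties with a neighbor is still flagged. The strict variant below is one possible way to exclude such ties.

```python
import numpy as np

# Band1 values and neighbor indices copied from the question's sample.
band1 = np.array([1568., 1562., 1605., 1598., 1593.,
                  1600., 1586., 1577., 1584., 1598.])
indices = np.array([[0, 1, 5, 6, 4],
                    [1, 2, 0, 6, 7],
                    [2, 1, 7, 8, 6],
                    [3, 4, 5, 0, 6],
                    [4, 5, 3, 0, 6],
                    [5, 6, 4, 0, 1],
                    [6, 7, 5, 1, 2],
                    [7, 8, 6, 2, 1],
                    [8, 9, 7, 2, 6],
                    [9, 8, 7, 2, 6]])

# Integer-array indexing gathers Band1 for each cell and its neighbors:
# row i holds [own value, neighbor 1, ..., neighbor 4], shape (10, 5).
neighbor_vals = band1[indices]

# Same test as the one-liner: the cell itself (column 0) holds the row maximum.
local_high = np.argmax(neighbor_vals, axis=1) == 0

# Stricter variant (assumption, not in the original answer):
# the cell must be strictly higher than every one of its neighbors.
strict_high = neighbor_vals[:, 0] > neighbor_vals[:, 1:].max(axis=1)

print(np.flatnonzero(local_high))   # rows flagged by the argmax test
print(np.flatnonzero(strict_high))  # rows flagged by the strict test
```

On this sample both tests flag rows 2 and 5, matching the table above. Note also that for cells on the border of the grid, the 5 nearest points can include cells that are not direct neighbors (row 0 above picks up the diagonal cells 4 and 6), so the distances array returned by kneighbors may be useful for masking out neighbors that are farther away than the grid spacing.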