Search code examples
pythonsqlpandascountnearest-neighbor

How to improve the time of execution of this for loop?


I have a DF called "zone" with x and y columns as integers, that can be interpreted as the position of points. I need to compute the number of first and second neighbors, and I have written this:

import numpy as np
import pandas as pd

data = np.random.randint(1000,6000,size=(600000,2))
zone = pd.DataFrame(data, columns=['x', 'y']).drop_duplicates()

a=[]
for i,row in zone.iterrows(): 
    x = row.x
    y = row.y
    num_1st_neigh = len(zone[(zone.x>=(x-1))&(zone.x<=(x+1))&(zone.y>=(y-1))&(zone.y<=(y+1))])-1
    num_2nd_neigh = (len(zone[(zone.x>=(x-2))&(zone.x<=(x+2))&(zone.y>=(y-2))&(zone.y<=(y+2))])-1)\
    -(num_1st_neigh)
    a.append([i,num_1st_neigh,num_2nd_neigh])
a = pd.DataFrame(a, columns = ['index','num_1st_neigh','num_2nd_neigh'])
zzz = zone.reset_index().merge(a,on='index')

This works good but lasts 15 s on 3K points, I have 1M points and it's stll running after 2h. Any ideas on how can I improve the execution speed?

I have read that iterrows is very slow, but I don't know how else I could do it.

EDIT: I also tried the same using SQL, but also the execution time is >2h and the query returns a timeout:

SELECT t0.x,
    t0.y,
    count_if(greatest(abs(t0.x-t1.x), abs(t0.y-t1.y)) = 1) num_1_neighbors,
    count_if(greatest(abs(t0.x-t1.x), abs(t0.y-t1.y)) = 2) num_2_neighbors
FROM "table" t0 
    left join "table" t1 on t1.x between t0.x -2 and t0.x + 2
    and t1.y between t0.y -2 and t0.y + 2
    and (
        t1.x <> t0.x
        or t1.y <> t0.y
    )
group by 1,2

Any ideas both using SQL or pandas are very welcome


Solution

  • You can use BallTree from sklearn:

    from sklearn.neighbors import BallTree
    
    xy = zone[['x', 'y']]
    tree = BallTree(xy, metric='euclidean')
    num_1st_neigh = tree.query_radius(xy, r=1*np.sqrt(2), count_only=True) - 1
    num_2nd_neigh = tree.query_radius(xy, r=2*np.sqrt(2), count_only=True) - 1 - num_1st_neigh
    zone['num_1st_neigh'] = num_1st_neigh
    zone['num_2nd_neigh'] = num_2nd_neigh
    

    From a small example:

    # BallTree
    >>> zone
         x    y  num_1st_neigh  num_2nd_neigh
    0  106  115              0              0
    1  118  104              0              0
    2  119  114              0              0
    3  108  103              0              2
    4  103  101              0              0
    5  110  105              0              1
    6  103  104              0              0
    7  102  119              0              0
    8  106  105              0              1
    9  111  114              0              0
    
    # Your code
    >>> zzz
       index    x    y  num_1st_neigh  num_2nd_neigh
    0      0  106  115              0              0
    1      1  118  104              0              0
    2      2  119  114              0              0
    3      3  108  103              0              2
    4      4  103  101              0              0
    5      5  110  105              0              1
    6      6  103  104              0              0
    7      7  102  119              0              0
    8      8  106  105              0              1
    9      9  111  114              0              0