Search code examples
pythonnumpymachine-learningscikit-learnkdtree

Scikit-learn KDTree query_radius return both count and ind?


I'm trying to return both count (the number of neighbors) and ind (indices of said neighbors) but I can't unless I call query_radius twice, which although computationally intensive, is actually faster for me in Python than iterating through and counting the sizes of each row in ind! This seems terribly inefficient so I wonder is there a way to just return them both in one call?

I tried to access the count and ind objects of tree after calling query_radius but it doesn't exist. There's no efficient way to do this in numpy, is there?

>>> array = np.array([[1,2,3], [2,3,4], [6,2,3]])
>>> tree = KDTree(array)
>>> neighbors = tree.query_radius(array, 1)
>>> tree.ind
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'sklearn.neighbors.kd_tree.KDTree' object has no attribute 'ind'
>>> tree.count
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'sklearn.neighbors.kd_tree.KDTree' object has no attribute 'count'

Solution

  • Consider this dataset:

    array = np.random.random((10**5, 3))*10
    tree = KDTree(array)
    

    There are 3 options as you identify in your question:

    1) Call tree.query_radius twice to get the neighbors and their counts.

    neighbors = tree.query_radius(array, 1)
    counts = tree.query_radius(array, 1, count_only=1)
    

    This takes 8.347s.

    2) Get only the neighbors and then get the counts by iterating through them:

    neighbors = tree.query_radius(array, 1)
    counts = []
    for i in range(len(neighbors)):
        counts.append(len(neighbors[i]))
    

    This is significantly faster than the first method and takes 4.697s

    3) Now, we can improve the looping time to calculate counts.

    neighbors = tree.query_radius(array, 1)
    len_array = np.frompyfunc(len, 1, 1)
    counts = len_array(neighbors)
    

    This is the fastest with 4.449s.