python numpy machine-learning scikit-learn kdtree

Scikit-learn KDTree query_radius return both count and ind?

I'm trying to return both count (the number of neighbors) and ind (indices of said neighbors) but I can't unless I call query_radius twice, which although computationally intensive, is actually faster for me in Python than iterating through and counting the sizes of each row in ind! This seems terribly inefficient so I wonder is there a way to just return them both in one call?

I tried to access the count and ind objects of tree after calling query_radius but it doesn't exist. There's no efficient way to do this in numpy, is there?

>>> array = np.array([[1,2,3], [2,3,4], [6,2,3]])
>>> tree = KDTree(array)
>>> neighbors = tree.query_radius(array, 1)
>>> tree.ind
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'sklearn.neighbors.kd_tree.KDTree' object has no attribute 'ind'
>>> tree.count
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'sklearn.neighbors.kd_tree.KDTree' object has no attribute 'count'

Solution

Consider this dataset:

array = np.random.random((10**5, 3))*10
tree = KDTree(array)

There are 3 options as you identify in your question:

1) Call tree.query_radius twice to get the neighbors and their counts.

neighbors = tree.query_radius(array, 1)
counts = tree.query_radius(array, 1, count_only=1)

This takes 8.347s.

2) Get only the neighbors and then get the counts by iterating through them:

neighbors = tree.query_radius(array, 1)
counts = []
for i in range(len(neighbors)):
    counts.append(len(neighbors[i]))

This is significantly faster than the first method and takes 4.697s

3) Now, we can improve the looping time to calculate counts.

neighbors = tree.query_radius(array, 1)
len_array = np.frompyfunc(len, 1, 1)
counts = len_array(neighbors)

This is the fastest with 4.449s.