I am looking at SciPy's KD Tree implementation. I am confused by the leafsize
parameter, which is said to be
The number of points at which the algorithm switches over to brute-force. Default: 16.
Is this the number of leaves in the BST? If so, that means by default, the KD tree will contain no more than 32 points. This seems unreasonably small, especially for my use case of k=2. Am I interpreting the parameter incorrectly?
The parameter leafsize
does not control the maximum number of leaves in the tree (which isn't bounded), but rather the number of input points associated with a single leaf in the tree. Consider 64 points being added to a tree. If each point gets associated with a single leaf, you will get a full tree that is seven levels deep and any processing requiring a point will descend these seven levels to find it. Alternatively, if the leafsize is 16, you will get a tree that is only 3 levels deep and each leaf in the tree is associated with 16 points. Processing a point involves descending three levels of the tree and then testing each of the 16 points in the leaf (since they are unordered). This value is can be tuned for performance, and empirically, the best value depends on how exactly the tree is being used.
Conceptually, here is an example of the trees that would be built for different leafsize
values.
You can see how the leafsize
parameter short-circuits the tree building process in the scipy source code. This will leave between leafsize/2
and leafsize
points unordered in the leafs of the tree.