I am trying to perform ANN, but my data is split into partitions or "tenants." Searches are always restricted to a single tenant, which represents a small percentage of the total documents.
I first tried implementing this using a filter on a tenant string attribute. However, I encountered this piece of documentation, that suggests the performance will be poor:
There is a small problem here however. If the eligibility list is small in relation to the number of items in the graph, skipping occurs with a high probability. This means that the algorithm needs to consider an exponentially increasing number of candidates, slowing down the search significantly. To solve this, Vespa.ai switches over to a brute-force search when this occurs. The result is a efficient ANN search when combined with filters.
What's the best way to solve my problem? Will partitioning my data into separate namespaces trigger the creation of a separate HNSW graph per namespace?
Performance will be fine, the query planner will just choose to not use the ANN index for these queries. You'll find lots of details on this topic, including how to tune this, in this blog post: https://blog.vespa.ai/constrained-approximate-nearest-neighbor-search/
If all your queries are towards a single tenant which is a small percentage of the total documents I don't think you necessarily need to create an HNSW index at all, but this depends on the absolute numbers and the largest "small percentage".
(Namespaces are not relevant here - their only purpose is to safely add a string to ids so that you can have multiple sources of ids and still be guaranteed global uniqueness.)