database  artificial-intelligence  vector-database  milvus

Filtering results by gender: Adding a boolean field schema does not enhance search speed


I have created a collection with the following specifications:

  • Milvus Version: 2.4.4
  • CPU
  • Number of Entities: 20 million
  • Vector Field: One field of type float with a dimension of 512
  • Boolean Field: Represents gender, with a 50% probability for both male and female
  • Metric: COSINE
  • M: 64
  • efConstruction: 256
  • ef: 128
  • Index Type: HNSW
  • Partitions, segments, and num_shards: left unconfigured

In my initial benchmark, I compared Milvus's performance against NumPy's dot product and was pleased with the results.
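For reference, the configuration listed above could be expressed in pymilvus roughly as follows. This is a minimal sketch, not the exact code behind the benchmark: the collection name `faces_demo` and the field names `id`, `embedding`, and `is_male` are assumptions made for illustration, and only the numeric settings come from the list. The pymilvus import sits inside the function so the settings can be inspected without a running Milvus instance.

```python
# Names below (faces_demo, id, embedding, is_male) are hypothetical.
# Only dim=512, COSINE, HNSW, M=64 and efConstruction=256 come from the post.
DIM = 512

INDEX_PARAMS = {
    "index_type": "HNSW",
    "metric_type": "COSINE",
    "params": {"M": 64, "efConstruction": 256},
}


def create_collection(name="faces_demo"):
    """Create the collection and build the HNSW index on the vector field.

    Assumes a Milvus server is running and a connection has already been
    opened (for example with pymilvus.connections.connect).
    """
    from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=DIM),
        FieldSchema(name="is_male", dtype=DataType.BOOL),
    ]
    schema = CollectionSchema(fields, description="512-d embeddings with a gender flag")
    collection = Collection(name=name, schema=schema)
    collection.create_index(field_name="embedding", index_params=INDEX_PARAMS)
    return collection
```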

Next, I added a boolean field indicating the gender of each embedding vector so that I can restrict queries by gender; for instance, retrieve the 50 nearest neighbors that are male. I generated the gender values with equal probability, so half of the collection is male and the other half female. I benchmarked this scenario, and the findings are outlined below. As the plot shows, filtering by gender brought almost no benefit: in one case, the filtered query was only 1.06 times faster than the unfiltered one.
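The two query variants being compared might look like the sketch below. It only builds the keyword arguments for pymilvus's `Collection.search`; the field names `embedding` and `is_male` and the helper `build_search_kwargs` are assumptions for illustration, while `ef=128`, the COSINE metric, and the limit of 50 come from the question.

```python
# Hypothetical field names: "embedding" (vector) and "is_male" (boolean).
SEARCH_PARAM = {"metric_type": "COSINE", "params": {"ef": 128}}


def build_search_kwargs(query_vector, only_male=None, top_k=50):
    """Return keyword arguments for Collection.search.

    only_male=None  -> unfiltered search
    only_male=True  -> restrict to rows where is_male is true
    only_male=False -> restrict to rows where is_male is false
    """
    expr = None
    if only_male is not None:
        expr = "is_male == true" if only_male else "is_male == false"
    return {
        "data": [list(query_vector)],
        "anns_field": "embedding",
        "param": SEARCH_PARAM,
        "limit": top_k,
        "expr": expr,
    }
```

With a loaded collection, the filtered query would then be issued as `collection.search(**build_search_kwargs(vec, only_male=True))` and the unfiltered one by omitting `only_male`.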


Solution

  • Adding an index on the boolean field is unlikely to help much. For a low-cardinality field like this one it improves performance by less than 50%, and most of the time is spent in the HNSW search. In fact, boolean filtering itself is very fast (under 1 ms) and doesn't really need an index.
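To check on your own data how much the filter actually changes query time, one option is to time both variants and compare the averages. The helper below is a generic sketch using only the standard library; `run_query` is a hypothetical zero-argument callable that would wrap a single Milvus search.

```python
import time


def mean_latency(run_query, repeats=100):
    """Average wall-clock seconds per call of run_query over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        run_query()
    return (time.perf_counter() - start) / repeats


def speedup(unfiltered_seconds, filtered_seconds):
    """How many times faster the filtered query is (1.0 means no gain)."""
    return unfiltered_seconds / filtered_seconds
```

A ratio close to 1.0, such as the 1.06 reported in the question, would be consistent with the graph search dominating the cost rather than the boolean filter.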