python milvus rag retrieval-augmented-generation

Creating an index in PyMilvus 2.5.x does not actually index any rows


I am trying to create an index on text embeddings for a RAG system, using Milvus 2.5.x as the vector database from Python. I have already created the collections and populated them. My dataset is quite small, as this is a research project: one collection with 500 rows and another with 53 rows.

My current setup is as follows:

from pymilvus import MilvusClient 
client = MilvusClient('../data/task_embeddings.db') 
client.load_collection('collection')

client.drop_index('collection', 'problem_statement_embeddings') # Ensure clean precondition before trying to create index
client.describe_index('collection', 'problem_statement_embeddings') # Check whether the last statement worked as expected

index_params = MilvusClient.prepare_index_params()
index_params.add_index(
    index_name='problem_statement_embeddings',
    field_name="vector",
    index_type="FLAT",
    metric_type="COSINE", 
)
client.create_index('collection', index_params, sync=True)

This code runs without errors. However, when I then check the index with client.describe_index('collection', 'problem_statement_embeddings'), I get the following output:

{'index_type': 'FLAT',
 'metric_type': 'COSINE',
 'dim': '768',
 'field_name': 'vector',
 'index_name': 'problem_statement_embeddings',
 'total_rows': 0,
 'indexed_rows': 0,
 'pending_index_rows': 0,
 'state': 'Finished'}

This indicates that no rows were indexed. If I run a search query, I still get a result. I suppose that at my dataset size it does not matter much whether the data is indexed, but I would still like to understand what is going on here so that I don't run into unexpected behaviour later.

Edit: I have opened an issue in the Milvus repo.


Solution

  • This happens because FLAT is a brute-force search: no index structure is built, so no rows are reported as indexed. According to the maintainers, this is expected behaviour. It is pretty poor UX that this isn't documented, though.
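
To illustrate what "brute-force" means here, below is a minimal sketch of a flat cosine search in plain Python. It is not Milvus's actual implementation; the function names (cosine_similarity, flat_search) and the toy vectors are made up for illustration, and it assumes all vectors are non-zero. The point is that every query simply scans all stored vectors, so there is nothing to build ahead of time.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def flat_search(query, vectors, limit=3):
    """Hypothetical brute-force search: score every stored vector.

    No index structure is built beforehand; each call does a full scan
    and returns (row_id, score) pairs, best match first.
    """
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:limit]]


# Toy 3-dimensional "embeddings" (illustrative data only).
stored = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]

hits = flat_search([1.0, 0.1, 0.0], stored, limit=2)
print(hits)
```

With a few hundred rows, a full scan like this is cheap, which is consistent with searches still returning results even though describe_index reports zero indexed rows.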