I have a collection that contains about 300k chunks. I don't think that's a huge amount, but retrieval becomes very slow when a metadata filter is applied: it sometimes takes up to 180 seconds to retrieve 10 documents, versus only about 2 seconds without the filter.
I am using BGE-large as my embedding model and LangChain's retriever.invoke() as my retrieval function.
Has anyone encountered the same situation? I wonder if it is possible to speed up the filtering, for example by setting up an index on the metadata field, like what we can do in MongoDB.
I looked around online, and while some people do complain about ChromaDB being slow, no one seems to have it as slow as I do.
Below is my code for setting up the retriever:
def set_retriever(self, search_type='mmr', search_kwargs=None):
    """
    Used for document retrieval.
    search_type defaults to 'mmr'.
    Use retriever.invoke(query) to retrieve documents.
    """
    # Build the default inside the function to avoid a mutable default argument.
    if search_kwargs is None:
        search_kwargs = {'k': 3}
    self.retriever = self.langchain_chroma.as_retriever(search_type=search_type, search_kwargs=search_kwargs)
    return self.retriever
Below is my code for the retrieval:
retriever = db.set_retriever(search_kwargs={
    "filter": {
        "publishDate": {"$gte": start_date.timestamp()}
    }
})
t1 = time.perf_counter()
results = retriever.invoke("some question")
t2 = time.perf_counter()
print(f"total time taken: ", round(t2-t1,3))
print(results)
I also tried chromadb's collection.query(where={...}) directly, but it didn't help.
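For reference, this is the kind of minimal timing snippet I run directly against the collection to check whether the overhead comes from LangChain or from Chroma itself (the path, collection name, and cutoff date below are placeholders, not my real setup):

import time
from datetime import datetime

import chromadb

# Placeholders: adjust the path and collection name to your setup. If the
# collection was built with a custom embedding function (e.g. BGE-large),
# pass it to get_collection so query_texts is embedded the same way.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("my_collection")

start_date = datetime(2023, 1, 1)  # placeholder cutoff date

t1 = time.perf_counter()
results = collection.query(
    query_texts=["some question"],
    n_results=10,
    where={"publishDate": {"$gte": start_date.timestamp()}},
)
t2 = time.perf_counter()
print(f"raw chroma query time: {round(t2 - t1, 3)}s")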
ChromaDB does not currently create indices on metadata; as far as I can see, this is still an open issue in their repo. A workaround is to apply the filter manually after performing the vector search. I had similar performance issues with only ~50K documents. Personally, I would advise using Milvus or Pinecone for non-trivially-sized collections.
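Here is a rough sketch of that workaround; filtered_search, the "publishDate" field, and the over-fetch factor are placeholders for your own setup, not a drop-in fix:

def filtered_search(collection, query_text, cutoff, k=10, overfetch=10):
    """Over-fetch from the unfiltered ANN index, then post-filter in Python."""
    # Pull far more results than needed; the unfiltered search is the fast path.
    raw = collection.query(query_texts=[query_text], n_results=k * overfetch)
    hits = []
    for doc, meta in zip(raw["documents"][0], raw["metadatas"][0]):
        # Apply the metadata predicate here instead of in Chroma's `where`.
        if meta.get("publishDate", 0) >= cutoff:
            hits.append((doc, meta))
        if len(hits) == k:
            break
    return hits

The trade-off is recall: if most of the collection fails the filter, you may need a much larger over-fetch factor or a second pass.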
See the thread and the open PR.