Tags: information-retrieval, chromadb, vector-database, retrieval-augmented-generation

chromadb retrieval with metadata filtering is very slow


I have a collection that contains about 300k chunks. I don't think that is a huge amount, but retrieval is very slow when a metadata filter is applied: it sometimes takes up to 180 seconds to retrieve 10 documents, versus only about 2 seconds without the filter.

I am using BGE-large as my embedding model and LangChain's retriever.invoke() as my retrieval function.
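
For context, the vector store is set up roughly like this (the exact model name, collection name, and persist directory below are placeholders, not my real values):

from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import Chroma

# Rough sketch of the setup; names and paths are placeholders.
embeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-large-en-v1.5")
langchain_chroma = Chroma(
    collection_name="my_docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)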

Has anyone encountered the same situation? I wonder if it is possible to speed up the filtering, for example by creating an index on the metadata field, the way you can in MongoDB.

I looked this up online, and although some people do complain about ChromaDB being slow, I haven't seen anyone report retrieval as slow as mine.

Below is my code for setting up the retriever:

def set_retriever(self, search_type='mmr', search_kwargs={'k': 3}):
    """
    Build a retriever for document retrieval.
    search_type defaults to 'mmr'.
    Use retriever.invoke(query) to retrieve documents.
    """
    self.retriever = self.langchain_chroma.as_retriever(
        search_type=search_type,
        search_kwargs=search_kwargs,
    )
    return self.retriever

Below is my code for the retrieval:

import time

retriever = db.set_retriever(search_kwargs={
    "filter": {
        "publishDate": {"$gte": start_date.timestamp()}
    }
})

# Time the filtered retrieval.
t1 = time.perf_counter()
results = retriever.invoke("some question")
t2 = time.perf_counter()
print(f"total time taken: {round(t2 - t1, 3)} s")
print(results)

I also tried calling collection.query(where={...}) on the chromadb collection directly, but it didn't help.
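
For completeness, that direct chromadb call looked roughly like this (the path, collection name, and date cutoff are placeholders; in my real code the query is embedded with BGE-large):

import time
from datetime import datetime, timedelta

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")  # placeholder path
collection = client.get_collection("my_docs")           # placeholder name

start_date = datetime.now() - timedelta(days=30)        # example cutoff

t1 = time.perf_counter()
results = collection.query(
    query_texts=["some question"],  # embedded with the collection's embedding function
    n_results=10,
    where={"publishDate": {"$gte": start_date.timestamp()}},
)
t2 = time.perf_counter()
print(f"direct query time: {round(t2 - t1, 3)} s")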


Solution

  • Chroma does not currently create indices on metadata; as far as I can see, this is still an open issue in their repo. A workaround is to apply the filtering manually after performing the vector search (a sketch of this follows below). I had similar performance issues with only ~50K documents. Personally, I would advise using Milvus or Pinecone for non-trivially-sized collections.

    See the thread and the open PR
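
    A minimal sketch of that post-filtering workaround, assuming the LangChain Chroma vector store from the question (the function name, the fetch_k over-fetch size, and the publishDate field are my assumptions):

    def retrieve_with_post_filter(vectorstore, query, start_ts, k=10, fetch_k=200):
        """Over-fetch without a metadata filter, then filter on metadata client-side."""
        # Unfiltered vector search is the fast path (~2 s in the question).
        candidates = vectorstore.similarity_search(query, k=fetch_k)

        # Apply the date predicate manually and trim to k results.
        return [
            doc for doc in candidates
            if doc.metadata.get("publishDate", 0) >= start_ts
        ][:k]

    # e.g. results = retrieve_with_post_filter(db.langchain_chroma,
    #                                          "some question",
    #                                          start_date.timestamp())

    Note that if too few of the top fetch_k candidates satisfy the predicate you will get fewer than k results, so fetch_k has to be tuned to your data.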