python, cosine-similarity, sentence-similarity, chromadb, vector-database

Searching existing ChromaDB database using cosine similarity


I have an existing database containing around 15 PDFs. I want to search it with cosine similarity and get back the X most relevant results that meet a given similarity threshold.

Currently, I've defined a collection using this code:

import chromadb

# Open the on-disk database and fetch (or create) the collection
chroma_client = chromadb.PersistentClient(path="TEST_EMBEDDINGS/CHUNK_EMBEDDINGS")
collection = chroma_client.get_or_create_collection(name="CHUNK_EMBEDDINGS")

I've done a bit of research, and it seems that ChromaDB does not have a similarity search while FAISS does. However, the existing solutions online suggest doing something along these lines:

from langchain.vectorstores import Chroma
db = Chroma.from_documents(texts, embeddings)
docs_score = db.similarity_search_with_score(query=query, distance_metric="cos", k = 6)

I am unsure how I can integrate this code or if there are better solutions.


Solution

  • ChromaDB does have similarity search. The default distance metric is L2, but you can change it when you create the collection, as documented here.

    collection = client.create_collection(
        name="collection_name",
        metadata={"hnsw:space": "cosine"} # l2 is the default
    )
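To cover the threshold part of the question, here is a minimal sketch of filtering query results by cosine similarity. It assumes the collection uses the cosine space, in which case the reported distance is 1 minus the cosine similarity. The helper name `filter_by_similarity` and the sample data are hypothetical and only there for illustration; the dictionary mimics the shape that `collection.query()` returns (one inner list per query text).

```python
def filter_by_similarity(results, min_similarity):
    """Keep hits whose cosine similarity is at least min_similarity.

    `results` is assumed to look like the output of collection.query():
    lists of lists under "ids", "documents", and "distances".
    Only the first query's hits are inspected.
    """
    hits = []
    ids = results["ids"][0]
    docs = results["documents"][0]
    dists = results["distances"][0]
    for doc_id, doc, dist in zip(ids, docs, dists):
        similarity = 1.0 - dist  # cosine distance -> cosine similarity
        if similarity >= min_similarity:
            hits.append((doc_id, doc, similarity))
    return hits


# Hand-written stand-in for a query result (hypothetical data, not real output)
sample_results = {
    "ids": [["a", "b", "c"]],
    "documents": [["first chunk", "second chunk", "third chunk"]],
    "distances": [[0.1, 0.3, 0.7]],
}

top_hits = filter_by_similarity(sample_results, min_similarity=0.6)
for doc_id, doc, similarity in top_hits:
    print(doc_id, doc, round(similarity, 3))
```

With a real collection, the input would come from something like `results = collection.query(query_texts=["your question"], n_results=6)`, where `n_results` plays the role of X. As far as I know, `hnsw:space` is fixed when the collection is created, so a collection that was built with the default L2 space would likely need to be recreated and re-populated before its distances can be read as cosine distances.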