
How to change distance function in `langchain` similarity_search


I have two questions:

  1. How can I change the distance metric directly in similarity_search? By default the function uses Euclidean distance, and I would like to use cosine, for example.
from eurelis_langchain_solr_vectorstore import Solr
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings(model="bge-small-en")

vector_store = Solr(embeddings_model, core_kwargs={
    'page_content_field': 'content',  # field containing the text content
    'vector_field': 'content_vec',    # field containing the embeddings of the text content
    'core_name': 'default',         # core name
    'url_base': 'http://localhost:8983/solr' # base url to access solr
})

# here I want to use cosine distance metric
vector_store.similarity_search("relevant question", k=5)

  2. How can I change the distance metric directly in as_retriever?
# here I want to use cosine distance metric
retriever = vector_store.as_retriever(search_kwargs={'k': 5}) 


Solution

  • 1-2. You can't do it that way. The distance function is a parameter you define in the vector database, that is, in Solr (in the field type definition of content_vec, see the example below). As with other field properties, it is not meant to change once the vector field has been used (i.e. indexed).

    Also, OpenAI embeddings are normalized to unit length, which means that (cf. the FAQ):

    • Cosine similarity and Euclidean distance will result in identical rankings
    • Cosine similarity can be computed slightly faster using just a dot product

    The Solr documentation also states that the preferred way to perform cosine similarity is to normalize all vectors to unit length and use dot_product as the similarity function rather than cosine (see DenseVectorField).
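The equivalence claimed above can be checked with a small, pure-Python sketch. The vectors and document names below are made-up toy values standing in for real embeddings, and the helper functions are illustrative only (they are not part of LangChain or Solr). For unit-length vectors, the squared Euclidean distance equals 2 - 2 * (dot product), so all three measures should rank the documents in the same order.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; equals dot(a, b) when both vectors have unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy 3-dimensional vectors (hypothetical values, not real embeddings).
query = normalize([0.2, 0.7, 0.1])
docs = {
    "doc_a": normalize([0.1, 0.9, 0.0]),
    "doc_b": normalize([0.8, 0.1, 0.3]),
    "doc_c": normalize([0.3, 0.5, 0.4]),
}

# Higher is better for similarities, lower is better for distances.
by_dot = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)
by_cosine = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
by_euclidean = sorted(docs, key=lambda d: euclidean(query, docs[d]))

print(by_dot, by_cosine, by_euclidean)
```

If your embedding model does not return unit-length vectors, a helper like normalize would have to be applied both at indexing time and at query time for dot_product to behave like cosine.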

    For example, in the Solr schema.xml you would have the following:

    <fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="1536" similarityFunction="dot_product"/>
    <field name="content_vec" type="knn_vector" indexed="true" stored="true"/>
    

    Note that the vectorDimension parameter has to match the number of dimensions of your embedding model (e.g. 1536 is the default for text-embedding-3-small, 3072 for text-embedding-3-large, etc.).
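A dimension mismatch is easy to catch before indexing. The following is a minimal sketch, assuming the schema fragment from above is available as a string (wrapped here in a root element so that it parses on its own). The function names declared_dimension and check_embedding are hypothetical helpers, not part of Solr or LangChain.

```python
import xml.etree.ElementTree as ET

# Schema fragment from the answer, wrapped in a <schema> root for parsing.
SCHEMA_XML = """
<schema>
  <fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="1536" similarityFunction="dot_product"/>
  <field name="content_vec" type="knn_vector" indexed="true" stored="true"/>
</schema>
"""


def declared_dimension(schema_xml, field_name):
    """Return the vectorDimension declared for the given field's type."""
    root = ET.fromstring(schema_xml)
    field = root.find(f"./field[@name='{field_name}']")
    if field is None:
        raise KeyError(f"field {field_name!r} not found in schema")
    field_type = root.find(f"./fieldType[@name='{field.get('type')}']")
    if field_type is None:
        raise KeyError(f"field type {field.get('type')!r} not found in schema")
    return int(field_type.get("vectorDimension"))


def check_embedding(vector, schema_xml, field_name="content_vec"):
    """Raise ValueError if the vector length differs from the schema's dimension."""
    expected = declared_dimension(schema_xml, field_name)
    if len(vector) != expected:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, schema expects {expected}"
        )
    return expected


# A placeholder vector of the right size passes the check.
print(check_embedding([0.0] * 1536, SCHEMA_XML))
```

In practice the vector would come from the embedding model, for instance embeddings_model.embed_query("relevant question") in the question's setup, rather than from a placeholder list.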