Tags: python, numpy, huggingface-transformers, sentence-transformers, faiss

How to find the actual sentence from sentence transformer?


I am trying to do semantic search with sentence transformer and faiss.

I am able to generate embeddings from the corpus and run a search with the query xq. But what are t

import faiss
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("flax-sentence-embeddings/st-codesearch-distilroberta-base")

def get_embeddings(code_snippets: list[str]):
    return model.encode(code_snippets)

def build_vector_database(atlas_datapoints):
    dimension = 768  # dimensions of each vector

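    # Note: there are no commas after the 2nd and 3rd strings, so Python
    # concatenates the last three literals and the corpus has only 2 entries.
    corpus = ["tom loves candy",
                    "this is a test"
                    "hello world"
                    "jerry loves programming"]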

    code_snippet_emddings = get_embeddings(corpus)
    print(code_snippet_emddings.shape)

    d = code_snippet_emddings.shape[1]
    index = faiss.IndexFlatL2(d)
    print(index.is_trained)

    index.add(code_snippet_emddings)
    print(index.ntotal)

    k = 2
    xq = model.encode(["jerry loves candy"])

    D, I = index.search(xq, k)  # search
    print(I)
    print(D)

This code returns

[[0 1]]
[[1.3480902 1.6274161]]

But I can't tell which sentence xq matched; I only get the indices and the matching scores.

How can I find the top-N matching strings from the corpus?


Solution

  • To retrieve the matching sentences, index the corpus with the values in I, using the variables from your code. I has one row per query, so take the first row:

    [corpus[i] for i in I[0]]
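
To keep each sentence together with its distance, the lookup can be wrapped in a small helper. This is only a sketch: `top_matches` is a hypothetical name, the two-sentence corpus is a stand-in, and `I` and `D` are plain nested lists copied from the printed output in the question rather than real faiss arrays.

```python
def top_matches(corpus, I, D):
    """Pair each returned index with its sentence and distance, one list per query."""
    results = []
    for idx_row, dist_row in zip(I, D):
        row = [
            (corpus[int(i)], float(d))
            for i, d in zip(idx_row, dist_row)
            if i != -1  # faiss pads with -1 when fewer than k neighbours exist
        ]
        results.append(row)
    return results


# Stand-in corpus; indices and distances copied from the question's output.
corpus = ["tom loves candy", "this is a test"]
I = [[0, 1]]
D = [[1.3480902, 1.6274161]]

matches = top_matches(corpus, I, D)
for sentence, distance in matches[0]:
    print(f"{distance:.4f}  {sentence}")
```

The same helper should work unchanged on the NumPy arrays that `index.search()` returns, since it only iterates over rows and converts each entry with `int()` and `float()`.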
    

    But if your corpus is an np.array object, you can use NumPy integer-array indexing instead:

    import numpy as np
    
    # If your corpus is in array form.
    corpus = np.array(['abc def', 'foo bar', 'bar bar sheep'])
    
    # The indices can be a list of integers.
    indices = [1, 0]
    
    # Results.
    corpus[indices]
    

    It gets even more convenient if your indices are already an np.array, like the output of faiss. For example, with 2 queries the index array has shape 2 x k:

    import numpy as np
    
    corpus = np.array(['abc def', 'foo bar', 'bar bar sheep'])
    
    indices = np.array([[1,0], [0,2]])
    
    corpus[indices]
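
Because the indexed result has the same shape as the index array, it lines up row by row with the distance array. A minimal sketch, assuming 2 queries and k=2; the values in `I` and `D` below are invented for illustration and do not come from a real search.

```python
import numpy as np

corpus = np.array(["abc def", "foo bar", "bar bar sheep"])

# Hypothetical search output for 2 queries with k=2 (illustrative values only).
I = np.array([[1, 0], [0, 2]])
D = np.array([[0.10, 0.45], [0.20, 0.90]])

matched = corpus[I]  # same shape as I: one row of sentences per query

for q, (sentences, distances) in enumerate(zip(matched, D)):
    for rank, (s, d) in enumerate(zip(sentences, distances), start=1):
        print(f"query {q} rank {rank}: {s!r} (distance {d:.2f})")
```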
    

    Additional Notes

    The faiss.IndexFlatL2 object returns these through the search() function:

    • labels – output labels of the NNs, size n*k
      • i.e. I in your code snippet holds the indices of the top-K results
    • distances – output pairwise distances, size n*k
      • i.e. D in your code snippet holds the distances of the top-K results from your query string.

    Since you have only 1 query, n=1, so your I and D matrices are of size 1 x k.
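
To see where the n x k shapes come from without installing faiss, here is a brute-force NumPy stand-in for an exact L2 search. It is a sketch only: `flat_l2_search` is a hypothetical helper, not part of faiss, and the 2-dimensional vectors are made up. It returns squared L2 distances, which is also what `IndexFlatL2` reports.

```python
import numpy as np


def flat_l2_search(xb, xq, k):
    """Brute-force exact L2 search; returns (D, I), each of shape (n, k)."""
    # Squared L2 distance between every query row and every database row.
    diff = xq[:, None, :] - xb[None, :, :]
    dist = (diff ** 2).sum(axis=2)            # shape (n, ntotal)
    I = np.argsort(dist, axis=1)[:, :k]       # indices of the k smallest distances
    D = np.take_along_axis(dist, I, axis=1)   # matching distances, ascending
    return D, I


# Made-up database of 3 vectors and a single query (n = 1).
xb = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
xq = np.array([[0.9, 0.1]], dtype="float32")

D, I = flat_l2_search(xb, xq, k=2)
print(I.shape, D.shape)
print(I)
```

With one query, both arrays have a single row, so `I[0]` is the list of corpus positions to look up.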
