Tags: python, indexing, information-retrieval, faiss, haystack

How can I use Haystack to identify the top k sentences that are the closest match to a user query, and then return the docs containing these sentences?


I have a set of 1000 documents (plain texts) and one user query. I want to retrieve the top k documents that are the most relevant to the user query using the Python libraries Haystack and Faiss. Specifically, I want the system to identify the top k sentences that are the closest match to the user query, and then return the documents that contain these sentences. How can I do so?

The following code identifies the top k documents that are the closest match to the user query. How can I change it so that it instead identifies the top k sentences that are the closest match to the user query, and returns the documents that contain these sentences?

# Note: Most of the code is from https://haystack.deepset.ai/tutorials/07_rag_generator

import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

import pandas as pd
from haystack.utils import fetch_archive_from_http

# Download sample
doc_dir = "data/tutorial7/"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/small_generator_dataset.csv.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Create dataframe with columns "title" and "text"
#df = pd.read_csv(f"{doc_dir}/small_generator_dataset.csv", sep=",")
df = pd.read_csv(f"{doc_dir}/small_generator_dataset.csv", sep=",",nrows=10)
# Minimal cleaning
df.fillna(value="", inplace=True)

print(df.head())

from haystack import Document

# Use data to initialize Document objects
titles = list(df["title"].values)
texts = list(df["text"].values)
documents = []
for title, text in zip(titles, texts):
    documents.append(Document(content=text, meta={"name": title or ""}))

from haystack.document_stores import FAISSDocumentStore
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat", return_embedding=True)

from haystack.nodes import RAGenerator, DensePassageRetriever

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    use_gpu=True,
    embed_title=True,
)

# Delete existing documents in documents store
document_store.delete_documents()

# Write documents to document store
document_store.write_documents(documents)

# Add documents embeddings to index
document_store.update_embeddings(retriever=retriever)

from haystack.pipelines import GenerativeQAPipeline
from haystack import Pipeline
pipeline = Pipeline()
pipeline.add_node(component=retriever, name='Retriever', inputs=['Query'])

from haystack.utils import print_answers

QUESTIONS = [
    "who got the first nobel prize in physics",
    "when is the next deadpool movie being released",
]

for question in QUESTIONS:
    res = pipeline.run(query=question, params={"Retriever": {"top_k": 5}})
    print(res)
    #print_answers(res, details="all")

To run the code:

conda create -y --name haystacktest python==3.9
conda activate haystacktest
pip install --upgrade pip
pip install farm-haystack
conda install pytorch -c pytorch
pip install sentence_transformers
pip install farm-haystack[colab,faiss]==1.17.2

E.g., I wonder if there is a way to amend the Faiss indexing strategy.


Solution

  • As Stefano Fiorucci - anakin87 and bilge suggested, one can add metadata to the documents being indexed in the vector database. Therefore, one can index each sentence in the vector database, and use the metadata to link each sentence back to its original document.

    Here is bilge's full answer:

    The first thing you need to do is split your documents by sentence. You can easily do this with PreProcessor, using split_by="sentence" and split_length=1. Then top_k will retrieve the top k similar sentences 🙂 You can then iterate over the documents and add the title to their contents if you want.

    Before you split your documents and write them to the document store, you can add the title or any identifier for each document to the meta field so that you can go and get the full document later on.

    Also, that tutorial (https://haystack.deepset.ai/tutorials/07_rag_generator) is outdated; check this one if you're doing RAG: https://haystack.deepset.ai/tutorials/22_pipeline_with_promptnode
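
Applied to the code in the question, bilge's suggestion could look roughly like the sketch below. It assumes Haystack 1.x (the farm-haystack 1.17.2 pinned in the question). The helper names and the "parent_id" meta key are hypothetical, chosen only for illustration. The sketch also assumes that split_respect_sentence_boundary has to be disabled when splitting by sentence, and that each split keeps a copy of its parent's meta.

```python
def add_parent_ids(documents):
    """Tag every full document with an identifier before it is split.

    "parent_id" is a hypothetical meta key chosen for this sketch.
    """
    for i, doc in enumerate(documents):
        doc.meta["parent_id"] = i
    return documents


def split_into_sentences(documents):
    """Turn full Haystack 1.x Documents into one Document per sentence."""
    # Imported lazily so parents_of_hits() stays usable without Haystack.
    from haystack.nodes import PreProcessor

    preprocessor = PreProcessor(
        split_by="sentence",
        split_length=1,
        split_overlap=0,
        # Assumption: this option only applies to split_by="word".
        split_respect_sentence_boundary=False,
    )
    # Each returned split is assumed to carry a copy of its parent's meta.
    return preprocessor.process(documents)


def parents_of_hits(sentence_hits, parents_by_id):
    """Map retrieved sentences back to their parent documents.

    Hits are assumed to be sorted best first; each parent is returned
    once, in the order of its best-matching sentence.
    """
    parents = {}
    for hit in sentence_hits:
        parent_id = hit.meta["parent_id"]
        if parent_id not in parents:
            parents[parent_id] = parents_by_id[parent_id]
    return list(parents.values())


# Possible usage with the objects defined in the question:
#   documents = add_parent_ids(documents)
#   parents_by_id = {d.meta["parent_id"]: d for d in documents}
#   document_store.write_documents(split_into_sentences(documents))
#   document_store.update_embeddings(retriever=retriever)
#   hits = retriever.retrieve(query=question, top_k=5)
#   for doc in parents_of_hits(hits, parents_by_id):
#       print(doc.meta["name"])
```

Several of the top k sentences can come from the same document, so this may return fewer than k documents. If exactly k distinct documents are needed, one option is to retrieve more sentences than k and stop once k parents have been collected.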


    Here is example code that creates a vector store with metadata using LangChain (not Haystack, but the same principle applies):

    import pprint
    from langchain_community.vectorstores import FAISS
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain.docstore.document import Document
    
    model = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
    embeddings = HuggingFaceEmbeddings(model_name = model)
    
    def main():
        doc1 = Document(page_content="The sky is blue.",    metadata={"document_id": "10"})
        doc2 = Document(page_content="The forest is green", metadata={"document_id": "62"})
        docs = []
        docs.append(doc1)
        docs.append(doc2)
    
        for doc in docs:
            doc.metadata['summary'] = 'hello'
    
        pprint.pprint(docs)
        db = FAISS.from_documents(docs, embeddings)
        db.save_local("faiss_index")
        new_db = FAISS.load_local("faiss_index", embeddings)
    
        query = "Which color is the sky?"
        docs = new_db.similarity_search_with_score(query)
        print('Retrieved docs:', docs)
        print('Metadata of the most relevant document:', docs[0][0].metadata)
    
    if __name__ == '__main__':
        main()
    

    Tested with Python 3.11 with:

    pip install langchain==0.1.1 langchain_openai==0.0.2.post1 sentence-transformers==2.2.2 langchain_community==0.0.13 faiss-cpu==1.7.4