
AttributeError: 'list' object has no attribute 'similarity' when using dense passage retriever (Pinecone) in Haystack - Python


I'm getting the following error when trying to run my code:

"AttributeError: 'list' object has no attribute 'similarity'"

I'm running an NLP Q&A pipeline with Haystack, and I've recently tried to integrate the Pinecone vector database, which is what causes the error.

The pipeline up to the point of the error is essentially: initialise the Pinecone document store -> convert the data to Haystack-compatible docs -> pre-process the docs -> pass them to the Haystack dense passage retriever.

To simplify things, I've collected the various modules and put all the code into a single executable Python file, shared below:

import logging
import os
from haystack.document_stores import InMemoryDocumentStore
from haystack.pipelines.standard_pipelines import TextIndexingPipeline
import asyncio
import time
#from kai.pinecone_system import initiate_pinecone
from haystack import Pipeline

from haystack.document_stores import PineconeDocumentStore

####REMOVE
def initiate_pinecone():
    print("Testing Pinecone")
    ENV = "eu-west1-gcp"
    API = "fake-api-key"
    document_store = PineconeDocumentStore(
        api_key=API,
        index='esmo',
        environment=ENV,
    )
    return document_store
####REMOVE

# LOGGING
logging.basicConfig(
    format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING
)
logging.getLogger("haystack").setLevel(logging.INFO)

# DOC STORE
document_store = initiate_pinecone()

from haystack.nodes import TextConverter, PDFToTextConverter, DocxToTextConverter, PreProcessor
from haystack.utils import convert_files_to_docs

# DATA to DOCS
doc_dir = "data/esmo"
#converter = TextConverter(remove_numeric_tables=True, valid_languages=["en"])
#doc_txt = converter.convert(file_path="data/esmo", meta=None)[0]
all_docs = convert_files_to_docs(dir_path=doc_dir)

# PRE-PROCESSOR
from haystack.nodes import PreProcessor
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=150,
    split_respect_sentence_boundary=True,
    split_overlap=0,
)
processed_esmo_docs = preprocessor.process(all_docs)
print(f"n_files_input: {len(all_docs)}\nn_docs_output: {len(processed_esmo_docs)}")
print(processed_esmo_docs[0])

# WRITE DOCUMENT OBJECTS INTO DOCUMENT STORE
import os
from haystack.pipelines.standard_pipelines import TextIndexingPipeline

files_to_index = [doc_dir + "/" + f for f in os.listdir(doc_dir)]
indexing_pipeline = TextIndexingPipeline(document_store)
indexing_pipeline.run_batch(file_paths=files_to_index)

from haystack.nodes import DensePassageRetriever
retriever = DensePassageRetriever(
    document_store=processed_esmo_docs,
    #document_store=all_docs,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    max_seq_len_query=64,
    max_seq_len_passage=256,
    batch_size=2,
    use_gpu=True,
    embed_title=True,
    use_fast_tokenizers=True,
)

## INITIALIZE READER
from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="michiyasunaga/BioLinkBERT-large", use_gpu=True)

## GET PIPELINE UP (RETRIEVER / READER)
from haystack.pipelines import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)
prediction = ""

Many thanks in advance for any advice.

I've tried changing the Pinecone vector DB similarity to cosine and dot product, and I've both altered and removed the pre-processing, with no effect. I understand that the document store is expected to have an attribute called similarity, but I'm not sure what that is exactly.


Solution

  • I have modified your question for security reasons.

    In any case, I think you are instantiating the retriever incorrectly.

    As you can see in the documentation, DensePassageRetriever.__init__ expects the document_store parameter, which is the document store to be queried; instead, you are passing the list of preprocessed documents.

    You should try the following retriever initialization:

    retriever = DensePassageRetriever(
        document_store=document_store,
        ...)
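
    Regarding the similarity attribute you mention: in Haystack v1 the retriever reads settings such as the similarity metric ("cosine" or "dot_product") from the document store object it is handed, which is why passing a plain Python list of documents raises AttributeError: 'list' object has no attribute 'similarity'. Below is a minimal sketch of the corrected wiring under that assumption; the API key, index name, environment, data directory, model names and top_k values are taken from your snippet or are placeholders, not verified settings.

    from haystack.document_stores import PineconeDocumentStore
    from haystack.nodes import DensePassageRetriever, FARMReader
    from haystack.pipelines import ExtractiveQAPipeline
    from haystack.utils import convert_files_to_docs

    # Document store: DPR embeddings are trained with dot product,
    # so "dot_product" is usually the appropriate similarity here.
    document_store = PineconeDocumentStore(
        api_key="fake-api-key",
        index="esmo",
        environment="eu-west1-gcp",
        similarity="dot_product",
    )

    # Convert and write the documents into the store; the store (not a list)
    # is what the retriever queries later.
    all_docs = convert_files_to_docs(dir_path="data/esmo")
    document_store.write_documents(all_docs)

    # The retriever gets the *store*; the preprocessed docs never go here.
    retriever = DensePassageRetriever(
        document_store=document_store,
        query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
        passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    )

    # Compute the passage embeddings and store them in Pinecone.
    document_store.update_embeddings(retriever)

    reader = FARMReader(model_name_or_path="michiyasunaga/BioLinkBERT-large", use_gpu=True)
    pipe = ExtractiveQAPipeline(reader, retriever)

    prediction = pipe.run(
        query="Your question here",
        params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
    )

    If you keep the TextIndexingPipeline for writing the files, you can drop the separate write_documents call; either way, update_embeddings(retriever) is what populates the dense vectors the retriever searches against.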