I'm getting the following error:
"AttributeError: 'list' object has no attribute 'similarity'"
when trying to run my code. I'm running an NLP Q&A pipeline with Haystack. I've recently tried to integrate the Pinecone vector database, which is causing the error.
The pipeline up to the error is essentially as follows: initialise Pinecone document store -> convert data to Haystack-compatible docs -> pre-process docs -> pass to the Haystack DensePassageRetriever.
To simplify the situation, I've collected the various modules and put all the code in a single executable Python file, shared below:
import logging
import os
from haystack.document_stores import InMemoryDocumentStore
from haystack.pipelines.standard_pipelines import TextIndexingPipeline
import asyncio
import time
#from kai.pinecone_system import initiate_pinecone
from haystack import Pipeline
from haystack.document_stores import PineconeDocumentStore
####REMOVE
def initiate_pinecone():
    print("Testing Pinecone")
    ENV = "eu-west1-gcp"
    API = "fake-api-key"
    document_store = PineconeDocumentStore(
        api_key=API,
        index='esmo',
        environment=ENV,
    )
    return document_store
####REMOVE
# LOGGING
logging.basicConfig(
    format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING
)
logging.getLogger("haystack").setLevel(logging.INFO)
# DOC STORE
document_store = initiate_pinecone()
from haystack.nodes import TextConverter, PDFToTextConverter, DocxToTextConverter, PreProcessor
from haystack.utils import convert_files_to_docs
# DATA to DOCS
doc_dir = "data/esmo"
#converter = TextConverter(remove_numeric_tables=True, valid_languages=["en"])
#doc_txt = converter.convert(file_path="data/esmo", meta=None)[0]
all_docs = convert_files_to_docs(dir_path=doc_dir)
# PRE-PROCESSOR
from haystack.nodes import PreProcessor
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=150,
    split_respect_sentence_boundary=True,
    split_overlap=0
)
processed_esmo_docs = preprocessor.process(all_docs)
print(f"n_files_input: {len(all_docs)}\nn_docs_output: {len(processed_esmo_docs)}")
print(processed_esmo_docs[0])
# write document objects into document store
import os
from haystack.pipelines.standard_pipelines import TextIndexingPipeline
files_to_index = [doc_dir + "/" + f for f in os.listdir(doc_dir)]
indexing_pipeline = TextIndexingPipeline(document_store)
indexing_pipeline.run_batch(file_paths=files_to_index)
from haystack.nodes import DensePassageRetriever
retriever = DensePassageRetriever(
    document_store=processed_esmo_docs,
    #document_store=all_docs,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    max_seq_len_query=64,
    max_seq_len_passage=256,
    batch_size=2,
    use_gpu=True,
    embed_title=True,
    use_fast_tokenizers=True)
##INITIALIZE READER
from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="michiyasunaga/BioLinkBERT-large", use_gpu=True)
##GET PIPELINE UP (RETRIEVER / READER)
from haystack.pipelines import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)
prediction = ""
Many thanks in advance for any advice.
I've tried changing the Pinecone vector DB similarity to cosine and dot product. I also altered the pre-processing, and then removed it entirely, with no effect. I understand that the document store is expecting an attribute called similarity, but I'm not sure what that is exactly.
I have modified your question for security reasons.
In any case, I think you are instantiating the retriever incorrectly. As you can see in the documentation, DensePassageRetriever.__init__ expects a document_store parameter, which is the document store to be queried; instead, you are incorrectly passing it the list of preprocessed documents (hence the AttributeError on a 'list' object). You should try the following retriever initialization:
retriever = DensePassageRetriever(
    document_store=document_store,
    ...)
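As a fuller sketch of the corrected indexing flow (assuming Haystack 1.x and the document_store and processed_esmo_docs variables from your script; this is not tested against your Pinecone index), you would first write the preprocessed documents into the store, then pass the store itself to the retriever, and finally have the store compute the dense embeddings:

```python
# Write the preprocessed documents into the Pinecone store;
# the retriever queries the store, never a raw list of documents
document_store.write_documents(processed_esmo_docs)

retriever = DensePassageRetriever(
    document_store=document_store,  # the store, not the docs
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
)

# Generate and persist embeddings for all documents in the store
document_store.update_embeddings(retriever)
```

Note that your TextIndexingPipeline.run_batch call already writes raw documents into the store, so writing the preprocessed docs yourself may duplicate content; you likely want one indexing path or the other, not both.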