python llama llama-index pinecone rag

VectorStoreIndex.from_documents fails with "AttributeError: 'Document' object has no attribute 'get_doc_id'"


I'm using Predibase and LlamaIndex to set up the moving parts of a RAG system, with Predibase as the LLM provider. Right now I'm trying to create the index so that any query I make pulls the relevant context from my Pinecone vector store.

This is the line of Python that should do that:

index = VectorStoreIndex.from_documents(documents, storage_context=pinecone_storage_context)

Here, the variable 'documents' that I'm passing in is a 'list', but the call fails with "AttributeError: 'Document' object has no attribute 'get_doc_id'".
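Since the list itself is accepted, a quick check is to look at the class of its elements: the traceback names a 'Document' class, and that class's module shows which library it comes from. A minimal sketch using only the standard library; element_types is a hypothetical helper name, not part of LlamaIndex:

```python
from typing import Iterable, List


def element_types(items: Iterable[object]) -> List[str]:
    """Return the sorted, de-duplicated 'module.ClassName' of every element."""
    return sorted({f"{type(x).__module__}.{type(x).__qualname__}" for x in items})


if __name__ == "__main__":
    # Stand-in values; in the notebook this would be element_types(documents).
    print(element_types([1, "a", 2.5, "b"]))
```

If the module printed for 'documents' is not a llama_index one, the objects would need converting before they are passed to from_documents.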

Is there any way to solve this? This is the resource I'm currently following.

Colab Python code

# Extract Filings Function
def get_filings(ticker):
    global sec_api_key

    # Finding Recent Filings with QueryAPI
    queryApi = QueryApi(api_key=sec_api_key)
    query = {
      "query": f"ticker:{ticker} AND formType:\"10-K\"",
      "from": "0",
      "size": "1",
      "sort": [{ "filedAt": { "order": "desc" } }]
    }
    filings = queryApi.get_filings(query)

    # Getting 10-K URL
    filing_url = filings["filings"][0]["linkToFilingDetails"]

    # Extracting Text with ExtractorAPI
    extractorApi = ExtractorApi(api_key=sec_api_key)
    onea_text = extractorApi.get_section(filing_url, "1A", "text") # Section 1A - Risk Factors
    seven_text = extractorApi.get_section(filing_url, "7", "text") # Section 7 - Management’s Discussion and Analysis of Financial Condition and Results of Operations

    # Joining Texts
    combined_text = onea_text + "\n\n" + seven_text

    return combined_text

# Construct the vector store and a custom storage context
pinecone_index = pc.Index("predibase-demo-hf")
pinecone_vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
pinecone_storage_context = StorageContext.from_defaults(vector_store=pinecone_vector_store)


# Prompt the user to input the stock ticker they want to analyze
ticker = input("What Ticker Would you Like to Analyze? ex. AAPL: ")

print("-----")
print("Getting Filing Data")
# Retrieve the filing data for the specified ticker
filing_data = get_filings(ticker)

print("-----")
print("Initializing Vector Database")
# Initialize a text splitter to divide the filing data into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,         # Maximum size of each chunk
    chunk_overlap = 500,       # Number of characters to overlap between chunks
    length_function = len,     # Function to determine the length of the chunks
    is_separator_regex = False # Whether the separator is a regex pattern
)
# Split the filing data into smaller, manageable chunks
split_data = text_splitter.create_documents([filing_data])


# Load in the documents you want to index
documents = split_data
index = VectorStoreIndex.from_documents(documents, storage_context=pinecone_storage_context)

Error

AttributeError: 'Document' object has no attribute 'get_doc_id'


Solution

    from llama_index.core import Document, VectorStoreIndex
    
    # Split the filing data into smaller, manageable chunks.
    # Note: create_documents() returns LangChain Document objects, not LlamaIndex ones.
    split_data = text_splitter.create_documents([filing_data])

    # Here we create the index so that any query you make will pull the relevant context from your Vector Store.

    # Convert each LangChain chunk into a LlamaIndex Document
    documents = [Document(text=chunk.page_content) for chunk in split_data]

    # build index
    index = VectorStoreIndex.from_documents(documents, storage_context=pinecone_storage_context)

    The problem was that VectorStoreIndex.from_documents expects LlamaIndex Document objects. The text_splitter here is LangChain's RecursiveCharacterTextSplitter, so create_documents() returns LangChain Document objects, which have no get_doc_id method.

    I was looking for a way to get split_data into a format that VectorStoreIndex can use. Wrapping text_splitter.create_documents([filing_data]) in str() is not enough: it turns the whole list into a single string, so iterating over it yields one Document per character. Creating the documents manually from each chunk's page_content keeps one Document per chunk:

    documents = [Document(text=chunk.page_content) for chunk in split_data]

    Hope this helps somebody facing a similar issue.
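    For reuse, the conversion step can be wrapped in a small helper. This is only a sketch, assuming the chunks are LangChain-style objects that expose page_content and metadata; chunks_to_documents and its make_document parameter are hypothetical names, and the default factory assumes llama-index is installed with the same import used above.

```python
from types import SimpleNamespace
from typing import Any, Callable, Iterable, List, Optional


def chunks_to_documents(
    chunks: Iterable[Any],
    make_document: Optional[Callable[..., Any]] = None,
) -> List[Any]:
    """Build one document per chunk from its page_content and metadata."""
    if make_document is None:
        # Assumption: llama-index is installed and exposes Document here.
        from llama_index.core import Document
        make_document = Document

    documents = []
    for chunk in chunks:
        text = getattr(chunk, "page_content", None)
        if not isinstance(text, str):
            # Fail early instead of silently indexing the wrong thing
            raise TypeError(
                f"expected a chunk with a str page_content, got {type(chunk).__name__}"
            )
        # Copy the metadata so the original chunk is not mutated downstream
        metadata = dict(getattr(chunk, "metadata", None) or {})
        documents.append(make_document(text=text, metadata=metadata))
    return documents


if __name__ == "__main__":
    # Stand-in chunk shaped like a LangChain Document; dict is a dummy factory
    fake_chunks = [SimpleNamespace(page_content="Risk factors ...", metadata={"section": "1A"})]
    print(chunks_to_documents(fake_chunks, make_document=dict))
```

    In the notebook this would be called as documents = chunks_to_documents(split_data) before building the index. Passing a plain string raises TypeError, which guards against the one-Document-per-character mistake.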