Tags: python, vectorization, embedding, langchain, large-language-model

Issues with Loading and Vectorizing Multiple PDFs using Langchain


I am trying to use VectorstoreIndexCreator().from_loaders(loaders) from the langchain package, where loaders is a list of UnstructuredPDFLoader instances, each loading a different PDF file. However, the call fails with an UnboundLocalError for a local variable named isalnum.

Here’s the relevant part of the error traceback:

File …/site-packages/unstructured/documents/elements.py:1007, in process_metadata….
UnboundLocalError: local variable 'isalnum' referenced before assignment

Here's a simplified version of my code:

from langchain.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator

loaders = [UnstructuredPDFLoader(filepath) for filepath in filepaths]
index = VectorstoreIndexCreator().from_loaders(loaders)

Interestingly, when I use WebBaseLoader to load a web document instead of a PDF, the code works perfectly:

from langchain.document_loaders import WebBaseLoader
from langchain.indexes import VectorstoreIndexCreator

loader = WebBaseLoader("https://example.com")
index = VectorstoreIndexCreator().from_loaders([loader])

Question:

Has anyone encountered a similar issue with UnstructuredPDFLoader from langchain, and if so, how did you resolve it?

Solution

  • The issue has since been resolved, and VectorstoreIndexCreator is working again.
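
For anyone who hits the same traceback: it points into the unstructured package rather than into langchain's index code, so it can help to find out which PDF triggers the error before building the index. The following is only a minimal diagnostic sketch, not a fix. The helper name load_each is hypothetical, and it assumes nothing about the loaders beyond each one exposing a load() method that returns a list of documents.

```python
from typing import Any, Iterable, List, Tuple


def load_each(loaders: Iterable[Any]) -> Tuple[List[Any], List[Tuple[int, Exception]]]:
    """Call load() on each loader separately (hypothetical helper).

    Returns the documents that loaded successfully, plus a list of
    (position, exception) pairs for the loaders that raised.
    """
    documents: List[Any] = []
    failures: List[Tuple[int, Exception]] = []
    for position, loader in enumerate(loaders):
        try:
            documents.extend(loader.load())
        except Exception as error:  # deliberately broad: this is a diagnostic pass
            failures.append((position, error))
    return documents, failures


# Possible usage with the variables from the question (assumed to exist):
# documents, failures = load_each(loaders)
# for position, error in failures:
#     print(filepaths[position], repr(error))
```

If only some files fail, the problem is likely specific to those PDFs. If every file fails, the installed package versions are a more likely place to look.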