Tags: python, pdf, pypdf, langchain

Issue with loading online pdf in python notebook using langchain PyPDFLoader


I am trying to load an online PDF with the Python langchain library from: http://datasheet.octopart.com/CL05B683KO5NNNC-Samsung-Electro-Mechanics-datasheet-136482222.pdf

This is the code that I'm running locally:

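from langchain.document_loaders import PyPDFLoader
datasheet_path = "http://datasheet.octopart.com/CL05B683KO5NNNC-Samsung-Electro-Mechanics-datasheet-136482222.pdf"
loader = PyPDFLoader(datasheet_path)
pages = loader.load_and_split()

I am getting the following error: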
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
Cell In[4], line 8
      6 datasheet_path = "http://datasheet.octopart.com/CL05B683KO5NNNC-Samsung-Electro-Mechanics-datasheet-136482222.pdf"
      7 loader = PyPDFLoader(datasheet_path)
----> 8 pages = loader.load_and_split()
     11 query = """

File ***\.venv\lib\site-packages\langchain\document_loaders\base.py:36, in BaseLoader.load_and_split(self, text_splitter)
     34 else:
     35     _text_splitter = text_splitter
---> 36 docs = self.load()
     37 return _text_splitter.split_documents(docs)
...
   (...)
    114         for i, page in enumerate(pdf_reader.pages)
    115     ]

PermissionError: [Errno 13] Permission denied: 'C:\\Users\\****\\AppData\\Local\\Temp\\tmpu_59ngam'

Note 1: Running the same code in Google Colab works well.

Note 2: Running the following code in the same notebook works correctly, so I don't think access to the temp folder itself is the problem:

with open('C:\\Users\\benis\\AppData\\Local\\Temp\\test.txt', 'w') as h:
    h.write("test")

Note 3: I have tested several different online PDFs and got the same error for all of them.

The code should convert the PDF to text and split it into pages using langchain and pypdf.


Solution

  • You will not succeed with this task using langchain on Windows with their current implementation. You can take a look at the source code here. Consider the following abridged code:

    class BasePDFLoader(BaseLoader, ABC):
        def __init__(self, file_path: str):
            ...
            # If the file is a web path, download it to a temporary file, and use that
            if not os.path.isfile(self.file_path) and self._is_valid_url(self.file_path):
                r = requests.get(self.file_path)
    
                ...
                self.web_path = self.file_path
                self.temp_file = tempfile.NamedTemporaryFile()
                self.temp_file.write(r.content)
                self.file_path = self.temp_file.name
                ...
    
        def __del__(self) -> None:
            if hasattr(self, "temp_file"):
                self.temp_file.close()
    

    Note that they open the file in the constructor and close it in the destructor, so the temporary file is still open when the loader later tries to read it by name. Now let's look at the Python documentation on NamedTemporaryFile (emphasis mine, docs are for Python 3.9):

    This function operates exactly as TemporaryFile() does, except that the file is guaranteed to have a visible name in the file system (on Unix, the directory entry is not unlinked). That name can be retrieved from the name attribute of the returned file-like object. Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows).
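A minimal sketch of one possible workaround, which is not part of the original answer: download the PDF yourself, write it to a temporary file created with delete=False, and close the handle before giving the path to the loader. The helper names (can_reopen_open_tempfile, write_closed_temp_file, download_to_temp_pdf) are hypothetical, and the PyPDFLoader usage shown in the comments assumes the same import path that appears in the traceback above.

```python
import os
import tempfile
import urllib.request


def can_reopen_open_tempfile() -> bool:
    """Probe whether a NamedTemporaryFile can be opened by name while still open.

    The quoted documentation says this works on Unix and fails on Windows.
    """
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"data")
        tmp.flush()
        try:
            with open(tmp.name, "rb") as second_handle:
                return second_handle.read() == b"data"
        except PermissionError:
            return False


def write_closed_temp_file(content: bytes, suffix: str = ".pdf") -> str:
    """Write content to a temp file and close it so it can be reopened by name.

    delete=False keeps the file on disk after close(); the caller must remove it.
    """
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(content)
    finally:
        tmp.close()
    return tmp.name


def download_to_temp_pdf(url: str) -> str:
    """Download url and return the path of a closed local temp file (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return write_closed_temp_file(response.read())


# Assumed usage with langchain (needs network access and langchain installed):
#
#     local_path = download_to_temp_pdf(datasheet_path)
#     try:
#         pages = PyPDFLoader(local_path).load_and_split()
#     finally:
#         os.remove(local_path)

if __name__ == "__main__":
    # Offline demonstration: the closed file can be reopened on any platform.
    path = write_closed_temp_file(b"example bytes")
    with open(path, "rb") as handle:
        print("reopened closed temp file:", handle.read() == b"example bytes")
    os.remove(path)
    print("can reopen while still open:", can_reopen_open_tempfile())
```

Because the loader then receives an ordinary local path, it should not take the URL branch in the constructor shown above and so never creates its own NamedTemporaryFile. The trade-off is that cleanup becomes the caller's responsibility.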