Issue with loading online pdf in python notebook using langchain PyPDFLoader

I am trying to load with python langchain library an online pdf from: http://datasheet.octopart.com/CL05B683KO5NNNC-Samsung-Electro-Mechanics-datasheet-136482222.pdf

This is the code that I'm running locally:

loader = PyPDFLoader(datasheet_path)
pages  = loader.load_and_split()

Am getting the following error
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
Cell In[4], line 8
      6 datasheet_path = "http://datasheet.octopart.com/CL05B683KO5NNNC-Samsung-Electro-Mechanics-datasheet-136482222.pdf"
      7 loader = PyPDFLoader(datasheet_path)
----> 8 pages = loader.load_and_split()
     11 query = """

File ***\.venv\lib\site-packages\langchain\document_loaders\base.py:36, in BaseLoader.load_and_split(self, text_splitter)
     34 else:
     35     _text_splitter = text_splitter
---> 36 docs = self.load()
     37 return _text_splitter.split_documents(docs)
...
   (...)
    114         for i, page in enumerate(pdf_reader.pages)
    115     ]

PermissionError: [Errno 13] Permission denied: 'C:\\Users\\****\\AppData\\Local\\Temp\\tmpu_59ngam'

Note1: running the same code in google Colab works well Note2: running the following code in the same notebook is working correctly so I'm not sure access to the temp folder is problematic in any manner:

with open('C:\\Users\\benis\\AppData\\Local\\Temp\\test.txt', 'w') as h:
    h.write("test")

Note3: I have tested several different online pdf. got same error for all.

The code should covert pdf to text and split to pages using Langchain and pyplot

Solution

You will not succeed with this task using langchain on windows with their current implementation. You can take a look at the source code here. Consider the following abridged code:

class BasePDFLoader(BaseLoader, ABC):
    def __init__(self, file_path: str):
        ...
        # If the file is a web path, download it to a temporary file, and use that
        if not os.path.isfile(self.file_path) and self._is_valid_url(self.file_path):
            r = requests.get(self.file_path)

            ...
            self.web_path = self.file_path
            self.temp_file = tempfile.NamedTemporaryFile()
            self.temp_file.write(r.content)
            self.file_path = self.temp_file.name
            ...

    def __del__(self) -> None:
        if hasattr(self, "temp_file"):
            self.temp_file.close()

Note that they open the file in the constructor, and close it in the destructor. Now let's look at the python documentation on NamedTemporaryFile (emphasis mine, docs are for python3.9):

This function operates exactly as TemporaryFile() does, except that the file is guaranteed to have a visible name in the file system (on Unix, the directory entry is not unlinked). That name can be retrieved from the name attribute of the returned file-like object. Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows).