Tags: azure, azure-blob-storage, databricks, langchain

Langchain PyPDFLoader read from Azure Blob Storage mount point in Azure Databricks


I am working on Azure Databricks and trying to read a PDF file stored in Azure Blob Storage, using Langchain's PyPDFLoader. According to the examples I have seen, PyPDFLoader takes the path of the PDF file, so I pass it the path under the mount point, but it does not work.

Here is the code snippet:

blob_path = f"wasbs://{container_name}@{storage_account}.blob.core.windows.net"
account_url = f"fs.azure.account.key.{storage_account}.blob.core.windows.net"
account_key = "******************"


dbutils.fs.mount(
    source=blob_path,
    mount_point=f"/mnt/{mount_point}/",
    extra_configs={account_url: account_key},
)

loader = PyPDFLoader(f'/mnt/{mount_point}/raw-docs/doc.pdf')
pages = loader.load()

I get the following error:

"ValueError: File path /mnt/mpoint/raw-docs/doc.pdf is not a valid file or url"

I have tried moving the PDF to a different location and reading it with some other functions, but I still get the same error.


Solution

  • You need to prefix the PDF path with /dbfs because you are not reading the file through Spark. PyPDFLoader looks the path up on the driver's local filesystem, where DBFS (including mounts) is exposed under /dbfs, so you need to give the full path from the root: /dbfs/mnt/<mount_point>/...

    Alter the path like below.

      loader = PyPDFLoader(f'/dbfs/mnt/{mount_point}/testfolder/sample.pdf')
      pages = loader.load()
      pages[0]
    


    Refer to the Databricks documentation on working with files to learn more.
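
The path change described above can be wrapped in a small helper, so the same mount path works with both Spark-style APIs and local-file libraries such as PyPDFLoader. This is only a minimal sketch: `to_local_dbfs_path` and `require_local_file` are hypothetical names, not part of Databricks or Langchain, and the code assumes DBFS is available on the driver under /dbfs.

```python
import os


def to_local_dbfs_path(path: str) -> str:
    """Map a DBFS-style path to its driver-local /dbfs equivalent.

    Hypothetical helper; assumes DBFS is exposed on the driver under /dbfs.
    """
    if path.startswith("/dbfs/"):
        # Already a driver-local path, leave it alone.
        return path
    if path.startswith("dbfs:/"):
        # Drop the URI scheme used by Spark and dbutils.
        path = path[len("dbfs:"):]
    return "/dbfs/" + path.lstrip("/")


def require_local_file(path: str) -> str:
    """Return the driver-local path, or fail early if no file exists there."""
    local_path = to_local_dbfs_path(path)
    if not os.path.isfile(local_path):
        raise FileNotFoundError(f"No file at {local_path} on the driver")
    return local_path


print(to_local_dbfs_path("/mnt/mpoint/raw-docs/doc.pdf"))
print(to_local_dbfs_path("dbfs:/mnt/mpoint/raw-docs/doc.pdf"))
```

With this in place, the loader call from the question would become `PyPDFLoader(require_local_file(f'/mnt/{mount_point}/raw-docs/doc.pdf'))`. A missing file then raises a `FileNotFoundError` naming the resolved path, rather than the less specific "not a valid file or url" message.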