Tags: python, azure, azure-machine-learning-service, langchain, azureml-python-sdk

How to load pdf files from Azure Blob Storage with LangChain PyPDFLoader


I am currently trying to implement LangChain functionality to talk with PDF documents. I have a bunch of PDF files stored in Azure Blob Storage, and I am trying to use the LangChain PyPDFLoader to load them into an Azure ML notebook. However, I have not been able to get it to work. If a PDF is stored locally, it is no problem, but to scale up I have to connect to the blob store. I have not found any relevant documentation on the LangChain or Azure websites. I am wondering if any of you have had a similar problem.

Thank you

Below is an example of the code I am trying:

from azureml.fsspec import AzureMachineLearningFileSystem
fs = AzureMachineLearningFileSystem("<path to datastore>")

from langchain.document_loaders import PyPDFLoader
with fs.open('*/.../file.pdf', 'rb') as fd:
    loader = PyPDFLoader(fd)  # passing the open file object raises the error below
    data = loader.load()

Error: TypeError: expected str, bytes or os.PathLike object, not StreamInfoFileObject

Another example I tried:

from langchain.document_loaders import UnstructuredFileLoader
with fs.open('*/.../file.pdf', 'rb') as fd:
    loader = UnstructuredFileLoader(fd)
    documents = loader.load()

Error: TypeError: expected str, bytes or os.PathLike object, not StreamInfoFileObject

Solution

  • If you still need an answer: these loaders expect a file path on disk rather than a file-like object, so you must convert the blob data into a BytesIO object and save it locally (whether temporarily or permanently) before processing the files. Here is how I do it:

    import io

    # get_blob_container_client is a helper (not shown) that returns an
    # azure.storage.blob ContainerClient for the given container.
    def az_load_files(storage_acc_name, container_name, filenames=None):
        container_client = get_blob_container_client(container_name, storage_acc_name)
        blob_data = []
        for filename in filenames:
            blob_client = container_client.get_blob_client(filename)
            if blob_client.exists():
                blob_data.append(io.BytesIO(blob_client.download_blob().readall()))
        return blob_data
    
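    For example, a hypothetical call might look like this (the storage account name, container name and file name are placeholders):

    loaded_files = az_load_files(
        "mystorageaccount",      # placeholder storage account name
        "my-container",          # placeholder container name
        filenames=["file.pdf"],  # blob names inside the container
    )
    # Each element of loaded_files is an io.BytesIO holding one blob's bytes.
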

    Then create a temp folder where the BytesIO objects can be written out and 'converted' into their respective document types:

    import os
    import tempfile
    
    temp_pdfs = []
    temp_dir = tempfile.mkdtemp()
    # ss is assumed to hold the downloaded BytesIO objects ('loaded_files')
    # and their corresponding file names ('selected_files').
    for i, byteio in enumerate(ss['loaded_files']):
        file_path = os.path.join(temp_dir, ss['selected_files'][i])
        with open(file_path, 'wb') as file:
            file.write(byteio.getbuffer())
        temp_pdfs.append(file_path)
    
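    If you only need the files on disk while they are being loaded, a minimal alternative sketch (assuming the same ss dict and a PDF-only workload) uses tempfile.TemporaryDirectory, which removes the folder automatically when the block exits; the DirectoryLoader approach below works the same way inside the with block:

    import os
    import tempfile

    from langchain.document_loaders import PyPDFLoader

    with tempfile.TemporaryDirectory() as tmp_dir:
        for name, byteio in zip(ss['selected_files'], ss['loaded_files']):
            with open(os.path.join(tmp_dir, name), 'wb') as f:
                f.write(byteio.getbuffer())
        # Load while the directory still exists.
        pages = []
        for name in ss['selected_files']:
            if name.lower().endswith('.pdf'):
                pages += PyPDFLoader(os.path.join(tmp_dir, name)).load()
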

    And use DirectoryLoader to load any type of doc you may have

    from langchain.text_splitter import RecursiveCharacterTextSplitter 
    from langchain.document_loaders import (
      PyPDFLoader,
      DirectoryLoader,
      CSVLoader,
      Docx2txtLoader,
      TextLoader,
      UnstructuredExcelLoader,
      UnstructuredHTMLLoader,
      UnstructuredPowerPointLoader,
      UnstructuredMarkdownLoader,
      JSONLoader
    )
    
    file_type_mappings = {
        '*.txt': TextLoader,
        '*.pdf': PyPDFLoader,
        '*.csv': CSVLoader,
        '*.docx': Docx2txtLoader,
        '*.xls': UnstructuredExcelLoader,
        '*.xlsx': UnstructuredExcelLoader,
        '*.html': UnstructuredHTMLLoader,
        '*.pptx': UnstructuredPowerPointLoader,
        '*.ppt': UnstructuredPowerPointLoader,
        '*.md': UnstructuredMarkdownLoader,
        '*.json': JSONLoader,
    }
    
    
    docs = []
    
    for glob_pattern, loader_cls in file_type_mappings.items():
        try:
            # JSONLoader needs extra arguments; the other loaders take none.
            loader_kwargs = {'jq_schema': '.', 'text_content': False} if loader_cls == JSONLoader else None
            loader_dir = DirectoryLoader(
                temp_dir, glob=glob_pattern, loader_cls=loader_cls, loader_kwargs=loader_kwargs)
            documents = loader_dir.load_and_split()
            text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=800, chunk_overlap=200)
            # For each glob pattern, split the loaded documents and append the chunks.
            docs += text_splitter.split_documents(documents)
        except Exception:
            # Skip any file type that fails to load; errors are silently ignored here.
            continue
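
    Once docs has been built, the temporary folder is no longer needed; a small cleanup step (not part of the original answer) removes it:

    import shutil

    shutil.rmtree(temp_dir, ignore_errors=True)  # delete the temp folder and the copied files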