Tags: python, azure, azure-machine-learning-service, langchain, azureml-python-sdk

How to load pdf files from Azure Blob Storage with LangChain PyPDFLoader


I am currently trying to implement LangChain functionality to talk with PDF documents. I have a bunch of PDF files stored in Azure Blob Storage, and I am trying to use the LangChain PyPDFLoader to load them into an Azure ML notebook. However, I have not been able to get it to work. If a PDF is stored locally, it is no problem, but to scale up I have to connect to the blob store. I have not found any relevant documentation on the LangChain or Azure websites. I am wondering if any of you have had a similar problem.

Thank you

Below is an example of the code I am trying:

from azureml.fsspec import AzureMachineLearningFileSystem
fs = AzureMachineLearningFileSystem("<path to datastore>")

from langchain.document_loaders import PyPDFLoader
with fs.open('*/.../file.pdf', 'rb') as fd:
    loader = PyPDFLoader(fd)  # passing the open file object raises the error below
    data = loader.load()

Error: TypeError: expected str, bytes or os.PathLike object, not StreamInfoFileObject

Another example I tried:

from langchain.document_loaders import UnstructuredFileLoader
with fs.open('*/.../file.pdf', 'rb') as fd:
    loader = UnstructuredFileLoader(fd)
    documents = loader.load()

Error: TypeError: expected str, bytes or os.PathLike object, not StreamInfoFileObject

Solution

  • If you still need an answer: these loaders expect a file path on disk rather than a file-like object, so you must convert the blob data into a BytesIO object and save it locally (whether temporarily or permanently) before processing the files. Here is how I do it:

    import io

    # get_blob_container_client is a helper (not shown) that returns an
    # azure.storage.blob ContainerClient for the given container.
    def az_load_files(storage_acc_name, container_name, filenames=None):
        container_client = get_blob_container_client(container_name, storage_acc_name)
        blob_data = []
        for filename in filenames:
            blob_client = container_client.get_blob_client(filename)
            if blob_client.exists():
                blob_data.append(io.BytesIO(blob_client.download_blob().readall()))
        return blob_data
    
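    For example, a hypothetical call might look like this (the storage account name, container name and file name are placeholders):

    loaded_files = az_load_files(
        "mystorageaccount",      # placeholder storage account name
        "my-container",          # placeholder container name
        filenames=["file.pdf"],  # blob names inside the container
    )
    # Each element of loaded_files is an io.BytesIO holding one blob's bytes.
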

    Then create a temp folder where the BytesIO objects can be written out and 'converted' into their respective document types:

    import os
    import tempfile
    
    temp_pdfs = []
    temp_dir = tempfile.mkdtemp()
    # ss is assumed to hold the downloaded BytesIO objects ('loaded_files')
    # and their corresponding file names ('selected_files').
    for i, byteio in enumerate(ss['loaded_files']):
        file_path = os.path.join(temp_dir, ss['selected_files'][i])
        with open(file_path, 'wb') as file:
            file.write(byteio.getbuffer())
        temp_pdfs.append(file_path)
    
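    If you only need the files on disk while they are being loaded, a minimal alternative sketch (assuming the same ss dict and a PDF-only workload) uses tempfile.TemporaryDirectory, which removes the folder automatically when the block exits; the DirectoryLoader approach below works the same way inside the with block:

    import os
    import tempfile

    from langchain.document_loaders import PyPDFLoader

    with tempfile.TemporaryDirectory() as tmp_dir:
        for name, byteio in zip(ss['selected_files'], ss['loaded_files']):
            with open(os.path.join(tmp_dir, name), 'wb') as f:
                f.write(byteio.getbuffer())
        # Load while the directory still exists.
        pages = []
        for name in ss['selected_files']:
            if name.lower().endswith('.pdf'):
                pages += PyPDFLoader(os.path.join(tmp_dir, name)).load()
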

    And use DirectoryLoader to load any type of doc you may have

    from langchain.text_splitter import RecursiveCharacterTextSplitter 
    from langchain.document_loaders import (
      PyPDFLoader,
      DirectoryLoader,
      CSVLoader,
      Docx2txtLoader,
      TextLoader,
      UnstructuredExcelLoader,
      UnstructuredHTMLLoader,
      UnstructuredPowerPointLoader,
      UnstructuredMarkdownLoader,
      JSONLoader
    )
    
    file_type_mappings = {
        '*.txt': TextLoader,
        '*.pdf': PyPDFLoader,
        '*.csv': CSVLoader,
        '*.docx': Docx2txtLoader,
        '*.xls': UnstructuredExcelLoader,
        '*.xlsx': UnstructuredExcelLoader,
        '*.html': UnstructuredHTMLLoader,
        '*.pptx': UnstructuredPowerPointLoader,
        '*.ppt': UnstructuredPowerPointLoader,
        '*.md': UnstructuredMarkdownLoader,
        '*.json': JSONLoader,
    }
    
    
    docs = []
    
    for glob_pattern, loader_cls in file_type_mappings.items():
        try:
            # JSONLoader needs extra arguments; the other loaders take none.
            loader_kwargs = {'jq_schema': '.', 'text_content': False} if loader_cls == JSONLoader else None
            loader_dir = DirectoryLoader(
                temp_dir, glob=glob_pattern, loader_cls=loader_cls, loader_kwargs=loader_kwargs)
            documents = loader_dir.load_and_split()
            text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=800, chunk_overlap=200)
            # For each glob pattern, split the loaded documents and append the chunks.
            docs += text_splitter.split_documents(documents)
        except Exception:
            # Skip any file type that fails to load; errors are silently ignored here.
            continue
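
    Once docs has been built, the temporary folder is no longer needed; a small cleanup step (not part of the original answer) removes it:

    import shutil

    shutil.rmtree(temp_dir, ignore_errors=True)  # delete the temp folder and the copied files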