Read a stream of a Word document (.doc) in Python


I'm trying to read a legacy Word document (.doc) so that I can build a CustomWordLoader for LangChain. I can already read .docx files using the python-docx package.

The stream is created by reading a Word document from a SharePoint site.

Here is my code for .docx files:

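from docx import Document as DocxDocument
# The BaseLoader import path may differ between LangChain versions
from langchain_community.document_loaders.base import BaseLoader


class CustomWordLoader(BaseLoader):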
    """
    This class is a custom loader for Word documents. It extends the BaseLoader class and overrides its methods.
    It uses the python-docx library to parse Word documents and optionally splits the text into manageable documents.
    
    Attributes:
    stream (io.BytesIO): A binary stream of the Word document.
    filename (str): The name of the Word document.
    """
    def __init__(self, stream, filename: str):
        # Initialize with a binary stream and filename
        self.stream = stream
        self.filename = filename

    def load_and_split(self, text_splitter=None):
        # Use python-docx to parse the Word document from the binary stream
        word_doc = DocxDocument(self.stream)
        # Extract and concatenate all paragraph texts into a single string
        text = "\n".join(p.text for p in word_doc.paragraphs)

        # Check if a text splitter utility is provided
        if text_splitter is not None:
            # Use the provided splitter to divide the text into manageable documents
            split_text = text_splitter.create_documents([text])
        else:
            # Without a splitter, treat the entire text as one document
            split_text = [{'text': text, 'metadata': {'source': self.filename}}]

        # Add source metadata to each resulting document
        for doc in split_text:
            if isinstance(doc, dict):
                doc['metadata'] = {**doc.get('metadata', {}), 'source': self.filename}
            else:
                doc.metadata = {**doc.metadata, 'source': self.filename}

        return split_text

My solution will be deployed in a Docker container using the "3.11.8-alpine3.18" image tag (Alpine, a minimal Linux distribution).

For security reasons, I can't download the file locally, so I really need to be able to read the stream directly, as in my example: doc = DocxDocument(self.stream)

I tried to find a package equivalent to python-docx for .doc files, since python-docx can read .docx but not .doc.
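
Because python-docx only handles .docx, it can help to tell the two formats apart from the stream itself before choosing a parser. The following is a minimal sketch, assuming the stream is seekable (as io.BytesIO is). The helper name `sniff_word_format` is hypothetical. It relies only on the well-known leading bytes of the two containers: .doc files are OLE2 compound files, and .docx files are ZIP archives. Other Office formats share these containers, so a match suggests the format but does not prove it.

```python
import io

# Leading bytes of the two container formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"  # .docx (ZIP archive)


def sniff_word_format(stream: io.BytesIO) -> str:
    """Return 'doc', 'docx', or 'unknown' based on the stream's first bytes.

    Hypothetical helper: it only inspects the container signature, so other
    OLE2 or ZIP based files would be reported as 'doc' or 'docx' as well.
    """
    start = stream.tell()
    header = stream.read(8)
    stream.seek(start)  # restore the position so a parser can read from the start
    if header.startswith(OLE2_MAGIC):
        return "doc"
    if header.startswith(ZIP_MAGIC):
        return "docx"
    return "unknown"
```

A loader could call this first and pass the stream to DocxDocument only when the result is 'docx'.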


Solution

  • I was able to do it using textract. I have to save the stream to a local file first, but that's the only way I found.

    Here is my code:

    import os
    import uuid

    import textract
    # The BaseLoader import path may differ between LangChain versions
    from langchain_community.document_loaders.base import BaseLoader


    class CustomWordLoader(BaseLoader):
        """
        A custom loader for Word documents, extending BaseLoader. It reads Word documents from a binary stream,
        writes them temporarily to disk, and uses textract to extract text. If textract fails, an exception is raised.
        """
        def __init__(self, stream, filename: str):
            self.stream = stream
            self.filename = filename

        def load_and_split(self, text_splitter=None):
            # Generate a unique filename
            temp_filename = str(uuid.uuid4()) + '.doc'

            # Create a temporary directory
            temp_dir = os.path.join(os.getcwd(), 'temp')
            os.makedirs(temp_dir, exist_ok=True)

            # Full path to the temporary file
            temp_file_path = os.path.join(temp_dir, temp_filename)

            try:
                # Write the content of the stream into the temporary file
                with open(temp_file_path, 'wb') as f:
                    f.write(self.stream.read())

                # Use textract to extract the text from the file
                text = textract.process(temp_file_path).decode('utf-8')
            finally:
                # Remove the temporary file, even if extraction fails
                if os.path.exists(temp_file_path):
                    os.remove(temp_file_path)

            if text_splitter is not None:
                split_text = text_splitter.create_documents([text])
            else:
                split_text = [{'text': text, 'metadata': {'source': self.filename}}]

            # Add source metadata to each resulting document
            for doc in split_text:
                if isinstance(doc, dict):
                    doc['metadata'] = {**doc.get('metadata', {}), 'source': self.filename}
                else:
                    doc.metadata = {**doc.metadata, 'source': self.filename}

            return split_text


    I hope this can help someone!
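
The write-to-disk step can also be sketched with the standard library's tempfile module, which picks a unique name in the system temp directory instead of a hand-built 'temp' folder. This is only a sketch under assumptions. The helper name `extract_text_via_temp_file` and its `extractor` parameter are hypothetical, and the extractor is assumed to take a file path and return bytes, the way textract.process is used in the code above.

```python
import os
import tempfile
from typing import BinaryIO, Callable


def extract_text_via_temp_file(
    stream: BinaryIO,
    extractor: Callable[[str], bytes],
    suffix: str = ".doc",
) -> str:
    """Write `stream` to a temporary file, run `extractor` on its path,
    and delete the file afterwards, whether or not extraction succeeds."""
    # delete=False so the file can be closed before the extractor opens it
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(stream.read())
        temp_path = tmp.name
    try:
        return extractor(temp_path).decode("utf-8")
    finally:
        # Clean up even when the extractor raises
        os.remove(temp_path)
```

Inside load_and_split this might be called as text = extract_text_via_temp_file(self.stream, textract.process). The file still touches the disk briefly, so it does not lift the restriction on local files. It only narrows the window and guarantees cleanup.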