Read a stream of a Word document (.doc) in Python

I'm trying to read a legacy Word document (.doc) in order to build a CustomWordLoader for LangChain. I can already read .docx files using the python-docx package.

The stream is created by reading a Word document from a SharePoint site.

Here is my code for .docx files:

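from docx import Document as DocxDocument
from langchain_core.document_loaders import BaseLoader  # import path may differ by LangChain version


class CustomWordLoader(BaseLoader):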
    """
    This class is a custom loader for Word documents. It extends the BaseLoader class and overrides its methods.
    It uses the python-docx library to parse Word documents and optionally splits the text into manageable documents.
    
    Attributes:
    stream (io.BytesIO): A binary stream of the Word document.
    filename (str): The name of the Word document.
    """
    def __init__(self, stream, filename: str):
        # Initialize with a binary stream and filename
        self.stream = stream
        self.filename = filename

    def load_and_split(self, text_splitter=None):
        # Use python-docx to parse the Word document from the binary stream
        doc = DocxDocument(self.stream)
        # Extract and concatenate all paragraph texts into a single string
        text = "\n".join([p.text for p in doc.paragraphs])

        # Check if a text splitter utility is provided
        if text_splitter is not None:
            # Use the provided splitter to divide the text into manageable documents
            split_text = text_splitter.create_documents([text])
        else:
            # Without a splitter, treat the entire text as one document
            split_text = [{'text': text, 'metadata': {'source': self.filename}}]

        # Add source metadata to each resulting document
        for doc in split_text:
            if isinstance(doc, dict):
                doc['metadata'] = {**doc.get('metadata', {}), 'source': self.filename}
            else:
                doc.metadata = {**doc.metadata, 'source': self.filename}

        return split_text

My solution will be deployed in a Docker container built from the "3.11.8-alpine3.18" image (a minimal Linux distribution).

For security reasons, I can't download the file locally, so I really need to be able to read the stream directly, as in my example: doc = DocxDocument(self.stream)

I'm looking for an equivalent of python-docx for .doc files, since python-docx can read .docx but not .doc.
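Because python-docx only handles .docx, a loader that receives both formats needs some way to tell them apart before choosing a parser. Here is a minimal sketch, assuming the stream is seekable (for example an io.BytesIO). The function name sniff_word_format is hypothetical. It relies on the fact that legacy .doc files are OLE2 compound files and .docx files are ZIP archives; the signature only identifies the container, so other Office formats match too.

```python
import io

# Container signatures: legacy .doc is an OLE2 compound file, .docx is a ZIP archive.
OLE2_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
ZIP_SIGNATURE = b"PK\x03\x04"


def sniff_word_format(stream) -> str:
    """Return 'doc', 'docx', or 'unknown' based on the stream's leading bytes.

    Hypothetical helper: it checks the container type only, not the contents.
    """
    start = stream.tell()
    header = stream.read(8)
    stream.seek(start)  # rewind so the parser still sees the whole stream
    if header.startswith(OLE2_SIGNATURE):
        return "doc"
    if header.startswith(ZIP_SIGNATURE):
        return "docx"
    return "unknown"
```

With this, the loader could pass the stream to DocxDocument when the result is 'docx' and fall back to a .doc-capable tool otherwise.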

Solution

I was able to do it using textract. I have to save the stream to a local file temporarily, but that's the only way I found.

Here is my code:

import os
import uuid

import textract
from langchain_core.document_loaders import BaseLoader  # import path may differ by LangChain version


class CustomWordLoader(BaseLoader):
    """
    A custom loader for Word documents, extending BaseLoader. It reads Word documents from a binary stream,
    writes them temporarily to disk, and uses textract to extract text. If textract fails, an exception is raised.
    """
    def __init__(self, stream, filename: str):
        self.stream = stream
        self.filename = filename

    def load_and_split(self, text_splitter=None):
        # Generate a unique filename
        temp_filename = str(uuid.uuid4()) + '.doc'

        # Create a temporary directory
        temp_dir = os.path.join(os.getcwd(), 'temp')
        os.makedirs(temp_dir, exist_ok=True)

        # Full path to the temporary file
        temp_file_path = os.path.join(temp_dir, temp_filename)

        try:
            # Write the content of the stream into the temporary file
            with open(temp_file_path, 'wb') as f:
                f.write(self.stream.read())

            # Use textract to extract the text from the file
            text = textract.process(temp_file_path).decode('utf-8')
        finally:
            # Remove the temporary file, even if extraction fails
            if os.path.exists(temp_file_path):
                os.remove(temp_file_path)

        if text_splitter is not None:
            split_text = text_splitter.create_documents([text])
        else:
            split_text = [{'text': text, 'metadata': {'source': self.filename}}]

        # Add source metadata to each resulting document
        for doc in split_text:
            if isinstance(doc, dict):
                doc['metadata'] = {**doc.get('metadata', {}), 'source': self.filename}
            else:
                doc.metadata = {**doc.metadata, 'source': self.filename}

        return split_text
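The write-then-delete step in the solution above could also be handled with the standard library's tempfile module, which picks a unique name in the system temp directory. This is only a sketch of that idea, not a tested drop-in replacement. The helper name stream_as_temp_file is hypothetical, and it assumes the stream exposes a read() method returning bytes.

```python
import io
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def stream_as_temp_file(stream, suffix=".doc"):
    """Write a binary stream to a temporary file, yield its path, then delete it.

    Hypothetical helper for libraries that accept only a file path.
    """
    # delete=False lets us close the file so another library can reopen it by path
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(stream.read())
        path = tmp.name
    try:
        yield path
    finally:
        # Runs whether or not the caller's block raised
        os.remove(path)
```

Inside load_and_split, the extraction would then look like: with stream_as_temp_file(self.stream) as path: text = textract.process(path).decode('utf-8'). The file still touches the disk briefly, so this does not remove the constraint, it only keeps the cleanup in one place.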

I hope this can help someone!