Tags: python, ubuntu, docx, python-docx

Issue reading text with python-docx when document contains Images


I am having issues parsing text from a document that contains images.

I am using python-docx 0.7.0 on a machine running Ubuntu 12.04.4 LTS (GNU/Linux 3.2.0-60-generic x86_64).

I am using this logic:

```
from docx import Document


# The enclosing function was omitted from the snippet; the name is illustrative.
def read_text(path):
    document = Document(path)

    text = ""

    # Append the text of each paragraph to a single string
    for para in document.paragraphs:
        # Separate paragraphs with a line break
        text += "\n" + para.text

    return text.strip()
```

When the document contains an image, this process fails.

Is there something I am doing wrong?


Solution

  • python-docx should support what you're trying to do here. If you provide the stack trace you get when the error is raised, I'll take a look.

    Btw, you can code this a little more elegantly as:

    ```
    document = Document(path)
    text = '\n'.join([para.text for para in document.paragraphs])
    ```
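To collect the stack trace requested above, something like the following could work. It is only a sketch: the helper name `extract_text_with_trace` and its `open_document` parameter are made up for illustration and are not part of python-docx. The parameter is expected to be a callable such as `docx.Document`, and the only thing assumed about its return value is the `paragraphs` / `para.text` interface used in the question.

```python
import traceback


def extract_text_with_trace(path, open_document):
    """Return (text, None) on success or (None, traceback_text) on failure.

    `open_document` is a hypothetical parameter: any callable that takes a
    path and returns an object with a `.paragraphs` list, e.g. docx.Document.
    """
    try:
        document = open_document(path)
        # Join the paragraph texts with line breaks, as in the answer above
        text = "\n".join(para.text for para in document.paragraphs)
        return text, None
    except Exception:
        # Capture the full stack trace as a string so it can be posted
        return None, traceback.format_exc()
```

Calling `extract_text_with_trace(path, Document)` with `Document` imported from `docx` returns either the joined text or the full traceback as a string, which can then be pasted into the question.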