Python3 Docx get text between 2 paragraphs


I have .docx files in a directory and I want to get all the text between two paragraphs.

Example:

Foo :

The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.

Bar :

I want to get :

The foo is not easy, but we have to do it.
We are looking for new things in our ad libitum way of life. 

I wrote this code :

import docx
import pathlib
import glob
import re

def rf(f1):
    reader = docx.Document(f1)
    alltext = []
    for p in reader.paragraphs:
        alltext.append(p.text)
    return '\n'.join(alltext)


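# Assumption: the .docx files sit in the current directory; adjust the pattern as needed.
docxfiles = glob.glob('*.docx')

for f in docxfiles: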
    try:
        fulltext = rf(f)
        testf = re.findall(r'Foo\s*:(.*)\s*Bar', fulltext, re.DOTALL)
        
        print(testf)
    except IOError:
        print('Error opening',f)

It returns None.

What am I doing wrong?


Solution

  • If you loop over all paragraphs and print each paragraph's text, you get the document text as is, but a single p.text in your loop does not contain the full document's text.

    It works with a string:

    t = """Foo :
    
    The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.
    
    Bar :"""
          
    import re
          
    fread = re.search(r'Foo\s*:(.*)\s*Bar', t)
          
    print(fread)  # None  - because dots do not match \n
         
    fread = re.search(r'Foo\s*:(.*)\s*Bar', t, re.DOTALL)
          
    print(fread)
    print(fread[1])
    

    Output:

    <_sre.SRE_Match object; span=(0, 115), match='Foo :\n\nThe foo is not easy, but we have to do i>
    
    
    The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.
    

    If you use

    for p in reader.paragraphs:
        print("********")
        print(p.text)
        print("********")
    

    you see why your regex won't match: each p.text holds only one paragraph, so "Foo" and "Bar" never appear in the same string. Your regex would work on the whole document's text.

    See "How to extract text from an existing docx file using python-docx" for how to get the whole document's text.
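As a rough illustration of the whole-text approach, the sketch below joins paragraph texts with newlines and runs a single regex over the result. The function name `between_markers` and the sample list are made up for this example, and the pattern is a slightly tightened variant of the one in the question (non-greedy group, surrounding whitespace kept out of the capture), not the asker's exact regex.

```python
import re

def between_markers(fulltext):
    # Non-greedy (.*?) stops at the first "Bar :"; re.DOTALL lets "." cross newlines.
    match = re.search(r'Foo\s*:\s*(.*?)\s*Bar\s*:', fulltext, re.DOTALL)
    return match.group(1) if match else None

# Stand-in for [p.text for p in docx.Document(path).paragraphs]
paragraphs = ['Foo :', '', 'The foo is not easy, but we have to do it.', '', 'Bar :']
fulltext = '\n'.join(paragraphs)
print(between_markers(fulltext))
```

With a real file, the `paragraphs` list would presumably come from python-docx as noted in the comment; the regex part stays the same.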

    Alternatively, look for a paragraph that matches r'Foo\s*:', then collect the text of each following paragraph into a list until you hit a paragraph that matches r'\s*Bar'.
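The paragraph-by-paragraph idea could be sketched roughly as follows. The helper names `text_between` and `text_between_in_docx` are hypothetical, and skipping empty paragraphs is an assumption about what the asker wants. The core function works on plain strings so it can be tried without a .docx file.

```python
import re

def text_between(paragraph_texts, start=r'Foo\s*:', end=r'\s*Bar'):
    """Collect paragraph texts after the one matching `start`, up to the one matching `end`."""
    collected = []
    inside = False
    for text in paragraph_texts:
        if not inside:
            # re.match anchors at the beginning of the paragraph text.
            if re.match(start, text):
                inside = True
            continue
        if re.match(end, text):
            break
        if text.strip():  # assumption: empty paragraphs are not wanted
            collected.append(text)
    return collected

def text_between_in_docx(path):
    # Imported here so text_between() can be tried without python-docx installed.
    import docx
    return text_between(p.text for p in docx.Document(path).paragraphs)

sample = ['Foo :', '', 'The foo is not easy, but we have to do it.', '', 'Bar :', 'ignored']
print(text_between(sample))
```

Note that if no paragraph matches the end pattern, this sketch returns everything after the start paragraph, and a paragraph that merely begins with "Bar" (for example "Barely") would also end the collection; a stricter end pattern such as r'\s*Bar\s*:' may be preferable.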