I have .docx files in a directory and I want to get all text between two paragraphs.
Example:
Foo :
The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.
Bar :
I want to get :
The foo is not easy, but we have to do it.
We are looking for new things in our ad libitum way of life.
I wrote this code :
import docx
import pathlib
import glob
import re

def rf(f1):
    reader = docx.Document(f1)
    alltext = []
    for p in reader.paragraphs:
        alltext.append(p.text)
    return '\n'.join(alltext)

for f in docxfiles:
    try:
        fulltext = rf(f)
        testf = re.findall(r'Foo\s*:(.*)\s*Bar', fulltext, re.DOTALL)
        print(testf)
    except IOError:
        print('Error opening', f)
It returns None.
What am I doing wrong?
If you loop over all paragraphs and print each paragraph's text, you get the document text as is, but a single p.text in your loop does not contain the full document's text.
It works with a string:
import re

t = """Foo :

The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.

Bar :"""

fread = re.search(r'Foo\s*:(.*)\s*Bar', t)
print(fread)  # None, because '.' does not match \n without re.DOTALL
fread = re.search(r'Foo\s*:(.*)\s*Bar', t, re.DOTALL)
print(fread)     # the match object
print(fread[1])  # the text captured between the two markers
Output:
<_sre.SRE_Match object; span=(0, 115), match='Foo :\n\nThe foo is not easy, but we have to do i>
The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.
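As a small variation on the snippet above, and only as a sketch: a non-greedy group `(.*?)` stops at the first Bar instead of the last one, and `str.strip()` drops the newlines captured around the text. The function name `extract_between` and the sample string are made up for this illustration.

```python
import re

def extract_between(text, start=r'Foo\s*:', end=r'Bar\s*:'):
    """Return the stripped text of every section found between start and end."""
    # (.*?) is non-greedy, so each match ends at the nearest end marker.
    pattern = re.compile(start + r'(.*?)' + end, re.DOTALL)
    return [section.strip() for section in pattern.findall(text)]

# Made-up sample containing two Foo/Bar sections.
sample = "Foo :\n\nfirst\n\nBar :\nFoo :\nsecond\nBar :"
print(extract_between(sample))
```

With the greedy `(.*)` from the original pattern, the same sample would give one match running from the first Foo to the last Bar.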
If you use
for p in reader.paragraphs:
    print("********")
    print(p.text)
    print("********")
you will see why your regex won't match. Your regex would work on the whole document's text.
See "How to extract text from an existing docx file using python-docx" for how to get the whole document's text.
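A minimal sketch of that difference, assuming the markers and the text sit in separate paragraphs: the pattern finds nothing in any single paragraph but matches once the paragraph texts are joined. The helper names `whole_text` and `read_docx_text` are hypothetical, and the list of strings stands in for the `p.text` values of a real document.

```python
import re

def whole_text(paragraph_texts):
    """Join paragraph texts into one string, one paragraph per line."""
    return '\n'.join(paragraph_texts)

def read_docx_text(path):
    """Return the full text of a .docx file (needs python-docx installed)."""
    import docx  # imported here so the demo below runs without python-docx
    return whole_text(p.text for p in docx.Document(path).paragraphs)

# Stand-in for [p.text for p in reader.paragraphs]
paragraphs = ["Foo :", "The foo is not easy, but we have to do it.", "Bar :"]
pattern = re.compile(r'Foo\s*:(.*)\s*Bar', re.DOTALL)

per_paragraph = [pattern.search(p) for p in paragraphs]
joined = pattern.search(whole_text(paragraphs))

print(per_paragraph)      # no single paragraph contains both markers
print(joined[1].strip())  # the text between the markers
```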
You could also look for a paragraph that matches r'Foo\s*:', then put the text of every following paragraph into a list until you hit a paragraph that matches r'\s*Bar'.
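The paragraph-by-paragraph approach could be sketched roughly like this. The function name `paragraphs_between` is made up, it works on plain strings so it can be tried without a .docx file, and skipping empty paragraphs is an assumption about what is wanted.

```python
import re

def paragraphs_between(texts, start=r'Foo\s*:', end=r'\s*Bar'):
    """Collect the paragraph texts found after a start marker, up to an end marker."""
    collected = []
    inside = False
    for text in texts:
        if not inside:
            # Still looking for the paragraph that opens the section.
            if re.match(start, text):
                inside = True
            continue
        if re.match(end, text):
            break  # reached the closing paragraph, stop collecting
        if text.strip():  # assumption: empty paragraphs are not wanted
            collected.append(text)
    return collected

# Stand-in for the p.text values of a document
texts = [
    "Intro",
    "Foo :",
    "",
    "The foo is not easy, but we have to do it.",
    "We are looking for new things in our ad libitum way of life.",
    "Bar :",
    "Outro",
]
for line in paragraphs_between(texts):
    print(line)
```

With python-docx, the same function should accept `(p.text for p in docx.Document(f).paragraphs)` as its first argument.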