I have a long list of .docx files stored in several directories, and I need to find which of them contain specific strings.
I have all files in a list like this one (coming from a pandas df column):
files=[r'C:\AAA\BBB\file1.docx', r'G:\CCC\DDD\file2.docx', ...]
I also have a list of strings like this one: strings=['hksdhus','jshaòohse','iueoiwu']
Here is my code:
for string in strings:
    for file in files:
        doc=docx.Document(file)
        for para in doc.paragraphs:
            if string in para.text:
                print(string+' is in '+file)
                break
            else:
                print(string+' not found')
                break
It always prints "string not found". I think it is not reading the file at all, because if I try print(para.text) it prints a blank line.
Can anyone help me with this?
Thanks in advance for any suggestion
Your code only ever checks the first paragraph of each document: both branches of the if end in a break, so the paragraph loop exits after the first paragraph whether or not the string was found. That would also explain the blank output from print(para.text) if the first paragraph happens to be empty.
Remove the 3-line else block and try again. You could instead remove only the break on the last line, but then the "not found" message would be printed for every paragraph that does not contain the string.
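If you still want a single "not found" message per string, one possible way to restructure the search is sketched below. This is only a sketch under some assumptions: the function names load_paragraphs, find_strings and report are made up for illustration, and python-docx is assumed to be installed for the loading step. Each file is read once, and any() stops at the first matching paragraph without giving up after the first one.

```python
def load_paragraphs(path):
    """Return the text of every body paragraph in a .docx file."""
    # python-docx is imported here (an assumption: it is installed) so that
    # the search logic below can be used and tested without it.
    import docx
    return [para.text for para in docx.Document(path).paragraphs]


def find_strings(strings, paragraphs_by_file):
    """Map each search string to the list of files that contain it.

    paragraphs_by_file is a dict of {file path: list of paragraph texts}.
    """
    hits = {s: [] for s in strings}
    for file, paragraphs in paragraphs_by_file.items():
        for s in strings:
            # any() checks paragraphs one by one and stops at the first match,
            # so every paragraph gets a chance before the file is ruled out.
            if any(s in text for text in paragraphs):
                hits[s].append(file)
    return hits


def report(hits):
    """Print one line per match, or a single 'not found' line per string."""
    for s, found_in in hits.items():
        if found_in:
            for file in found_in:
                print(s + ' is in ' + file)
        else:
            print(s + ' not found')
```

With the lists from the question, this could be called as report(find_strings(strings, {f: load_paragraphs(f) for f in files})). Loading the paragraphs up front also means each document is opened once instead of once per search string.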