I'm trying to pull the PDFs that contain every word in the following list. I was able to pull PDFs when searching for only one word. Should I change my condition below? Thanks in advance. Newbie here.
from tika import parser
import glob
path = glob.glob(r"C:\Users\kxdane\Desktop\TEST\OKED\*.pdf")
for path in path:
    pdf_files = glob.glob(path)
    text = (['Disclosure','M.D.'])
    for file in pdf_files:
        raw = parser.from_file(file)
        if text in raw['content']:
            print(file)
In Python, a substring test with the in operator takes a single string on its left-hand side, so it cannot check a whole list of words at once. You need to test each substring in a loop and combine the results with a logical AND, for example like this:
...
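words = ['Disclosure','M.D.']
for file in pdf_files:
    raw = parser.from_file(file)
    # content can be None when no text was extracted; treat that as empty text
    content = raw['content'] or ''
    found = True
    for word in words:
        if word not in content:
            # one missing word is enough to reject this file
            found = False
            break
    if found:
        print(file)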
Note: if words is an empty list, this will match every file in pdf_files.
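The same "every word must be present" check can also be written with the built-in all(). The sketch below is only an illustration: contains_all is a hypothetical helper name (it is not part of tika), and it works on a plain string so it can be tried without any PDFs. It assumes the extracted text may be None and treats that as empty text.

```python
def contains_all(content, words):
    """Return True if every word in `words` occurs as a substring of `content`."""
    # Assumption: the extracted text may be None; treat that as empty text.
    content = content or ""
    # all() is True only if every membership test succeeds
    # (and also True for an empty `words` list).
    return all(word in content for word in words)


words = ['Disclosure', 'M.D.']
print(contains_all("Disclosure statement by Jane Doe, M.D.", words))  # True
print(contains_all("Disclosure statement only", words))               # False
```

Inside the loop over pdf_files, the condition would then be `if contains_all(raw['content'], words):`. Swapping all() for any() would instead match files that contain at least one of the words.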