python tika-python

Find multiple text strings in PDFs


I'm trying to find the PDFs that contain the strings in the following list. So far I have only been able to match PDFs on a single word. Should I change my condition below? Thanks in advance; I'm a newbie.

from tika import parser
import glob

path = glob.glob(r"C:\Users\kxdane\Desktop\TEST\OKED\*.pdf")

for path in path:

pdf_files = glob.glob(path)

text = (['Disclosure','M.D.'])
for file in pdf_files:
    raw = parser.from_file(file)
    if text in raw['content']:
        print(file)

Solution

  • In Python, the in operator tests for one substring at a time, and its left operand must be a string, so a list such as ['Disclosure','M.D.'] cannot be tested directly. Search for each substring in a loop and combine the results with a logical AND, for example like this:

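    ...
    words = ['Disclosure', 'M.D.']
    for file in pdf_files:
        raw = parser.from_file(file)
        # Tika returns None as content when no text could be extracted
        content = raw['content'] or ''
        # Keep the file only if every word occurs in its text
        found = True
        for word in words:
            if word not in content:
                found = False
                break
        if found:
            print(file)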
    

    Note: if words is an empty list, every file in pdf_files will match.
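
The loop-and-AND check described above can also be written with Python's built-in all(). This is a minimal sketch, not part of the original answer: the helper name contains_all and the sample strings are made up for illustration, and the None handling assumes the parser may return no content for a PDF without extractable text.

```python
def contains_all(content, words):
    """Return True if every string in words occurs in content."""
    # Treat missing content (None) as empty text so the check never raises.
    text = content or ""
    # all() is True only if each membership test passes; it is also True
    # for an empty words list, which matches the note above.
    return all(word in text for word in words)


# Hypothetical sample texts standing in for raw['content'].
sample_hit = "Disclosure statement signed by J. Smith, M.D."
sample_miss = "Disclosure statement with no signature."

print(contains_all(sample_hit, ["Disclosure", "M.D."]))   # True
print(contains_all(sample_miss, ["Disclosure", "M.D."]))  # False
```

Inside the file loop, the found flag and the inner for loop could then be replaced by a single test such as: if contains_all(raw['content'], words): print(file).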