I have a PDF that I read with the Tika package in Python. It seems Tika can only read a whole PDF, and I need to read only the first page.
My code looks like:
from tika import parser
raw = parser.from_file(pdfname)
rawtext = raw['content']
I would like to split rawtext by a start keyword and an end keyword. How do I do that?
You can use a regex to select the text you are interested in, for example:
import re
raw_text = 'this is a sample of text'
start = 'is'
end = 'of'
# re.escape keeps any regex metacharacters in the keywords literal
start_index = re.search(r'\b' + re.escape(start) + r'\b', raw_text).start()
end_index = re.search(r'\b' + re.escape(end) + r'\b', raw_text).end()
section_of_text = raw_text[start_index:end_index]
print(section_of_text)
# prints: is a sample of
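
The snippet above raises an AttributeError when either keyword is missing, because re.search returns None, and it can pick up an occurrence of the end keyword that comes before the start keyword. A slightly more defensive variant might look like the sketch below. The helper name extract_between is made up for illustration, and it assumes both keywords begin and end with word characters so that the \b anchors behave as expected.

```python
import re


def extract_between(text, start, end):
    """Return the text from `start` through `end` (inclusive), or None if either is missing.

    Hypothetical helper for illustration; assumes both keywords begin and
    end with word characters, since they are wrapped in \\b anchors.
    """
    start_pattern = re.compile(r'\b' + re.escape(start) + r'\b')
    end_pattern = re.compile(r'\b' + re.escape(end) + r'\b')

    start_match = start_pattern.search(text)
    if start_match is None:
        return None

    # Begin the search for the end keyword where the start keyword finishes,
    # so an earlier occurrence of `end` is ignored.
    end_match = end_pattern.search(text, start_match.end())
    if end_match is None:
        return None

    return text[start_match.start():end_match.end()]


print(extract_between('this is a sample of text', 'is', 'of'))
print(extract_between('this is a sample of text', 'is', 'missing'))
```

With Tika you would pass raw['content'] as the text argument. Note that content can be None when no text could be extracted, so check for that first.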