python, apache-tika

Cutting a string based on a start keyword and an end keyword in Python


I have a PDF that I read with the Tika package in Python. It seems Tika can only read a whole PDF, but I need only the first page.

My code looks like:

    from tika import parser

    # pdfname is the path to the PDF file
    raw = parser.from_file(pdfname)
    rawtext = raw['content']  # extracted text of the whole document

I would like to extract the part of rawtext that lies between a start keyword and an end keyword. How do I do that?


Solution

  • You can use a regex to select the text that you are interested in, for example:

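    import re

    raw_text = 'this is a sample of text'
    start = 'is'
    end = 'of'

    # re.escape keeps any regex metacharacters in the keywords literal;
    # \b restricts each match to a whole word (so 'is' skips 'this').
    start_match = re.search(r'\b' + re.escape(start) + r'\b', raw_text)
    # Look for the end keyword only after the start keyword.
    end_pattern = re.compile(r'\b' + re.escape(end) + r'\b')
    end_match = end_pattern.search(raw_text, start_match.end())
    section_of_text = raw_text[start_match.start():end_match.end()]
    print(section_of_text)
    # prints: is a sample of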
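
Text extracted from a PDF usually spans several lines, and a keyword may be missing altogether, in which case `re.search` returns `None`. The following is a minimal sketch of one way to wrap the same idea in a reusable function. The name `extract_between` and the `include_keywords` parameter are hypothetical, chosen only for illustration, and the sample strings are made up rather than taken from a real PDF.

```python
import re


def extract_between(text, start, end, include_keywords=True):
    """Return the text from the first `start` keyword to the next `end` keyword.

    Returns None when no such span exists.
    """
    pattern = re.compile(
        # (.*?) is non-greedy, so the match stops at the first end keyword.
        r'\b' + re.escape(start) + r'\b(.*?)\b' + re.escape(end) + r'\b',
        re.DOTALL,  # let '.' match newlines, since extracted text is multi-line
    )
    match = pattern.search(text)
    if match is None:
        return None
    if include_keywords:
        return match.group(0)
    # Only the text strictly between the two keywords, without edge whitespace.
    return match.group(1).strip()


sample = 'Title\nIntroduction\nsome body text\nReferences\nmore text'
print(extract_between(sample, 'Introduction', 'References'))
print(extract_between(sample, 'Introduction', 'References', include_keywords=False))
```

With the Tika result from the question, the call would be `extract_between(rawtext, start, end)`. It is worth checking for `None` before using the result, and note that `\b` only behaves as a whole-word anchor when the keywords begin and end with letters, digits, or underscores.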