Tags: python, regex, text, extract, analysis

How to extract the part of a text file after the second occurrence of a specific word using Python


I am trying to extract the part of a text file that starts after the second occurrence of a specific word and ends at the second occurrence of another specific word. The reason is that both words first appear in the table of contents, so when I run my code it matches those first occurrences and I get no output.

Sample text:

Table of contents

Item 1a.Risk Factors

  • not any text (unwanted portion)

Item 1b

End of table of contents

Main content
Item 1a. Risk Factors

  • text (wanted portion)
  • text (wanted portion)
  • text (wanted portion)

Item 1b

I need to extract the text between the second occurrence of Item 1a. Risk Factors and the second occurrence of Item 1b.

My code below:

for file in tqdm(files):
    with open(file, encoding='ISO-8859-1') as f:
        for line in f:
            if line.strip() == 'Item 1A.Risk Factors':
                break
        for line in f:
            if line.strip() == 'Item 1B':
                break
    f = open(os.path.join('QTR4_Risk_Factors',
                          os.path.basename(file)), 'w')
    f.write(line)
    f.close()

Solution

  • There are a few problems with the code you wrote. One of them is that you do not save the text you need while scanning the document for the "end text". It is also good practice to keep as little of the text in memory as possible, because we don't know how big the document you are analyzing is. To do that, we can write to the new file while we read the original.

    Ronie's answer is going in the right direction but it doesn't address the fact that you want to start saving the text only after the second occurrence of your "start hint". Unfortunately I am not yet able to comment to suggest the edit, so I am adding it as a new answer. Try this:

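    import os
    from tqdm import tqdm

    # `files` is the list of input paths from the question; the output folder must already exist.
    for file in tqdm(files):
        out_path = os.path.join('QTR4_Risk_Factors', os.path.basename(file))
        with open(file, encoding='ISO-8859-1') as f, open(out_path, 'w') as w:
            start_hint_counter = 0
            write = False
            for line in f:
                # Count the start heading; copying begins at its second occurrence.
                # The literal must match the line in the file exactly (case and spacing).
                if not write and line.strip() == 'Item 1A.Risk Factors':
                    start_hint_counter += 1
                    if start_hint_counter == 2:
                        write = True
                if write:
                    # Stop at the first end heading seen after copying has started.
                    if line.strip() == 'Item 1B':
                        break
                    # Note: the second start heading itself is copied as well.
                    w.write(line)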
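One thing to watch for: the sample text spells the headings as "Item 1a.Risk Factors", "Item 1a. Risk Factors" and "Item 1b", while the code compares against 'Item 1A.Risk Factors' and 'Item 1B', so an exact string comparison may never match. Below is a minimal sketch of the same streaming idea with a looser comparison that ignores case and whitespace. The names `normalize` and `copy_section` are hypothetical helpers introduced here for illustration, and unlike the loop above this version does not copy the start heading itself. Treat it as a starting point under those assumptions, not a definitive implementation.

```python
import io
import re


def normalize(line):
    """Lowercase a line and remove all whitespace, so that
    'Item 1a. Risk Factors' and 'Item 1A.Risk Factors' compare equal."""
    return re.sub(r"\s+", "", line).lower()


def copy_section(src, dst, start, end, occurrence=2):
    """Stream lines from src to dst, copying only the lines between the
    `occurrence`-th start heading and the next end heading after it.

    Returns True if the requested occurrence of the start heading was found.
    """
    start_key, end_key = normalize(start), normalize(end)
    seen = 0
    writing = False
    for line in src:
        key = normalize(line)
        if not writing:
            if key == start_key:
                seen += 1
                writing = seen == occurrence
            continue  # nothing is copied before (or on) the start heading
        if key == end_key:
            break  # first end heading after copying started
        dst.write(line)
    return writing


# Small in-memory demo based on the sample text from the question.
sample = (
    "Table of contents\n"
    "\n"
    "Item 1a.Risk Factors\n"
    "\n"
    "  * not any text (unwanted portion)\n"
    "\n"
    "Item 1b\n"
    "\n"
    "End of table of contents\n"
    "\n"
    "Main content\n"
    "Item 1a. Risk Factors\n"
    "\n"
    "  * text (wanted portion)\n"
    "  * text (wanted portion)\n"
    "  * text (wanted portion)\n"
    "\n"
    "Item 1b\n"
)
out = io.StringIO()
found = copy_section(io.StringIO(sample), out, "Item 1A. Risk Factors", "Item 1B")
print(out.getvalue())
```

To use it on real files, pass the two file objects opened in the `with` statement above as `src` and `dst`. Because it reads and writes line by line, it still keeps only one line in memory at a time.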