I am trying to extract the part of a text file that starts after the second occurrence of one specific phrase and ends at the second occurrence of another specific phrase. The reason is that both phrases first appear in the table of contents, so when I run my code it matches those first occurrences and I get empty output.
Sample text:
Table of contents
Item 1a.Risk Factors
Item 1b
End of table of contents
Main content
Item 1a. Risk Factors
Item 1b
I need to extract the text between the second occurrence of Item 1a. Risk Factors and the second occurrence of Item 1b.
My code below:
for file in tqdm(files):
    with open(file, encoding='ISO-8859-1') as f:
        for line in f:
            if line.strip() == 'Item 1A.Risk Factors':
                break
        for line in f:
            if line.strip() == 'Item 1B':
                break
    f = open(os.path.join('QTR4_Risk_Factors',
                          os.path.basename(file)), 'w')
    f.write(line)
    f.close()
There are a few problems with the code you wrote. One of them is that you do not save the part of the text you need while scanning the document for the "end text". It is also best practice to keep as little of the text in memory as possible, because we don't know how big the document you are analyzing is. To do that, we can write to the new file while we are reading the original.
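To see the first problem concretely, here is a minimal sketch that runs the same two loops over an in-memory sample. The `io.StringIO` text and the `last_line` name are made up for illustration. After both loops finish, `line` holds only the last line read, so the text in between was never kept:

```python
import io

# Made-up sample standing in for one input file.
f = io.StringIO("Item 1A.Risk Factors\nrisk text\nItem 1B\nafter\n")

# First loop: advance to the start hint.
for line in f:
    if line.strip() == 'Item 1A.Risk Factors':
        break

# Second loop: advance to the end hint, discarding every line on the way.
for line in f:
    if line.strip() == 'Item 1B':
        break

# Only the last line read survives; "risk text" is gone.
last_line = line
print(repr(last_line))
```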
Ronie's answer is going in the right direction, but it doesn't address the fact that you want to start saving the text only after the second occurrence of your "start hint". Unfortunately I am not yet able to comment to suggest the edit, so I am adding it as a new answer. Try this:
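import os
from tqdm import tqdm

# `files` is the list of input paths from the question.
for file in tqdm(files):
    out_path = os.path.join('QTR4_Risk_Factors', os.path.basename(file))
    with open(file, encoding='ISO-8859-1') as f, open(out_path, 'w') as w:
        start_hint_counter = 0
        write = False
        for line in f:
            # Count occurrences of the start hint until the second one is found.
            if not write and line.strip() == 'Item 1A.Risk Factors':
                start_hint_counter += 1
                if start_hint_counter == 2:
                    write = True
            if write:
                # Stop at the first end hint seen after writing has started.
                if line.strip() == 'Item 1B':
                    break
                w.write(line)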
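Note that the exact comparison with 'Item 1A.Risk Factors' will not match a heading written as 'Item 1a. Risk Factors', as in your sample text. One possible way around that, as a sketch only, is to compare the lines after lowercasing them and removing whitespace. The helper names `normalize` and `extract_between` and the "Some risk text" line are my own inventions for illustration, and the sketch assumes each heading sits on its own line. Unlike the loop above, it skips the start heading itself instead of writing it:

```python
import io
import re


def normalize(text):
    """Lowercase and drop all whitespace so spacing and case differences are ignored."""
    return re.sub(r"\s+", "", text).lower()


def extract_between(lines, start_hint, end_hint, occurrence=2):
    """Yield the lines after the `occurrence`-th start_hint, up to the next end_hint."""
    start, end = normalize(start_hint), normalize(end_hint)
    seen = 0
    writing = False
    for line in lines:
        key = normalize(line)
        if not writing:
            if key == start:
                seen += 1
                writing = seen == occurrence
            continue  # skip everything up to and including the start heading
        if key == end:
            break
        yield line


# Hypothetical sample modelled on the question, with one made-up body line.
sample = io.StringIO(
    "Table of contents\n"
    "Item 1a.Risk Factors\n"
    "Item 1b\n"
    "End of table of contents\n"
    "Main content\n"
    "Item 1a. Risk Factors\n"
    "Some risk text\n"
    "Item 1b\n"
)
result = list(extract_between(sample, "Item 1A.Risk Factors", "Item 1B"))
print(result)
```

Because `extract_between` is a generator, it can be fed an open file object directly and its output passed to `w.writelines(...)`, so the text is still streamed rather than held in memory.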