I'm trying to analyze an article to determine if a specific substring appears.
If "Bill" appears, then I want to delete the substring's parent sentence from the article, as well as every sentence following the first deleted sentence.
If "Bill" does not appear, no alterations are made to the article.
Sample Text:
stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, but does that hindsight extend to this thoroughly literally-named racing tie-in? Star Fox in the way you can rotate your craft to fit through narrow gaps.
This is Bill, signing off. Thank you for reading. And see you tomorrow!"""
Desired Result When Targeted Substring is "Bill":
stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, but does that hindsight extend to this thoroughly literally-named racing tie-in? Star Fox in the way you can rotate your craft to fit through narrow gaps.
"""
This is the code so far:
if "Bill" not in stringy[-200:]:
    print(stringy)
text = stringy.rsplit("Bill")[0]
text = text.split('.')[:-1]
text = '.'.join(text) + '.'
It currently doesn't work when "Bill" appears outside of the last 200 characters: it cuts off the text at the very first instance of "Bill" (the opening sentence, "This is Bill Everest here"). How can this code be altered to only select for "Bill"s in the last 200 characters?
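One possible way to restrict the search to the tail, sketched under the assumption that a sentence ends at '.', '!' or '?': give `str.find` a start offset so that only the last 200 characters are searched, then cut just after the previous sentence's closing punctuation. The function name `cut_from_bill` and the `target` and `window` parameters are hypothetical, made up for this illustration.

```python
def cut_from_bill(text, target="Bill", window=200):
    # Search for target only from the start of the last `window` characters.
    start = max(len(text) - window, 0)
    hit = text.find(target, start)
    if hit == -1:
        return text  # no match in the tail: leave the article untouched
    # The previous sentence ends at the last '.', '!' or '?' before the hit.
    cut = max(text.rfind(ch, 0, hit) for ch in ".!?")
    # cut == -1 means no sentence end precedes the hit, so nothing is kept.
    return text[:cut + 1]
```

Because the search starts at `len(text) - window`, a "Bill" in the opening sentence of a long article is never considered.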
Here's another approach that loops through each sentence using a regex. We keep a running character count and, once we're in the last 200 characters, check for 'Bill' in the current sentence. If it is found, we exclude everything from that sentence onward.
Hope the code is readable enough.
import re

def remove_bill(stringy):
    # Split into sentences: a capital letter up to the next '.', '!' or '?'
    sentences = re.findall(r'([A-Z][^\.!?]*[\.!?]\s*\n*)', stringy)
    total = len(stringy)
    count = 0  # characters consumed by the sentences seen so far
    for index, line in enumerate(sentences):
        # Check each occurrence of 'Bill' in this sentence
        for pos in (m.start() for m in re.finditer('Bill', line)):
            if count + pos >= total - 200:
                # Keep only the sentences before this one
                stringy = ''.join(sentences[:index])
                return stringy
        count += len(line)
    return stringy

stringy = remove_bill(stringy)
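One caveat: the running `count` only adds up the matched sentences, so any characters the regex skips (for example, the closing quote mark before "Unfortunately" in the sample) make it drift slightly from the true position, and they are also lost when the sentences are joined back together. A possible variant, offered as a sketch and not as a definitive fix, reads each sentence's offset directly from `re.finditer` and slices the original string. The name `remove_from_target` and its `target` and `window` parameters are hypothetical.

```python
import re

def remove_from_target(text, target="Bill", window=200):
    # Positions at or after `cutoff` fall inside the last `window` characters.
    cutoff = len(text) - window
    # Same sentence idea as above: capital letter up to '.', '!' or '?'.
    for sentence in re.finditer(r'[A-Z][^.!?]*[.!?]\s*', text):
        for hit in re.finditer(re.escape(target), sentence.group()):
            if sentence.start() + hit.start() >= cutoff:
                # Slice the original text so skipped characters are preserved.
                return text[:sentence.start()]
    return text
```

It would be called the same way, e.g. `stringy = remove_from_target(stringy)`.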