I have sentences that quote text inside them, like:
Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were "the lights of his town growing smaller below them"?
What is a fdsfdsf for the word "adjust"?
Reread this: "If anybody had asked trial of answered at once, 'My nose.'" What is the correct definition of the word "trial" as it is used here?
Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?
I am trying to mask the quoted parts with a regex, but the result is not accurate. For instance, for the last sentence:
import re
txt = 'Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?'
print(re.sub(r"(?<=\").{20,}(?=\")", "<quote>", txt))
The output is:
Reread these sentences: "<quote>" mean?
Instead, it should be:
Reread these sentences: "<quote>" What does the word "courtship" mean?
Since I have more than 10k instances, it's really hard to find a single regex pattern that works for all cases.
My question is: is there a library (maybe one based on a neural network?) or another approach that solves this problem?
For these examples, use:
import re
txt = """Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say “I would have tea instead.”
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were "the lights of his town growing smaller below them"?
What is a fdsfdsf for the word "adjust"?
Reread these sentences: "This was his courtship, and it lasted all through the summer." What does the word "courtship" mean?"""
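# Straight quotes: replace the whole quoted span only if its content is 20+ characters long
txt = re.sub(r'''"([^"]*)"''', lambda m: '<quote>' if len(m.group(1))>19 else m.group(), txt)
# Curly quotes: same length threshold, handled by a separate pattern
txt = re.sub(r'“[^“”]{20,}”', '<quote>', txt)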
print(txt)
See Python proof. Use a separate command for each type of quote mark; this makes the behavior easier to control.
Results:
Why did the author use three sentences in a row that start with the words, "it spun"?
Why did the queen most likely say <quote>
Why did the fdsfdsf repeat the phrase "he waited" so many times?
Why were <quote>?
What is a fdsfdsf for the word "adjust"?
Reread these sentences: <quote> What does the word "courtship" mean?
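Note that the expected output in the question keeps the quote marks around the placeholder, while the commands above replace the marks as well. If the marks should stay, a minimal sketch could look like the following. The names `QUOTE_PATTERNS`, `mask_quotes`, `min_len` and `placeholder` are hypothetical and introduced only for illustration, and it assumes that quotes are balanced within each sentence:

```python
import re

# Hypothetical helper, not part of the commands above: one pattern per quote
# style, each capturing the opening mark, the quoted text, and the closing mark.
QUOTE_PATTERNS = [
    re.compile(r'(")([^"]*)(")'),    # straight double quotes
    re.compile(r'(“)([^“”]*)(”)'),   # curly double quotes
]

def mask_quotes(text, min_len=20, placeholder="<quote>"):
    """Mask quoted spans of at least min_len characters, keeping the quote marks."""
    def repl(m):
        opening, body, closing = m.groups()
        if len(body) >= min_len:
            return f"{opening}{placeholder}{closing}"
        return m.group()  # short quotes such as "courtship" stay untouched

    # Apply each quote style separately so every style can be tuned on its own
    for pattern in QUOTE_PATTERNS:
        text = pattern.sub(repl, text)
    return text

sample = ('Reread these sentences: "This was his courtship, and it lasted '
          'all through the summer." What does the word "courtship" mean?')
print(mask_quotes(sample))
```

Because `[^"]*` cannot cross a quote mark, each match stops at the nearest closing mark. The `.{20,}` in the question is greedy and runs to the last quote mark in the string, which is why it swallowed `"courtship"`.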