python regex pattern-matching markup proximity

Python regex to print all sentences that contain two identified classes of markup


I want to read in an XML file, find all sentences that contain both the markup <emotion> and the markup <LOCATION>, and then print each of those sentences on its own line. Here is a sample of my code:

import re

text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer <pronoun> I </pronoun> have ever heard." 

out = open('out.txt', 'w')

for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bwonderful(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
    out.write(line + '\n')

out.close()

The regex here grabs all sentences with "wonderful" and "omaha" in them, and returns:

Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>.

That is exactly the right result for those two keywords, but what I really want is to print all sentences that contain both <emotion> and <LOCATION>. For some reason, though, when I replace "wonderful" in the regex above with "emotion", the regex returns no output at all. So the following code yields no result:

import re

text = "Cello is a <emotion> wonderful </emotion> parakeet who lives in <LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> singer I have ever heard." 

out = open('out.txt', 'w')

for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
    line = ''.join(str(x) for x in match)
    out.write(line + '\n')

out.close()

My question is: How can I modify my regular expression in order to grab only those sentences that contain both <emotion> and <LOCATION>? I would be most grateful for any help others can offer on this question.

(For what it's worth, I'm working on parsing my text in BeautifulSoup as well, but wanted to give regular expressions one last shot before throwing in the towel.)


Solution

  • The problem appears to be that your regex expects the matched word to be followed by whitespace (\s), a period, or the end of the string, as this lookahead shows:

    emotion(?=\s|\.|$)
    

    When "emotion" is part of a tag, it is followed by a > rather than by whitespace or a period, so that lookahead fails and no match is found. To fix it, add the > after emotion:

    for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bomaha(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
        line = ''.join(str(x) for x in match)
    

    In my testing, this solves the problem. Be sure to treat "LOCATION" the same way, so that the pattern looks for the <LOCATION> tag rather than the word "omaha":

    for match in re.findall(r'(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\bemotion>(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\bLOCATION>(?=\s|\.|$)).*?\.(?=\s|$))', text, flags=re.I):
        line = ''.join(str(x) for x in match)
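
To make the explanation above easier to check, here is a small self-contained sketch. It first shows that the bare keyword never satisfies the original lookahead when it sits inside a tag, and then builds the corrected pattern from a list of tag names. The helper name sentences_with_tags and the TAIL constant are my own, introduced only for illustration; the pattern itself is the one from the question with the > added.

```python
import re

text = (
    "Cello is a <emotion> wonderful </emotion> parakeet who lives in "
    "<LOCATION> Omaha </LOCATION>. He is the <emotion> best </emotion> "
    "singer I have ever heard."
)

# Lookahead taken from the original pattern: the keyword must be followed
# by whitespace, a period, or the end of the string.
TAIL = r'(?=\s|\.|$)'

# Inside a tag, "emotion" is always followed by ">", so this finds nothing.
print(re.search(r'\bemotion' + TAIL, text))                # None
# With ">" included in the keyword, the lookahead sees the space after the tag.
print(re.search(r'\bemotion>' + TAIL, text) is not None)   # True


def sentences_with_tags(text, tags):
    """Hypothetical helper: return the sentences that contain every tag in `tags`."""
    # One lookahead per tag; each scans forward lazily but never crosses
    # a period that ends a sentence.
    conditions = ''.join(
        r'(?=(?:(?!\.(?:\s|$)).)*?\b' + re.escape(tag) + '>' + TAIL + ')'
        for tag in tags
    )
    pattern = r'(?:(?<=\.)\s+|^)(' + conditions + r'.*?\.(?=\s|$))'
    # With a single capture group, findall returns a list of strings.
    return re.findall(pattern, text, flags=re.I)


for sentence in sentences_with_tags(text, ['emotion', 'LOCATION']):
    print(sentence)
```

On the sample text, the final loop prints only the first sentence, because the second sentence contains <emotion> but no <LOCATION>. Since findall already returns each sentence as a string here, the ''.join(...) step from the original loop is not needed.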