
Python Regex: Match a sentence starting with a title and containing "asked"


I want to extract every sentence that

  1. starts with a title (i.e. Mr, Miss, Ms or Dr)
  2. contains the word "asked"
  3. ends with a full stop (.)

I tried the regex below but got back an empty list. Thank you.

import re

text_list="26 Mr Kwek Hian Chuan Henry asked the Minister for the Environment and Water Resources whether Singapore will stay the course on fighting climate change and meet our climate change commitments despite the current upheavals in the energy market and the potential long-term economic impact arising from the COVID-19 situation. We agree with the Panel and will instead strengthen regulations to safeguard the safety of path users. With regard to Ms Rahayu Mahzam's suggestion of tapping on the Small Claims Tribunal for personal injury claims up to $20,000, we understand that the Tribunal does not hear personal injury claims.  Mr Gan Thiam Poh, Ms Rahayu Mahzam and Mr Melvin Yong have asked about online retailers of PMDs. Mr Melvin Yong asked about the qualifications and training of OEOs."

asked_regex=re.compile(r'^(Mr|Miss|Ms|Dr)(.|\n){1,}(asked)(.|\n){1,}\.$')
asked=re.findall(asked_regex, text_list)

Desired Output:
["Mr Kwek Hian Chuan Henry asked the Minister for the Environment and Water Resources whether Singapore will stay the course on fighting climate change and meet our climate change commitments despite the current upheavals in the energy market and the potential long-term economic impact arising from the COVID-19 situation. ",
"Mr Gan Thiam Poh, Ms Rahayu Mahzam and Mr Melvin Yong have asked about online retailers of PMDs.",
"Mr Melvin Yong asked about the qualifications and training of OEOs."]


Solution

  • Try this regex pattern:

    import re
    
    text_list="26 Mr Kwek Hian Chuan Henry asked the Minister for the Environment and Water Resources whether Singapore will stay the course on fighting climate change and meet our climate change commitments despite the current upheavals in the energy market and the potential long-term economic impact arising from the COVID-19 situation. We agree with the Panel and will instead strengthen regulations to safeguard the safety of path users. With regard to Ms Rahayu Mahzam's suggestion of tapping on the Small Claims Tribunal for personal injury claims up to $20,000, we understand that the Tribunal does not hear personal injury claims.  Mr Gan Thiam Poh, Ms Rahayu Mahzam and Mr Melvin Yong have asked about online retailers of PMDs. Mr Melvin Yong asked about the qualifications and training of OEOs."
    
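    # (?:...) is a non-capturing group, so findall returns the whole sentence rather than just the title
    asked_regex=re.compile(r'(?:Mr|Miss|Ms|Dr)[^\.]*asked[^\.]*\.')
    asked=re.findall(asked_regex, text_list)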
    

    (Mr|Miss|Ms|Dr)

    This matches one of the titles Mr, Miss, Ms or Dr wherever it occurs in the text. (Your pattern is anchored with ^, so it could only match a title at the very start of the string, and your text starts with "26".)

    [^\.]*asked[^\.]*

    This part matches any run of text that contains "asked" and has no full stop (.) before or after it, so a match cannot run across a sentence boundary.

    \.

    This checks that the sentence ends with a full stop (.).
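
To see the three pieces working together, here is a minimal sketch on a short made-up sample (the sample string and variable names are illustrative, not taken from the question). It reads the full matched text with `re.finditer` and `match.group()`, and contrasts the result with the anchored pattern from the question.

```python
import re

# Hypothetical short sample, not the text from the question.
sample = "12 Mr Tan asked about buses. We agree. Ms Lee has asked about trains."

# Title, then "asked", with no full stop allowed in between, then a full stop.
pattern = re.compile(r'(Mr|Miss|Ms|Dr)[^\.]*asked[^\.]*\.')
sentences = [m.group() for m in pattern.finditer(sample)]
print(sentences)
# ['Mr Tan asked about buses.', 'Ms Lee has asked about trains.']

# Pattern from the question: ^ requires a title at position 0, but the sample
# (like the original text) starts with a number, so nothing matches.
anchored = re.compile(r'^(Mr|Miss|Ms|Dr)(.|\n){1,}(asked)(.|\n){1,}\.$')
anchored_result = anchored.search(sample)
print(anchored_result)
# None
```

The second sentence, "We agree.", is skipped because it contains neither a title nor "asked".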


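    The regex itself is correct. The catch with findall is that when a pattern contains a capturing group such as (Mr|Miss|Ms|Dr), re.findall returns only the text captured by that group (the titles) instead of the whole match. Writing the group as non-capturing, (?:Mr|Miss|Ms|Dr), or calling match.group() on the results of re.finditer, gives the full sentences. Here is the code that regex101.com generated from the pattern; it uses re.finditer and works.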

    # coding=utf8
    # the above tag defines encoding for this document and is for Python 2.x compatibility
    
    import re
    
    regex = r"(Mr|Miss|Ms|Dr)[^\.]*asked[^\.]*\."
    
    test_str = "26 Mr Kwek Hian Chuan Henry asked the Minister for the Environment and Water Resources whether Singapore will stay the course on fighting climate change and meet our climate change commitments despite the current upheavals in the energy market and the potential long-term economic impact arising from the COVID-19 situation. We agree with the Panel and will instead strengthen regulations to safeguard the safety of path users. With regard to Ms Rahayu Mahzam's suggestion of tapping on the Small Claims Tribunal for personal injury claims up to $20,000, we understand that the Tribunal does not hear personal injury claims.  Mr Gan Thiam Poh, Ms Rahayu Mahzam and Mr Melvin Yong have asked about online retailers of PMDs. Mr Melvin Yong asked about the qualifications and training of OEOs."
    
    matches = re.finditer(regex, test_str, re.MULTILINE)
    
    for matchNum, match in enumerate(matches, start=1):
        
        print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
        
        for groupNum in range(0, len(match.groups())):
            groupNum = groupNum + 1
            
            print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
    
    # Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
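
For comparison, here is a small sketch of how `re.findall` behaves with a capturing versus a non-capturing group, on a made-up sample. The `\b` word boundaries in the last pattern are an optional hardening and an assumption on my part, not part of the answer above; they stop "Mrs" and "basked" from being accepted as "Mr" and "asked".

```python
import re

# Hypothetical sample; "Mrs" and "basked" are included to show the effect of \b.
sample = "Mr Tan asked about buses. Mrs Ong basked in the sun. Dr Lim asked why."

# One capturing group: findall returns only what the group captured.
titles_only = re.findall(r'(Mr|Miss|Ms|Dr)[^.]*asked[^.]*\.', sample)
print(titles_only)
# ['Mr', 'Mr', 'Dr']

# Non-capturing group (?:...): findall returns the whole match.
whole = re.findall(r'(?:Mr|Miss|Ms|Dr)[^.]*asked[^.]*\.', sample)
print(whole)
# ['Mr Tan asked about buses.', 'Mrs Ong basked in the sun.', 'Dr Lim asked why.']

# Optional hardening (not from the answer): \b requires whole-word matches.
strict = re.compile(r'\b(?:Mr|Miss|Ms|Dr)\b[^.]*\basked\b[^.]*\.')
strict_matches = strict.findall(sample)
print(strict_matches)
# ['Mr Tan asked about buses.', 'Dr Lim asked why.']
```

None of these patterns handle titles written with a full stop, such as "Dr.", because the full stop is treated as the end of a sentence.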