python python-3.x regex string regex-group

How can I generalize this regex so that it also captures the substrings when the name is at the start or the end of the string?


import re

name = "John"

# The regex works on these examples
input_sense_aux = "These sound system are too many, I think John can help us, otherwise it will be waiting for a while longer"
#input_sense_aux = "These sound system are too many but I know that John can help us, otherwise it will be waiting for a while longer"
#input_sense_aux = "These sound system are too many but I know that John can help us. otherwise it will be waiting for a while longer"
#input_sense_aux = "Do you know if John with the others could come this afternoon?"

# The regex finds no match on these examples
#input_sense_aux = "John can help us, otherwise it will be waiting for a while longer"
#input_sense_aux = "Can you help us, otherwise it will be waiting for a while longer for John"
#input_sense_aux = "sorry! can you help us? otherwise it will be waiting for a while longer for John"



regex_patron_m1 = r"\s*((?:\w\s*)+)\s*?" + name + r"\s*((?:\w\s*)+)\s*\??"
m1 = re.search(regex_patron_m1, input_sense_aux, re.IGNORECASE)  # Check whether the regex matches, i.e. whether the block below runs
if m1:
    something_1, something_2 = m1.groups()

    something_1 = something_1.strip()
    something_2 = something_2.strip()

    print(repr(something_1))
    print(repr(something_2))

I need the regex to grab the content before "John" like this:

(start of sentence|¿|¡|,|;|:|(|[|.) \s* "content for something_1" \s* John

And then:

John \s* "content for something_2" \s* (end of sentence|?|!|,|;|:|)|]|.)

In the first examples, the regex works fine:

'These sound system are too many but I know that'
'can help us'
'Do you know if'
'with the others could come this afternoon'

However, on the last three examples the regex finds no match.

I need help generalizing the regex so that it covers these cases too, while still respecting the delimiter conditions above for extracting something_1 and something_2.

For those three examples, the expected results are:

''
' can help us'
' otherwise it will be waiting for a while longer for '
''
' otherwise it will be waiting for a while longer for '
''

Solution

  • You can use

    import re
    
    name = "John"
    
    input_sense_auxs = [
        "These sound system are too many, I think John can help us, otherwise it will be waiting for a while longer",
        "These sound system are too many but I know that John can help us, otherwise it will be waiting for a while longer",
        "These sound system are too many but I know that John can help us. otherwise it will be waiting for a while longer",
        "Do you know if John with the others could come this afternoon?",
    
        "John can help us, otherwise it will be waiting for a while longer",
        "Can you help us, otherwise it will be waiting for a while longer for John",
        "sorry! can you help us? otherwise it will be waiting for a while longer for John"]
    
    regex_patron_m1 = fr'(?:^|[?!¿¡,;:([.])\s*(?:(\w+(?:\s+\w+)*)\s*)?{name}(?:\s*(\w+(?:\s+\w+)*))?\s*(?:$|[]?!,;:).])'
    # r"\s*((?:\w\s*)+)\s*?" + name + r"\s*((?:\w\s*)+)\s*\??"
    for input_sense_aux in input_sense_auxs:
        print(f'--- {input_sense_aux} ---')
        m1 = re.search(regex_patron_m1, input_sense_aux, re.IGNORECASE)  # Check whether the regex matches, i.e. whether the block below runs
        if m1:
            something_1, something_2 = m1.groups()
    
            something_1 = something_1.strip() if something_1 else ""
            something_2 = something_2.strip() if something_2 else ""
    
            print(repr(something_1))
            print(repr(something_2))
    

    Output:

    --- These sound system are too many, I think John can help us, otherwise it will be waiting for a while longer ---
    'I think'
    'can help us'
    --- These sound system are too many but I know that John can help us, otherwise it will be waiting for a while longer ---
    'These sound system are too many but I know that'
    'can help us'
    --- These sound system are too many but I know that John can help us. otherwise it will be waiting for a while longer ---
    'These sound system are too many but I know that'
    'can help us'
    --- Do you know if John with the others could come this afternoon? ---
    'Do you know if'
    'with the others could come this afternoon'
    --- John can help us, otherwise it will be waiting for a while longer ---
    ''
    'can help us'
    --- Can you help us, otherwise it will be waiting for a while longer for John ---
    'otherwise it will be waiting for a while longer for'
    ''
    --- sorry! can you help us? otherwise it will be waiting for a while longer for John ---
    'otherwise it will be waiting for a while longer for'
    ''
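
    For context, the pattern from the question misses the last three inputs because each capture group is `((?:\w\s*)+)`, and the `+` requires at least one word character on both sides of the name. A minimal sketch that checks this (the `samples` and `results` names are only for illustration):

```python
import re

name = "John"
# The pattern from the question, unchanged.
original = r"\s*((?:\w\s*)+)\s*?" + name + r"\s*((?:\w\s*)+)\s*\??"

samples = [
    # words on both sides of the name
    "Do you know if John with the others could come this afternoon?",
    # nothing before the name
    "John can help us, otherwise it will be waiting for a while longer",
    # nothing after the name
    "Can you help us, otherwise it will be waiting for a while longer for John",
]

# True where the original pattern finds a match, False otherwise.
results = [re.search(original, s, re.IGNORECASE) is not None for s in samples]
for s, matched in zip(samples, results):
    print(matched, "-", s)
```

    Only the first sample matches. The pattern above avoids this by wrapping each side in an optional group, so either side may be empty.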
    


    Details:

    • (?:^|[?!¿¡,;:([.])\s*(?:(\w+(?:\s+\w+)*)\s*)? - the prefix, the left-hand side part, that matches
      • (?:^|[?!¿¡,;:([.]) - either start of string or a char from the ?!¿¡,;:([. set
      • \s* - zero or more whitespaces
      • (?:(\w+(?:\s+\w+)*)\s*)? - an optional occurrence of
        • (\w+(?:\s+\w+)*) - Group 1: one or more word chars and then zero or more sequences of one or more whitespaces and one or more word chars
        • \s* - zero or more whitespaces
    • John - the name (interpolated from {name})
    • (?:\s*(\w+(?:\s+\w+)*))?\s*(?:$|[]?!,;:).]) - the right-hand part:
      • (?:\s*(\w+(?:\s+\w+)*))? - an optional occurrence of
        • \s* - zero or more whitespaces
        • (\w+(?:\s+\w+)*) - Group 2: one or more word chars and then zero or more occurrences of one or more whitespaces followed by one or more word chars
      • \s* - zero or more whitespaces
      • (?:$|[]?!,;:).]) - end of string or a char from the ]?!,;:). set
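
    The pieces described above can also be assembled from named parts, which may be easier to maintain. This is only a sketch: `LEFT_DELIMS`, `RIGHT_DELIMS`, `WORDS`, `build_pattern` and `split_around_name` are hypothetical names, and passing the name through `re.escape` is an added assumption so that a name containing regex metacharacters (such as a dot) is matched literally.

```python
import re

# Same delimiter sets as in the pattern above; the bracket is escaped
# inside the class for readability, which does not change the match.
LEFT_DELIMS = r"(?:^|[?!¿¡,;:(\[.])"
RIGHT_DELIMS = r"(?:$|[\]?!,;:).])"
# One or more words separated by whitespace, with no leading/trailing space.
WORDS = r"\w+(?:\s+\w+)*"


def build_pattern(name):
    """Compile the pattern for one name, matching the name literally."""
    return re.compile(
        rf"{LEFT_DELIMS}\s*(?:({WORDS})\s*)?{re.escape(name)}"
        rf"(?:\s*({WORDS}))?\s*{RIGHT_DELIMS}",
        re.IGNORECASE,
    )


def split_around_name(text, name):
    """Return (before, after) around the name, or None if nothing matches."""
    m = build_pattern(name).search(text)
    if m is None:
        return None
    # Groups that did not participate in the match become empty strings.
    return m.groups(default="")


print(split_around_name(
    "John can help us, otherwise it will be waiting for a while longer", "John"))
print(split_around_name("I think J.R. can help us", "J.R."))
```

    Because `WORDS` begins and ends with a word character, the captured groups carry no surrounding whitespace, so the `.strip()` calls are not needed in this variant.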
