python, regex, regex-lookarounds

Regex match two strings with given number of words in between strings


I want to match the text from one string through another, including both strings, when at most a given number of words lies between them.

For example:

text = 'I want apples and oranges'
The parameters are 'apples', 'oranges' and k=2, where k is the maximum number of words allowed between the two strings. I expect the output to be 'apples and oranges', because there is only one word between the two given strings.

This is very similar to what a lookbehind such as (?<=...) does in regex, but I am unable to limit the number of words in between, and I want the whole relevant text extracted rather than just what's in between.

What I have now:

import re
text = 'I want apples and oranges'
pattern = "(?<=apples)(.*)(?=oranges)"
m = re.search(pattern, text)
print(m)

<re.Match object; span=(13, 18), match=' and '>

This outputs ' and '. But I want the output to be 'apples and oranges' rather than only what's in between. I also want to limit the number of words allowed between apples and oranges. For example, with k = 2 and the sentence "I want apples along with some oranges", there should be no match, because there are 3 words between apples and oranges.

Does anyone know if I can do this with regex as well?


Solution

  • You can use something like

    import re
    text = 'I want apples and oranges'
    k = 2
    pattern = rf"apples(?:\s+\w+){{0,{k}}}\s+oranges"  # raw f-string, so \s and \w reach the regex engine as written
    m = re.search(pattern, text)
    if m:
        print(m.group())
    
    # => apples and oranges
    

    Here, I used \w+ to match a word. If a word is any chunk of non-whitespace characters, use

    pattern = rf"apples(?:\s+\S+){{0,{k}}}\s+oranges"
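
To make the difference between \w+ and \S+ concrete, here is a small sketch. The sample sentence with "and/or" is an assumption chosen for illustration; it is not from the question.

```python
import re

k = 2
# Hypothetical sample text: the in-between "word" contains a slash.
text = "I want apples and/or oranges"

# \w+ only accepts letters, digits and underscores, so "and/or" is not one word.
word_pattern = rf"apples(?:\s+\w+){{0,{k}}}\s+oranges"
# \S+ accepts any run of non-whitespace characters, so "and/or" counts as one word.
chunk_pattern = rf"apples(?:\s+\S+){{0,{k}}}\s+oranges"

word_match = re.search(word_pattern, text)
chunk_match = re.search(chunk_pattern, text)

print(word_match)           # no match with \w+
print(chunk_match.group())  # the full span with \S+
```

Which variant is right depends on whether punctuation inside a token should still count as part of a single word.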
    


    If you need to add word boundaries, see the posts "Word boundary with words starting or ending with special characters gives unexpected results" and "Match a whole word in a string using dynamic regex". For the current example, where both strings start and end with word characters, fr"\bapples(?:\s+\w+){{0,{k}}}\s+oranges\b" will work.
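
As a rough illustration of what the word boundaries change, the sketch below runs both variants on a sentence containing "pineapples". The sentence and the names loose and strict are assumptions made for this example.

```python
import re

k = 2
# Hypothetical sample text: "apples" appears only as the tail of "pineapples".
text = "I want pineapples and oranges"

loose = rf"apples(?:\s+\w+){{0,{k}}}\s+oranges"
strict = rf"\bapples(?:\s+\w+){{0,{k}}}\s+oranges\b"

loose_match = re.search(loose, text)
strict_match = re.search(strict, text)

print(loose_match.group())  # matches inside "pineapples"
print(strict_match)         # \b rejects the mid-word position
```

Without \b the pattern can start in the middle of a longer word, which is usually not what is wanted when the inputs are meant to be whole words.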

    After k is substituted, the pattern looks like apples(?:\s+\w+){0,k}\s+oranges and matches:

    • apples - the literal string apples
    • (?:\s+\w+){0,k} - zero to k repetitions of one or more whitespace characters followed by one or more word characters
    • \s+ - one or more whitespace characters
    • oranges - the literal string oranges.
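
The pieces above can be wrapped in a small helper when the two strings and k arrive as parameters. This is a minimal sketch, not part of the original answer: the function name extract_between is hypothetical, and re.escape is added on the assumption that the input strings may contain regex metacharacters.

```python
import re

def extract_between(text, start, end, k):
    """Return the text from `start` through `end` if at most k words
    separate them, otherwise None. (Hypothetical helper for illustration.)"""
    # re.escape keeps characters such as "." or "+" in the inputs literal.
    pattern = rf"{re.escape(start)}(?:\s+\w+){{0,{k}}}\s+{re.escape(end)}"
    m = re.search(pattern, text)
    return m.group() if m else None

print(extract_between("I want apples and oranges", "apples", "oranges", 2))
print(extract_between("I want apples along with some oranges", "apples", "oranges", 2))
```

With k=2 the first call returns the span 'apples and oranges', while the second returns None because three words separate the two strings; raising k to 3 lets the second sentence match as well.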