Search code examples
pythonregexlistpython-re

Python Regex : I would like to find all overlapping and non overlapping pattern matches


I would like to find all overlapping and non overlapping pattern matches

Here is the code:

import re

words = [r"\bhello\b",r"\bworld\b",r"\bhello world\b"] 
sentence = "Hola hello world and hello"

for word in words:
    for match in re.finditer(word,sentence):
        print(match.span(),match.group())

Gives me the following result (I'm happy about this, but need an efficient way)

(5, 10) hello
(21, 26) hello
(11, 16) world
(5, 16) hello world 

I know this is not very efficient. Example : Assume I have 20k words and 10k sentences, it would be 200M x 2 calls to re.match which takes a lot of time.

Could you please suggest me an efficient way to solve the problem?


Solution

  • Different order, but same result and significantly faster likely:

    import re
    
    substrings=['hello','world','hello world']
    joined='|'.join(substrings)
    reg=re.compile(rf"\b(?={joined}\b)")
    
    for m in reg.finditer(sentence):
        for e in substrings:
            offset=m.span()[0]
            if sentence[offset:offset+len(e)]==e:
                print((offset,offset+len(e)), e)
    

    If you want to make sure that you do not match hello worldLY (ie, just the prefix of the substring) you can do:

    substrings=['hello','world','hello world']
    joined='|'.join(substrings)
    reg=re.compile(rf"\b(?={joined}\b)")
    indvidual=list(map(re.compile, [rf'\b({e})\b' for e in substrings]))
    for m in reg.finditer(sentence):
        for i,e in enumerate(indvidual):
            offset=m.span()[0]
            if m2:=e.match(sentence[offset:]):
                print((offset,offset+m2.span()[1]), substrings[i])
    

    Either prints:

    (5, 10) hello
    (5, 16) hello world
    (11, 16) world
    (21, 26) hello