Tags: python, regex, string, position

Positions of substrings in string


I need to find all the positions of a word in a text, i.e. of a substring in a string. My solution so far uses a regex, but I am not sure whether there are better strategies, maybe built into the standard library. Any ideas?

import re

text = "The quick brown fox jumps over the lazy dog. fox. Redfox."
links = {'fox': [], 'dog': []}
# a word must be delimited by the start/end of the text or by a character
# other than \w, "-" or "/"
re_capture = r"(^|[^\w\-/])(%s)([^\w\-/]|$)" % "|".join(re.escape(k) for k in links)

for match in re.finditer(re_capture, text):

    # fix position by context: the match includes its delimiters,
    # e.g. (' ', 'fox', ' ')
    m_groups = match.groups()
    start, end = match.span()
    start = start + len(m_groups[0])
    end = end - len(m_groups[2])

    key = m_groups[1]
    links[key].append((start, end))

print(links)

{'fox': [(16, 19), (45, 48)], 'dog': [(40, 43)]}

Edit: Partial words must not match; note that the fox in Redfox is not in links.

Thanks.


Solution

  • If you want to match whole words and your strings contain only ASCII:

    from string import punctuation

    text = "fox The quick brown fox jumps over the fox! lazy dog. fox!."
    links = {'fox': [], 'dog': []}

    def yield_words(s, d):
        i = 0  # offset of the current chunk within s
        for ele in s.split(" "):
            tot = len(ele) + 1  # chunk length plus the space removed by split
            ele = ele.rstrip(punctuation)  # strips trailing punctuation only
            ln = len(ele)
            if ele in d:
                d[ele].append((i, ln + i))
            i += tot
        return d
    

    Unlike the find solution, this won't match partial words, and it runs in O(n) time:

    In [2]: text = "The quick brown fox jumps over the lazy dog. fox. Redfox."
    
    In [3]: links = {'fox': [], 'dog': []}
    
    In [4]: yield_words(text,links)
    Out[4]: {'dog': [(40, 43)], 'fox': [(16, 19), (45, 48)]}
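
The find solution referred to above is not reproduced here. As a rough sketch of what such an approach might look like (the helper name find_all is an assumption, not taken from that answer), a plain str.find loop reports every occurrence, including the one inside Redfox:

```python
def find_all(text, word):
    """Return (start, end) spans of every occurrence of word in text."""
    spans = []
    start = text.find(word)
    while start != -1:
        spans.append((start, start + len(word)))
        # Resume just past this hit; str.find knows nothing about word boundaries.
        start = text.find(word, start + len(word))
    return spans


text = "The quick brown fox jumps over the lazy dog. fox. Redfox."
print(find_all(text, "fox"))  # [(16, 19), (45, 48), (53, 56)]
```

The third span, (53, 56), is the partial match inside Redfox that the whole-word approaches avoid.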
    

    This is probably one case where a regex is a good approach; it can just be much simpler:

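    import re

    def reg_iter(s, d):
        # escape each word so any regex metacharacters in it match literally
        r = re.compile("|".join([r"\b{}\b".format(re.escape(w)) for w in d]))
        for match in r.finditer(s):
            # append to the dict that was passed in, not to the global links
            d[match.group()].append((match.start(), match.end()))
        return d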
    

    Output:

    In [6]: links = {'fox': [], 'dog': []}
    
    In [7]: text = "The quick brown fox jumps over the lazy dog. fox. Redfox."
    
    In [8]: reg_iter(text, links)
    Out[8]: {'dog': [(40, 43)], 'fox': [(16, 19), (45, 48)]}
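
One difference from the pattern in the question: \b treats "-" and "/" as word boundaries, so fox in fox-trot would match, whereas the question's character class [^\w\-/] rules that out. If that behaviour matters, a possible variant is to keep the question's character class but move it into lookarounds. The following is only a sketch under that assumption; the function name find_word_spans is made up for illustration. Because lookarounds do not consume the neighbouring characters, match.span() is already the span of the word and needs no correction.

```python
import re


def find_word_spans(text, words):
    """Map each word to the (start, end) spans where it occurs as a whole word."""
    spans = {word: [] for word in words}
    # Try longer words first so a word is not shadowed by one of its prefixes.
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in ordered)
    # A word may not be preceded or followed by \w, "-" or "/".
    pattern = re.compile(r"(?<![\w\-/])(?:%s)(?![\w\-/])" % alternatives)
    for match in pattern.finditer(text):
        spans[match.group()].append(match.span())
    return spans


text = "The quick brown fox jumps over the lazy dog. fox. Redfox. fox-trot"
print(find_word_spans(text, ["fox", "dog"]))
# {'fox': [(16, 19), (45, 48)], 'dog': [(40, 43)]}
```

Since the delimiters are not part of the match, two hits separated by a single space, as in "fox fox", are both found; a pattern that captures the delimiter consumes the space and skips the second hit.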