Search code examples
pythonregexstringsubstringfindall

How to return full substring from partial substring match in python as a list?


I have different length strings which have to be checked for substrings which match patterns of "tion", "ex", "ph", "ost", "ast", "ist" ignoring the case and the position i.e. prefix/suffix/middle of word. The matching words have to be returned in a new list rather than the matching substring element alone. With the below code I can return a new list of matching substring element without the full matching word.

def latin_ish_words(text):
    import re
    pattern=re.compile(r"tion|ex|ph|ost|ast|ist")
    matches=pattern.findall(text)
    return matches
latin_ish_words("This functions as expected")

With the results as follows:['tion', 'ex']

I was wondering how I could return the whole word rather than the matching substring element into a newlist?


Solution

  • You can use

    pattern=re.compile(r"\w*?(?:tion|ex|ph|ost|ast|ist)\w*")
    pattern=re.compile(r"[a-zA-Z]*?(?:tion|ex|ph|ost|ast|ist)[a-zA-Z]*")
    pattern=re.compile(r"[^\W\d_]*?(?:tion|ex|ph|ost|ast|ist)[^\W\d_]*")
    

    The regex (see the regex demo) matches

    • \w*? - zero or more but as few as possible word chars
    • (?:tion|ex|ph|ost|ast|ist) - one of the strings
    • \w* - zero or more but as many as possible word chars

    The [a-zA-Z] part will match only ASCII letters, and [^\W\d_] will match any Unicode letters.

    Mind the use of the non-capturing group with re.findall, as otherwise, the captured substrings will also get their way into the output list.

    If you need to only match letter words, and you need to match them as whole words, add word boundaries, r"\b[a-zA-Z]*?(?:tion|ex|ph|ost|ast|ist)[a-zA-Z]*\b".

    See the Python demo:

    import re
    def latin_ish_words(text):
        import re
        pattern=re.compile(r"\w*?(?:tion|ex|ph|ost|ast|ist)\w*")
        return pattern.findall(text)
     
    print(latin_ish_words("This functions as expected"))
    # => ['functions', 'expected']