python, regex, qregularexpression

Regex: match a whole word together with its trailing punctuation using re.search()


I'm new to regex. My aim is to match a whole word that may end with either '.' or '-'. I want to keep that trailing character in the match so that it is included in the .start() and .end() position calculation.

import re

txt = "The indian in. Spain."
pattern = "in."

x = re.search(r"\b" + pattern + r"\b", txt)

print(x.start(), x.end())

I want the position of the word 'in.' in "The indian in. Spain.". The expression I have used raises an error because re.search() returns None (a NoneType object). What expression would match the '.' in the code above? The same question applies if '-' is present instead of '.'.


Solution

  • There are two issues here.

    1. In regex, . is special. It means "match any single character" (any character except a newline, by default). However, you are trying to use it to match a literal period. (It will indeed match that, but it will also match everything else.) To match only a period, use the escaped pattern \. instead. To match either a period or a hyphen, you can use a character class, like [-.].
    2. You are using \b at the end of your pattern to match the word boundary, but \b is defined as the boundary between a word character and a non-word character, and periods and spaces are both non-word characters. There is therefore no word boundary between the period and the space that follows it, so re.search() finds no match and returns None. Instead, you could use a lookahead assertion, which checks that the text that follows matches a given pattern but doesn't consume it, so it is not included in the match.
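    The two issues above can be seen in isolation in a short sketch. It reuses the text from the question; the variable names are only illustrative.

```python
import re

txt = "The indian in. Spain."

# An unescaped "." matches any character except a newline,
# so "in." first matches the "ind" inside "indian".
dot_any = re.search(r"in.", txt)

# Escaping the dot restricts the match to a literal period.
dot_literal = re.search(r"in\.", txt)

# A trailing \b fails: "." and the space after it are both
# non-word characters, so there is no boundary between them.
with_boundary = re.search(r"\bin\.\b", txt)

print(dot_any.group(), dot_any.span())          # ind (4, 7)
print(dot_literal.group(), dot_literal.span())  # in. (11, 14)
print(with_boundary)                            # None
```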

    Now, to match a whole word - any word - you can do something like \w+, which matches one or more word characters.

    Also, it is quite possible that there won't be a match at all, in which case re.search() returns None. You should therefore check whether a match occurred, using an if statement or a try statement, before calling .start() or .end(). Putting it all together:

    import re

    txt = "The indian in. Spain."
    pattern = r"\w+[-.]"
    x = re.search(r"\b" + pattern + r"(?=\W)", txt)
    if x:
        print(x.start(), x.end())
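
    For comparison, here is a minimal sketch of the try-statement variant, applied to a hyphen instead of a period. The helper name word_span and both sample strings are assumptions made for illustration, not part of the question.

```python
import re

def word_span(text):
    """Hypothetical helper: span of the first word ending in '-' or '.', else None."""
    x = re.search(r"\b\w+[-.](?=\W)", text)
    try:
        return x.start(), x.end()
    except AttributeError:
        # re.search() returned None, so there was no match.
        return None

print(word_span("The indian in- Spain."))  # (11, 14)
print(word_span("no punctuation here"))    # None
```

    The if check is usually the simpler choice. The try form mainly helps when the match object is used in several places.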
    

    Edit

    There is one problem with the lookahead assertion above: it fails at the end of the string. This means that if your text is "The rain in Spain." then it won't match "Spain.", as there is no non-word character following the final period.

    To fix this, you can use a negative lookahead assertion, which succeeds when the text that follows does not match the given pattern, and likewise does not consume any of the string.

    x = re.search(r"\b" + pattern + r"(?!\w)", txt)
    

    This will match when the character after the word is anything other than a word character, including the end of the string.
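
    As a rough check of the difference, the sketch below runs both lookaheads on the end-of-string example. The last part is an assumption about the original goal: if you want to find one specific word such as "in." rather than any word, re.escape() can build the literal pattern for you. The leading \b only works if the word starts with a word character.

```python
import re

pattern = r"\w+[-.]"
txt = "The rain in Spain."

# Positive lookahead needs a non-word character after the match,
# so it fails at the end of the string.
positive = re.search(r"\b" + pattern + r"(?=\W)", txt)

# Negative lookahead only requires that no word character follows,
# which is also true at the end of the string.
negative = re.search(r"\b" + pattern + r"(?!\w)", txt)

print(positive)         # None
print(negative.span())  # (12, 18)

# Matching one specific word: re.escape() turns "in." into "in\.".
word = "in."
literal = re.search(r"\b" + re.escape(word) + r"(?!\w)", "The indian in. Spain.")
print(literal.span())   # (11, 14)
```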