regex word-boundary

Why doesn't the regex \b metacharacter match Turkish characters at the end of a word?


I am trying to extract only the words from a string using regex. The string contains the Turkish characters çğıİöşü.

I tried the pattern \b[\wçğıİöşü]+\b, but it does not work as expected.

I expected the pattern to match Behiç and Güneş completely, but it only matches Behi and Güne. What is the correct pattern to match Behiç and Güneş?


Solution

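  • You are getting this result because the default flavor in Regex101 is PCRE (PHP) with Unicode support turned off. In that mode \w matches only ASCII letters, digits, and the underscore, and \b is defined in terms of \w: it matches only between a \w character and a non-\w character. Adding çğıİöşü to the character class does not change what \b treats as a word character, so there is no word boundary between ç and the following space, and the engine backtracks to the boundary between i and ç. If you change the flavor to Python, whose string patterns are Unicode-aware by default, you will see the behavior you expect.

    Alternatively, stay with PCRE and turn on Unicode support (the u flag), and your problem should be solved.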
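The difference can be reproduced outside Regex101. The sketch below uses Python's `re` module and assumes that the `re.ASCII` flag is a fair approximation of an engine with Unicode support turned off, since it restricts `\w` and `\b` to ASCII. It is not a byte-for-byte emulation of PCRE. The sample text and the variable names are made up for illustration. The last pattern is an alternative that is not part of the answer above: it replaces `\b` with lookarounds built from the same character class, so it does not rely on how the engine defines a word character.

```python
import re

text = "Behiç Güneş"  # illustrative sample, not the asker's full input
pattern = r"\b[\wçğıİöşü]+\b"

# ASCII-only \w and \b: ç and ş are not word characters here, so \b
# cannot match after them and the match stops one letter early.
ascii_matches = re.findall(pattern, text, flags=re.ASCII)
print(ascii_matches)  # ['Behi', 'Güne']

# Default (Unicode-aware) mode for str patterns: \w already covers the
# Turkish letters, so the extra characters in the class are not needed.
unicode_matches = re.findall(r"\b\w+\b", text)
print(unicode_matches)  # ['Behiç', 'Güneş']

# Alternative sketch: emulate the boundary with lookarounds that use the
# same character class, which works even with ASCII-only \w.
letter = r"[\wçğıİöşü]"
lookaround_pattern = rf"(?<!{letter}){letter}+(?!{letter})"
lookaround_matches = re.findall(lookaround_pattern, text, flags=re.ASCII)
print(lookaround_matches)  # ['Behiç', 'Güneş']
```

If the engine's Unicode mode can be switched on, the plain `\b\w+\b` form is the simpler choice. The lookaround form is mainly useful when it cannot.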