javascript regex special-characters highlight

JavaScript regex treats Swedish characters as special characters and matches incorrectly

I am currently working on a JavaScript feature that highlights search results. Specifically, searching for a word such as 'sea' within a sentence such as 'the sea causes me nausea in this season' should highlight the word 'sea' itself and any place where it acts as a prefix, as in 'season'. However, I do not want to highlight 'sea' when it appears as a postfix, as in 'nausea', or in the middle of a word, as in 'disease'.

To achieve this, I am using the regular expression /\bsea/gmi, which works perfectly with English characters. However, it fails to produce the desired results when the text contains Swedish characters such as 'ä', 'å', and 'ö'. For example, if the search word is 'gen', the postfix 'gen' in the word 'vägen' is incorrectly highlighted. It seems that the regular expression treats these characters as special characters or something similar. I even tried adding the Unicode flag u, but that didn't help either.

Since my expertise lies mainly in C#, I'm not familiar with how JavaScript behaves in this context. I would greatly appreciate any insights or guidance on how JavaScript handles these situations or how to work around this problem.

Solution

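JavaScript's regex engine does not change the behavior of \b when the u flag is present: \b is still defined in terms of the ASCII-only \w ([A-Za-z0-9_]), so 'ä' counts as a non-word character and a boundary is found between 'ä' and 'g' in 'vägen'. Fortunately, you can imitate a Unicode-aware boundary using Unicode property escapes.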

In this exact case your regex would look like this: /(?<![\p{L}\p{N}_])gen/gmiu.

Here we check (using a negative lookbehind) that gen is not immediately preceded by any of:

\p{L}: a letter (in any language),
\p{N}: a number (any numeric character, in any script),
_: the underscore.

In effect, [\p{L}\p{N}_] is a Unicode-aware alternative to \w. Note that in some other regex engines, for example PCRE, \w already behaves this way under the u flag.
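
As a rough sketch of how the lookbehind could be plugged into a highlighter: the function names escapeRegExp and highlightPrefix and the <mark> wrapper are my own assumptions, not part of the question, and the code assumes an engine that supports lookbehind and \p{...} (recent browsers and Node.js).

```javascript
// Hypothetical helper: escape regex metacharacters in the user's search term
// so that input such as "a+" is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wrap every occurrence of `term` that starts a word
// (or is a whole word) in <mark> tags. The lookbehind rejects matches that are
// preceded by a letter, a number, or an underscore in any script.
function highlightPrefix(sentence, term) {
  const pattern = new RegExp(
    `(?<![\\p{L}\\p{N}_])${escapeRegExp(term)}`,
    "giu"
  );
  return sentence.replace(pattern, (match) => `<mark>${match}</mark>`);
}

console.log(highlightPrefix("the sea causes me nausea in this season", "sea"));
// → the <mark>sea</mark> causes me nausea in this <mark>sea</mark>son

console.log(highlightPrefix("vägen och gen", "gen"));
// → vägen och <mark>gen</mark>
```

The m flag from the original pattern is left out here because the pattern contains no ^ or $, so it would have no effect.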

In the general case, \b can be replaced with /(?<![\p{L}\p{N}_])(?=[\p{L}\p{N}_])|(?<=[\p{L}\p{N}_])(?![\p{L}\p{N}_])/gmu.
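
One way to reuse this general replacement is to keep it in a string and splice it in wherever \b would otherwise go. The sketch below is only an illustration: WORD, UNICODE_BOUNDARY and wholeWordRegExp are names I made up, and the non-capturing group is added so that the alternation does not leak into the surrounding pattern.

```javascript
// Unicode-aware word class: letters, numbers and underscore in any script.
const WORD = "[\\p{L}\\p{N}_]";

// Stand-in for \b: a position with a word character on exactly one side.
const UNICODE_BOUNDARY =
  `(?:(?<!${WORD})(?=${WORD})|(?<=${WORD})(?!${WORD}))`;

// Hypothetical helper: build a regex that matches `term` only as a whole word.
function wholeWordRegExp(term) {
  // Escape regex metacharacters so the term is matched literally.
  const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`${UNICODE_BOUNDARY}${escaped}${UNICODE_BOUNDARY}`, "giu");
}

// The built-in \b sees a boundary between "ä" and "g", so this is true:
console.log(/\bgen\b/u.test("vägen"));
// → true

// The Unicode-aware version only finds the standalone word:
console.log("vägen gen regen".match(wholeWordRegExp("gen")));
// → [ 'gen' ]
```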
