javascript, regex, symbols

Regex not finding two-letter words that include Swedish letters


I am very new to regex, and I have managed to create a way to check whether a specific word exists inside a string without just being part of another word.

Example: I am looking for the word "banana". banana == true, bananarama == false

This all works fine. However, a problem occurs when I look for two-letter words that contain Swedish letters (Å, Ä, Ö).

Example: I am looking for the word "på" in a string like "på påsk", and the test comes back false. However, if I look for the word "påsk", it comes back true. This is the regex I am using:

const doesWordExist = (s, word) => new RegExp('\\b' + word + '\\b', 'i').test(s);
const stringOfWords = "Färg på plagg";
console.log(doesWordExist(stringOfWords, "på"))
//Expected result: true
//Actual result: false

However, if I change "på" to a three-letter word, it comes back true:

const doesWordExist = (s, word) => new RegExp('\\b' + word + '\\b', 'i').test(s);
const stringOfWords = "Färg pås plagg";
console.log(doesWordExist(stringOfWords, "pås"))
//Expected result: true
//Actual result: true

I have been looking around for answers and have found a few questions about similar issues with Swedish letters, but none of them really deal with matching only the word in its entirety. Could anyone explain what I am doing wrong?


Solution

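  • The word boundary \b depends strictly on the characters matched by \w, which in JavaScript is shorthand for [A-Za-z0-9_]. Because "å" is not in that class, the regex engine sees no boundary between "å" and the following space, so \bpå\b cannot match. "pås" works only because it ends in the ASCII letter "s".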

    To get similar behaviour, you have to re-implement the boundary check yourself, for example with lookarounds:

    const swedishCharClass = '[a-zäöå]';
    const doesWordExist = (s, word) => new RegExp(
        '(?<!' + swedishCharClass + ')' + word + '(?!' + swedishCharClass + ')', 'i'
    ).test(s);
    
    console.log(doesWordExist("Färg på plagg",  "på"));  // true
    console.log(doesWordExist("Färg pås plagg", "pås")); // true
    console.log(doesWordExist("Färg pås plagg", "på"));  // false

    For more complex alphabets, I'd suggest taking a look at "Concrete Javascript Regex for Accented Characters (Diacritics)".
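
As a possible generalisation of the lookaround approach above, here is a minimal sketch that does not hard-code the Swedish letters. It assumes a JavaScript engine with support for Unicode property escapes and lookbehind (ES2018 or later). The names escapeRegExp, wordChar and doesWordExistUnicode are illustrative and not part of the original answer. The sketch also escapes the search word, on the assumption that it should be matched literally rather than as a pattern.

```javascript
// Escape regex metacharacters so the search word is matched literally
// (assumption: callers pass plain words, not regex patterns).
const escapeRegExp = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Illustrative "word character" class: any Unicode letter, any Unicode
// number, or an underscore. Requires the `u` flag on the RegExp.
const wordChar = '[\\p{L}\\p{N}_]';

// Same idea as the answer's lookbehind/lookahead pair, but Unicode-aware.
const doesWordExistUnicode = (s, word) => new RegExp(
    '(?<!' + wordChar + ')' + escapeRegExp(word) + '(?!' + wordChar + ')', 'iu'
).test(s);

// \w does not cover "å", which is why \b misbehaves around it.
console.log(/\w/.test('å'));                                // false

console.log(doesWordExistUnicode("Färg på plagg",  "på"));  // true
console.log(doesWordExistUnicode("Färg pås plagg", "pås")); // true
console.log(doesWordExistUnicode("Färg pås plagg", "på"));  // false
```

The trade-off is that this treats letters from every script as word characters, which may be broader than wanted if only Swedish text is expected. In that case the explicit [a-zäöå] class stays the simpler choice.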