
Can a negated regex in Notepad++ match emoji or other Unicode characters at U+10000 and above, outside the Basic Multilingual Plane (BMP)?


It seems the regex engine used by Notepad++ can't do what I thought it would. Maybe a more general problem, not only with negation syntax.

Example with code point U+1F3B5, Unicode name "MUSICAL NOTE":

Successful regex WITHOUT negation that is already matching what I want:

videoPrimaryInfoRenderer":{"title":{"runs":\[{"text":"\K.+?(?=")

Example text that includes something I want to match:

}},"contents":{"twoColumnWatchNextResults":{"results":{"results":{"contents":[{"videoPrimaryInfoRenderer":{"title":{"runs":[{"text":"【みんなのリズム会場】ノリノリリズムパーティーはこちらです!🎵【天国】"}]},"viewCount":{"videoViewCountRenderer":

The part that I want to match (the regex above gets this):

【みんなのリズム会場】ノリノリリズムパーティーはこちらです!🎵【天国】

includes the 🎵 emoji. So the "dot" DOES match characters at U+10000 and above in the last part of my regex:

.+?

and then the lookahead

(?=")

ends the match before the first ".


In each example below, the negated pattern is placed AFTER this prefix:

videoPrimaryInfoRenderer":{"title":{"runs":\[{"text":"\K

where \K "forgets" everything matched so far, so that only what I put at the end of the regex gets selected.

Example regex with negation I tried:

Example 1:

[^"]+

Since there is no " in the part I want to match, I expected the whole title, but it matched only:

【みんなのリズム会場】ノリノリリズムパーティーはこちらです!

Example 2:

((?!").)+

matched the same as example 1. Not surprising, since it's the same idea of excluding " but with negative lookahead.

Both types of "match OTHER THAN specified character" stop before the emoji.

Notepad++ v8.6.5 (32-bit)

I would appreciate an explanation.


Solution

  • TLDR:

    You may use

    videoPrimaryInfoRenderer":{"title":{"runs":\[{"text":"\K(?:(?!").[\x{DC00}-\x{DFFF}]?)+
    

    You can refer to the "Regexp fails to match UTF-8 characters" Notepad++ Community post:

    Unfortunately, in character classes like you mentioned, that means that the characters outside the BMP (at U+10000 and above), while they can be found by ^.+, cannot be found by something that seems equivalent, like ^[\s\S]+

    Problems arise when searching Unicode characters which are over the Basic Multilingual plane ( BMP ) which have a code-point between \x{10000} and \x{10FFFF} ( so over \x{FFFF} )

    For instance, as the code-point of the emoticon 🤣 is over \x{FFFF} :

    • It cannot be represented with its real regex syntax \x{1F923}, due to a bug of the present Boost regex engine, which does not handle all characters in true 32-bit encoding, but only in UTF-16 encoding :-(( So, searching for \x{1F923} results in the error message Find: Invalid regular expression
    • Moreover, the simple regex dot symbol (?-s). cannot, on its own, match a whole character with Unicode code-point > \x{FFFF} either !
    • Of course if you paste your character, directly, in the Find what: zone, it does find all occurrences of the ROLLING ON THE FLOOR LAUGHING character !

    Luckily, because our Boost regex engine handles characters in UTF-16, all characters with a code-point over \x{FFFF} can still be expressed, thanks to the surrogates mechanism. Refer to the general description below :

    https://en.wikipedia.org/wiki/UTF-16

    In short, the surrogate pair of a character, with Unicode code-point in range from \x{10000} till \x{10FFFF}, can be described by the regex :

    \x{hhhh}\x{iiii} where D800 ≤ hhhh ≤ DBFF and DC00 ≤ iiii ≤ DFFF

    So if a regex, involves the surrogates pair ( two 16-bit units ) of a character, which is over the BMP, our regex engine is able to match it. For instance, as the surrogates pair of the character ROLLING ON THE FLOOR LAUGHING is D83E DD23, the regex \x{D83E}\x{DD23} does find all occurrences of your emoticon character !
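
The arithmetic behind a surrogate pair is short enough to sketch. The Python helper below is only an illustration: the names `surrogate_pair` and `npp_surrogate_regex` are made up here and are not part of Notepad++ or Boost. It applies the standard UTF-16 formula and formats the result in the \x{hhhh}\x{iiii} form described above.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Return the (high, low) UTF-16 surrogates for a code point above the BMP."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high, low


def npp_surrogate_regex(char: str) -> str:
    """Format one non-BMP character as \\x{hhhh}\\x{iiii} (hypothetical helper)."""
    high, low = surrogate_pair(ord(char))
    return f"\\x{{{high:04X}}}\\x{{{low:04X}}}"


if __name__ == "__main__":
    print(npp_surrogate_regex("\U0001F923"))  # \x{D83E}\x{DD23}
    print(npp_surrogate_regex("\U0001F3B5"))  # \x{D83C}\x{DFB5}
```

The second line of output is the pair for the 🎵 character from the question, so \x{D83C}\x{DFB5} should be usable in the Find what: field in the same way as the \x{D83E}\x{DD23} example.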

    and recently I proposed a Notepad++ macro which replaces any selection of the \x{hhhhh} syntaxes with their surrogate pair equivalents \x{Dhhh}\x{Diii} ! See below :
    https://community.notepad-plus-plus.org/post/57528
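
If you would rather prepare the pattern outside Notepad++, the same kind of rewrite can be approximated in a few lines of Python. This is a rough sketch, not the macro from the linked post: the function name `expand_astral_escapes` is hypothetical, and it assumes the escapes to rewrite are written as \x{...} with five or six hex digits. It lets Python's own UTF-16 codec produce the two surrogates.

```python
import re
import struct

# Matches \x{hhhhh} or \x{hhhhhh}: escapes long enough to be above the BMP.
_ASTRAL_ESCAPE = re.compile(r"\\x\{([0-9A-Fa-f]{5,6})\}")


def expand_astral_escapes(pattern: str) -> str:
    """Rewrite each \\x{hhhhh} escape as its \\x{high}\\x{low} surrogate pair."""

    def repl(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        if not 0x10000 <= code_point <= 0x10FFFF:
            return match.group(0)  # leave BMP or out-of-range escapes alone
        # Encoding to UTF-16 (big-endian) yields exactly the two surrogates.
        high, low = struct.unpack(">HH", chr(code_point).encode("utf-16-be"))
        return f"\\x{{{high:04X}}}\\x{{{low:04X}}}"

    return _ASTRAL_ESCAPE.sub(repl, pattern)


if __name__ == "__main__":
    print(expand_astral_escapes(r"a\x{1F923}b"))  # a\x{D83E}\x{DD23}b
```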

    The summary:

    In summary, because of the use of UTF-16, instead of UTF-32, by the present implementation of the Boost Regex library, within N++ :

    • Use the simple regex (?-s). to match any standard character, from \x{0000} to \x{FFFF} ( so not including the EOL chars nor the Form Feed char \x0c )

    • IMPORTANT : From the surrogates mechanism, explained above, one may think that the regex [\x{D800}-\x{DBFF}][\x{DC00}-\x{DFFF}] should find all the characters with Unicode code-point over \x{FFFF}. Unfortunately, this syntax does not work !? So, we need to use these derived regexes :

    • (?-s).[\x{DC00}-\x{DFFF}] to match any standard character from \x{10000} to \x{10FFFF}

    • (?-s).[\x{DC00}-\x{DFFF}]? to match all standard characters, from \x{0000} to \x{10FFFF}

    And :

    • To match a specific character of the BMP, from \x{0000} to \x{FFFF}, use the regex syntax \x{hhhh}, with four hexadecimal digits

    • To match a specific character over the BMP, from \x{10000} to \x{10FFFF}, use the equivalent high and low surrogate pair, with the regex syntax \x{<high>}\x{<low>}, replacing <high> and <low> with their exact values, each written with 4 hexadecimal digits
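
To see why the derived regex (?-s).[\x{DC00}-\x{DFFF}]? lines up with whole characters, it can help to look at the text the way a UTF-16 based engine would. The Python sketch below is only a model under that assumption, not Boost's actual code, and the function names are invented for the example. It splits a string into 16-bit code units and then groups "one unit, optionally followed by a low surrogate", which is the shape of that regex (the exclusion of EOL characters by (?-s). is ignored here).

```python
import struct


def utf16_units(text: str) -> list[int]:
    """Return the 16-bit UTF-16 code units of text."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def group_characters(text: str) -> list[list[int]]:
    """Group code units as: any unit, plus the next unit if it is a low surrogate."""
    units = utf16_units(text)
    groups = []
    i = 0
    while i < len(units):
        group = [units[i]]
        # A following unit in DC00..DFFF completes a surrogate pair.
        if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            group.append(units[i + 1])
        groups.append(group)
        i += len(group)
    return groups


if __name__ == "__main__":
    for group in group_characters("a\U0001F3B5b"):
        print(" ".join(f"{unit:04X}" for unit in group))
```

For the string "a🎵b" this yields three groups, with the emoji occupying the two units D83C DFB5. Under this model, the TLDR pattern (?:(?!").[\x{DC00}-\x{DFFF}]?)+ takes the emoji as a single step of the loop, because the optional low-surrogate class picks up its second unit.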