Search code examples
regexunicodeword-boundary

Regex word-break with unicode diacritics


I am working on an application that searches text using regular expressions based on input from a user. One option the user has is to include a "Match 0 or more characters" wildcard using the asterisk. I need this to only match between word boundaries. My first attempt was to convert all asterisks to (?:(?=\B).)*, which works fine for most cases. Where it fails is that apparently .Net considers the position between a unicode character with a diacritic and another character a word-break. I consider this a bug, and have submitted it to the Microsoft feedback site.

In the meantime, however, I need to get the functionality implemented and product shipped. I am considering using [\p{L}\p{M}\p{N}\p{Pc}]* as the replacement text, but, frankly, am in "I don't really understand what this is going to do" land. I mean, I can read the specifications, but am not confident that I could sufficiently test this to make sure it is doing what I expect. I simply wouldn't know all the boundary conditions to test. The application is used by cross-cultural workers, many of whom are in tribal locations, so any and all writing systems need to be supported, including some that use zero-width word breaks.

Does anyone have a more elegant solution, or could confirm/correct the code above, or offer some pointers?

Thanks for your help.


Solution

  • The equivalent of /(?:(?=\B).)*/ in a unicode context would be:

    /
    (?:
      (?: (?<=[\p{L}\p{M}\p{N}\p{Pc}]) (?=[\p{L}\p{M}\p{N}\p{Pc}])
      |   (?<![\p{L}\p{M}\p{N}\p{Pc}]) (?![\p{L}\p{M}\p{N}\p{Pc}])
      )
      .
    )*
    /
    

    ...or somewhat simplified:

    /(?:[\p{L}\p{M}\p{N}\p{Pc}]+|[^\p{L}\p{M}\p{N}\p{Pc}]+)?/
    

    This would match either a word or a non-word (spacing, punctuation etc.) sequence, possibly an empty one.

    A normal or negated word-boundary (\b or \B) is basically a double look-around. One looking behind, making sure of the type of character that precedes the current position. Similarly one looking ahead.

    In the second regex, I removed the look-arounds and used simple character classes instead.