regex nlp tokenize

A stable regular expression or simple library for multi-lingual tokenization?


We have a product that requires search and has so far been primarily English-focused. As such, tokenizing on spaces works relatively well (despite not always being the best idea).

We recently started expanding into the Japanese market and have found a number of complicating factors. Japanese has two key gotchas: 1) wordsCanBeStrungTogetherWithoutSpaces, and 2) Japanese uses different punctuation symbols.

We have a workaround for (1), but a "word" that is a few hundred characters long causes some complications, so it would be ideal to solve (2). Strictly speaking I am trying to solve this for Japanese, but realistically I would like a way to at least split up sentences regardless of script. Is there a regex that is good for splitting based on a Unicode range? Or will it need to be custom and cover every language individually?

A quick search turns up https://unicodelookup.com/#full%20stop/1. The various "full stop" characters seem to follow no pattern (as far as I can tell), but there aren't many of them, and I could build a pattern to match those. My concern is that there are edge cases I am not aware of.


Solution

  • It looks like the Unicode general categories are actually well designed for this. The following regex seems to work fine:

    [\p{L}\p{Nd}]+ https://regex101.com/r/YEgUQ3/2

    It has a simple explanation:

    \p{L} matches any kind of letter from any language
    \p{Nd} matches a digit zero through nine in any script except ideographic scripts
    

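    Here "letter" apparently means strictly not punctuation, so Japanese full stops and commas act as separators. Ideographic numerals seem to be classified as letters, so they are simply matched as part of words.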
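
As a rough sketch of the same idea in Python: the standard `re` module has not historically supported `\p{...}` classes, so this version checks each character's general category with `unicodedata` instead. The function names `is_token_char` and `tokenize` are made up for illustration and are not part of the original answer.

```python
import unicodedata
from itertools import groupby
from typing import List


def is_token_char(ch: str) -> bool:
    """True for characters that [\\p{L}\\p{Nd}] would match.

    Assumption: "letter" means any general category starting with "L"
    (Lu, Ll, Lt, Lm, Lo) and "digit" means exactly category "Nd".
    """
    category = unicodedata.category(ch)
    return category.startswith("L") or category == "Nd"


def tokenize(text: str) -> List[str]:
    """Split text into maximal runs of letters and decimal digits."""
    return [
        "".join(run)
        for keep, run in groupby(text, key=is_token_char)
        if keep  # drop the runs of punctuation, spaces, symbols, etc.
    ]


print(tokenize("Hello, world! 123"))
# ['Hello', 'world', '123']
print(tokenize("これはペンです。それは本です。"))
# ['これはペンです', 'それは本です']
```

One limitation to keep in mind: combining marks belong to the `M` categories rather than `L`, so text that relies on them (decomposed accents, or scripts such as Devanagari and Thai) would be split at those marks unless `\p{M}` is also allowed.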