We have a product that requires search and has primarily been English-focused. As such, tokenizing on spaces works relatively well (despite not always being the best idea).
We are now expanding into the Japanese market and have run into a number of complicating factors. Japanese has two key gotchas: 1) wordsCanBeStrungTogetherWithoutSpaces, and 2) Japanese uses different punctuation symbols.
We have a workaround for (1), but having a "word" of a few hundred characters causes complications, so it would be ideal to solve (2). In the strictest sense I am trying to solve for Japanese, but realistically I would like a way to at least split up sentences regardless of alphabet. Is there a regex that is good at splitting based on a Unicode range? Or will it need to be custom, covering every different language?
Quick searching turns up https://unicodelookup.com/#full%20stop/1 and it seems that the various "full stop" characters follow no pattern (as far as I can tell), but there aren't many, and I could build a matcher for those. My concern is that there are edge cases I don't know that I don't know about.
It looks like the Unicode general categories are actually well designed for this. The following regex seems to work fine:
[\p{L}\p{Nd}]+
https://regex101.com/r/YEgUQ3/2
And has a simple explanation:
\p{L} matches any kind of letter from any language
\p{Nd} matches a digit zero through nine in any script except ideographic scripts
Here, "letter" apparently means strictly not punctuation, and ideographic numerals appear to be categorized as letters rather than digits, so they still match via \p{L}.
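As a quick sanity check, here is a minimal sketch of the approach in JavaScript, whose regex engine supports Unicode property escapes with the `u` flag (the `tokenize` name is mine, not from the original):

```javascript
// Tokenize by matching runs of letters and decimal digits from any script.
// Everything else (spaces, ASCII punctuation, ideographic punctuation such
// as 。 and 、) acts as a separator.
const tokenize = (text) => text.match(/[\p{L}\p{Nd}]+/gu) ?? [];

console.log(tokenize("Hello, world. 今日は良い天気です。明日も晴れ。"));
// → [ 'Hello', 'world', '今日は良い天気です', '明日も晴れ' ]
```

Note this splits only on punctuation and whitespace; it does not segment the run of Japanese between full stops into individual words, which would need a dictionary-based segmenter.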