javascript node.js string utf-8 tokenize

Tokenize a sentence into words in UTF-8 with special characters


I am trying to tokenize a sentence into words in JavaScript (Node.js), removing non-word characters (period, comma, question mark, etc.) while keeping numbers and the letters of every script that can appear in UTF-8 text (English, other Latin, Greek, Cyrillic, Japanese, etc.). In other words, I need a way to tell whether a character is part of a word or is a symbol. For example:

españa.es 4*5 Rußland Citroën, 東京 iphone-pro5

should return an array:

[españa, es, 4, 5, Rußland, Citroën, 東京, iphone, pro5]

I'm using the following regular expression:

[0-9A-Za-zªº\u00B5\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02AF\u02B0-\u02C1\u0370-\u0374\u0376-\u0377\u037A-\u037D\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03FF\u0400-\u0481\u048A-\u0523]+

0-9A-Za-z (numbers and English letters)

ªº (ordinal indicators)

\u00B5 (micro sign)

\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02AF (non-English Latin letters)

\u02B0-\u02C1 (modifier letters)

\u0370-\u0374\u0376-\u0377\u037A-\u037D\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03FF (Greek and Coptic alphabets)

\u0400-\u0481\u048A-\u0523 (Cyrillic alphabet)
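As a quick check of how far these ranges go, here is a minimal sketch that runs the expression above against the sample sentence. The variable names are only illustrative. None of the listed ranges covers CJK ideographs, so 東京 is dropped from the result.

```javascript
// The hand-written character class from above, with the global flag added
// so that match() returns every token instead of only the first one.
const handWrittenPattern = /[0-9A-Za-zªº\u00B5\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02AF\u02B0-\u02C1\u0370-\u0374\u0376-\u0377\u037A-\u037D\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03FF\u0400-\u0481\u048A-\u0523]+/g;

const sampleSentence = "españa.es 4*5 Rußland Citroën, 東京 iphone-pro5";

// match() returns null when nothing matches, so fall back to an empty array.
const handWrittenTokens = sampleSentence.match(handWrittenPattern) || [];

console.log(handWrittenTokens);
// → [ 'españa', 'es', '4', '5', 'Rußland', 'Citroën', 'iphone', 'pro5' ]
```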

Is there an easier or more complete way to split a text into words?


Solution

  • It is easy with XRegExp:

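    // Node.js: npm install xregexp. In a browser, load the library instead with
    // <script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>
    var XRegExp = require("xregexp");
    var s = "españa.es 4*5 Rußland Citroën, 東京 iphone-pro5";
    var r = XRegExp("[\\pL\\pN]+", "g"); // one or more Unicode letters or numbers
    var results = XRegExp.match(s, r);   // returns all matches because r is global
    console.log(results);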

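    The [\\pL\\pN]+ regex pattern matches one or more characters that are Unicode letters (\pL) or Unicode numbers (\pN, which covers decimal digits as well as other numeric characters). The backslashes are doubled because the pattern is written as a JavaScript string literal.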
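
    If adding a library is not an option, the same character classes can be written with native Unicode property escapes. The following is a minimal sketch, assuming a runtime that supports the ES2018 `u` flag with `\p{...}` (Node.js 10 or later, or a current browser). The variable names are only illustrative.

```javascript
// Native equivalent of the XRegExp pattern: \p{L} is any Unicode letter,
// \p{N} is any Unicode number. The "u" flag is required for \p{...} to work.
const unicodeWordPattern = /[\p{L}\p{N}]+/gu;

const inputSentence = "españa.es 4*5 Rußland Citroën, 東京 iphone-pro5";

// match() returns null when nothing matches, so fall back to an empty array.
const unicodeTokens = inputSentence.match(unicodeWordPattern) || [];

console.log(unicodeTokens);
// → [ 'españa', 'es', '4', '5', 'Rußland', 'Citroën', '東京', 'iphone', 'pro5' ]
```

    Without the `u` flag, `\p{L}` is not treated as a property escape, so the flag must not be omitted.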