Get Chinese punctuation in a string


With the answers in

I have gotten close to what I need: getting all the Chinese punctuation in the string.

Intl.Segmenter also works much better than String.prototype.split(" ").

There is one problem, though: /\p{P}/u.test(segment.segment) matches all punctuation, not just Chinese punctuation, so I also get English punctuation such as the apostrophe, comma, question mark and period.

I hope I do not need to resort to the answer in Chinese punctuation Unicode range?. It is too complicated. According to this wiki about Chinese punctuation, there are only about 20 such marks.

So is there an easy way to do that?

const str = "你好,让我们试试这个分词效果,你说怎么样?Let's try Intl.Segmenter, should we ?"
let segmenterZH = new Intl.Segmenter('zh', { granularity: 'grapheme' })
let segments = segmenterZH.segment(str)
for (let segment of segments) {
  if (/\p{P}/u.test(segment.segment)) {
    console.log(`${segment.index}:${segment.segment}`)
  }
}

--- update ---

I would like to add some new findings, partially inspired by Use regular expression to match characters appearing in Traditional Chinese ONLY:

  1. If I want to get all the Chinese characters without punctuation, I can use /\p{sc=Han}/u, as https://javascript.info/regexp-unicode says.

  2. I also tried what /\p{scx=Han}/u matches, as Script_Extensions explains, but I only got 2 more Chinese punctuation marks, 《 》, and it missed the other Chinese punctuation marks.

@WiktorStribiżew's answer may explain this: the two marks 《 》 fall in the CJK Symbols and Punctuation range, while the other Chinese punctuation marks fall in the Halfwidth and Fullwidth Forms range. I still think this is a bug in /\p{scx=Han}/u.
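The difference between the two properties in the update can be checked with a small sketch. The sample characters and variable names below are illustrative assumptions, not part of the original question; the fullwidth comma is included only so its result can be inspected, since that result depends on the Unicode data shipped with the engine.

```javascript
// Illustrative samples; escapes keep the exact code points unambiguous.
const samples = [
  ["\u4F60", "ideograph (U+4F60)"],
  ["\u300A", "left double angle bracket (U+300A)"],
  ["\uFF0C", "fullwidth comma (U+FF0C)"],
];

// sc = Script: the single script a character is assigned to.
const byScript = /\p{sc=Han}/u;
// scx = Script_Extensions: every script a character is used with.
const byScriptExtensions = /\p{scx=Han}/u;

for (const [ch, label] of samples) {
  console.log(
    `${label}: sc=Han ${byScript.test(ch)}, scx=Han ${byScriptExtensions.test(ch)}`
  );
}
```

The ideograph matches both patterns, while 《 has Script=Common and is matched only through Script_Extensions, which is consistent with the finding above.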


Solution

  • If you want to match punctuation proper that belongs to either the CJK Symbols and Punctuation block or the Halfwidth and Fullwidth Forms block, you can use

    /\p{P}(?<=[\u3000-\u303F\uFF00-\uFFEF])/u
    

    where

    • \p{P} - matches any punctuation proper char (i.e. it does not match math symbols like + or =, etc.)
    • (?<=[\u3000-\u303F\uFF00-\uFFEF]) - a positive lookbehind that requires the char matched by \p{P} to fall in either the \u3000-\u303F (CJK Symbols and Punctuation) or \uFF00-\uFFEF (Halfwidth and Fullwidth Forms) range.

    See a JavaScript demo below:

    const str = "你好,让我们试试这个分词效果,你说怎么样?Let's try Intl.Segmenter, should we ?"
    let segmenterZH = new Intl.Segmenter('zh', { granularity: 'grapheme' })
    let segments = segmenterZH.segment(str)
    for (let segment of segments) {
      if (/\p{P}(?<=[\u3000-\u303F\uFF00-\uFFEF])/u.test(segment.segment)) {
        console.log(`${segment.index}:${segment.segment}`)
      }
    }

    Output:

    2:,
    14:,
    20:?
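To see why both halves of the pattern are needed, here is a minimal sketch that tests a few hand-picked characters against each part separately. The sample list and variable names are assumptions chosen for illustration: the two ranges also contain non-punctuation characters such as fullwidth letters and math symbols, and \p{P} alone also matches ASCII punctuation.

```javascript
const punctuation = /\p{P}/u;
const cjkBlocks = /[\u3000-\u303F\uFF00-\uFFEF]/u;
const cjkPunctuation = /\p{P}(?<=[\u3000-\u303F\uFF00-\uFFEF])/u;

// Illustrative samples, written as escapes where the code point matters.
const samples = [
  ["\uFF0C", "fullwidth comma"],     // punctuation, inside the ranges
  [",", "ASCII comma"],              // punctuation, outside the ranges
  ["\uFF0B", "fullwidth plus sign"], // math symbol, inside the ranges
  ["\uFF21", "fullwidth letter A"],  // letter, inside the ranges
];

for (const [ch, label] of samples) {
  console.log(
    `${label}: \\p{P} ${punctuation.test(ch)}, in range ${cjkBlocks.test(ch)}, combined ${cjkPunctuation.test(ch)}`
  );
}
```

Only the fullwidth comma passes the combined test; each of the other samples fails one of the two conditions.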
    

    v flag support scenario

    If your JavaScript environment supports the v flag, you can use a character class intersection:

    const str = "你好,让我们试试这个分词效果,你说怎么样?Let's try Intl.Segmenter, should we ?";
    for (let m of str.matchAll(/[\p{P}&&[\u3000-\u303F\uFF00-\uFFEF]]/gv)) {
      console.log(`${m.index}:${m[0]}`)
    }
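If the same code has to run both where the v flag is available and where it is not, the two patterns above could be combined behind a small helper. This is only a sketch: findCjkPunctuation is a hypothetical name, and the fallback assumes the engine at least supports the u flag and lookbehind.

```javascript
// Hypothetical helper, not part of the original answer.
function findCjkPunctuation(text) {
  let pattern;
  try {
    // The RegExp constructor throws a SyntaxError for an unsupported flag.
    pattern = new RegExp("[\\p{P}&&[\\u3000-\\u303F\\uFF00-\\uFFEF]]", "gv");
  } catch (e) {
    // Fallback: the lookbehind pattern with the u flag.
    pattern = new RegExp("\\p{P}(?<=[\\u3000-\\u303F\\uFF00-\\uFFEF])", "gu");
  }
  return Array.from(text.matchAll(pattern), (m) => ({ index: m.index, char: m[0] }));
}

// Same sentence as above; the fullwidth marks are written as escapes
// (\uFF0C is the fullwidth comma, \uFF1F the fullwidth question mark).
const sample =
  "你好\uFF0C让我们试试这个分词效果\uFF0C你说怎么样\uFF1FLet's try Intl.Segmenter, should we ?";

for (const { index, char } of findCjkPunctuation(sample)) {
  console.log(`${index}:${char}`);
}
```

Both branches should report the same three matches, at indices 2, 14 and 20, because the intersection and the lookbehind express the same condition.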