How can I detect Chinese punctuation characters such as ~ , 。 in Go?
I tried the range tables in the unicode package, as in the code below, but unicode.Han doesn't include those punctuation characters.
Which range table should I use for this task? (Please refrain from using regex, because its performance is poor.)
    for _, r := range strToDetect {
        if unicode.Is(unicode.Han, r) {
            return true
        }
    }
Punctuation marks are scattered across different Unicode blocks.
The Unicode® Standard, Version 14.0 – Core Specification
Chapter 6: Writing Systems and Punctuation
https://www.unicode.org/versions/latest/ch06.pdf
Punctuation. The rest of this chapter deals with a special case: punctuation marks, which tend to be scattered about in different blocks and which may be used in common by many scripts. Punctuation characters occur in several widely separated places in the blocks, including Basic Latin, Latin-1 Supplement, General Punctuation, Supplemental Punctuation, and CJK Symbols and Punctuation. There are also occasional punctuation characters in blocks for specific scripts.
Here are two of your examples:
〜 Wave Dash U+301C
。 Ideographic Full Stop U+3002
    package main

    import (
        "fmt"
        "unicode"
    )

    func main() {
        // CJK Symbols and Punctuation Unicode block
        for r := rune('\u3000'); r <= '\u303F'; r++ {
            if unicode.IsPunct(r) {
                fmt.Printf("%[1]U\t%[1]c\n", r)
            }
        }
    }
https://go.dev/play/p/WoJjM6JKTYR
U+3001 、
U+3002 。
U+3003 〃
U+3008 〈
U+3009 〉
U+300A 《
U+300B 》
U+300C 「
U+300D 」
U+300E 『
U+300F 』
U+3010 【
U+3011 】
U+3014 〔
U+3015 〕
U+3016 〖
U+3017 〗
U+3018 〘
U+3019 〙
U+301A 〚
U+301B 〛
U+301C 〜
U+301D 〝
U+301E 〞
U+301F 〟
U+3030 〰
U+303D 〽
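Putting this together for the original task, one possible approach (a sketch, not the only way) is to combine unicode.IsPunct with a custom range table that restricts matches to the blocks you care about. The names cjkPunctBlocks and containsCJKPunct are hypothetical, and the choice of blocks is an assumption: CJK Symbols and Punctuation (U+3000 to U+303F) plus Halfwidth and Fullwidth Forms (U+FF00 to U+FFEF), since the fullwidth comma lives in the latter. Adjust the ranges to suit your input.

```go
package main

import (
	"fmt"
	"unicode"
)

// cjkPunctBlocks is a hypothetical range table covering two Unicode blocks.
// Entries in R16 must be sorted and non-overlapping, with a non-zero Stride.
var cjkPunctBlocks = &unicode.RangeTable{
	R16: []unicode.Range16{
		{Lo: 0x3000, Hi: 0x303F, Stride: 1}, // CJK Symbols and Punctuation
		{Lo: 0xFF00, Hi: 0xFFEF, Stride: 1}, // Halfwidth and Fullwidth Forms
	},
}

// containsCJKPunct reports whether s contains at least one rune that is
// both inside the blocks above and classified as punctuation (category P).
func containsCJKPunct(s string) bool {
	for _, r := range s {
		if unicode.Is(cjkPunctBlocks, r) && unicode.IsPunct(r) {
			return true
		}
	}
	return false
}

func main() {
	inputs := []string{
		"你好\u3002",       // ideographic full stop
		"你好\uFF0C世界",    // fullwidth comma
		"\u301C",         // wave dash
		"hello, world",   // ASCII comma only
		"你好",             // Han characters only
	}
	for _, in := range inputs {
		fmt.Printf("%q -> %t\n", in, containsCJKPunct(in))
	}
}
```

The first three inputs report true and the last two report false. One caveat: the fullwidth tilde U+FF5E is classified as a math symbol (category Sm), not punctuation, so unicode.IsPunct rejects it. If you need it, unicode.IsSymbol or an explicit comparison against that rune would have to be added.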