How do I tokenise a string containing a fixed set of symbols (zhuyin), punctuation, and English characters into zhuyin sequences (space-delimited, but sometimes joined by punctuation or English characters), individual punctuation marks, and English character sequences?
For a string such as
"ㄉㄢˋNCCㄗㄞˋ『ㄅㄠˇ ㄏㄨˋ』ㄍㄜ˙ ㄗ,ㄉㄜ˙「ㄑㄧㄢˊ ㄊㄧˊ」ㄒㄧㄚˋ。"
How do I tokenise it into
['ㄉㄢˋ', 'NCC', 'ㄗㄞˋ', '『', 'ㄅㄠˇ', 'ㄏㄨˋ', '』', 'ㄍㄜ˙', 'ㄗ', ',', 'ㄉㄜ˙', '「', 'ㄑㄧㄢˊ', 'ㄊㄧˊ', '」', 'ㄒㄧㄚˋ', '。']
I'm currently using a list comprehension with a regex pattern like this
[seq for seq in re.split("([^\w˙])", input_str) if seq and seq != " "]
but this fails to split off English character sequences and produces results like this
['ㄉㄢˋNCCㄗㄞˋ', '『', 'ㄅㄠˇ', 'ㄏㄨˋ', '』', 'ㄍㄜ˙', 'ㄗ', ',', 'ㄉㄜ˙', '「', 'ㄑㄧㄢˊ', 'ㄊㄧˊ', '」', 'ㄒㄧㄚˋ', '。']
You could use the regex module instead of re and use alternation between Zhuyin (or Bopomofo), Latin and punctuation marks. For example:
\p{Bopomofo}+[ˋˇ˙ˊ]?|\p{Latin}+|\p{P}
\p{Bopomofo}+ - One or more Zhuyin (Bopomofo) characters.
[ˋˇ˙ˊ]? - An optional tone mark from the given set.
| - Alternation (OR).
\p{Latin}+ - One or more Latin characters.
| - Alternation (OR).
\p{P} - Any kind of punctuation character (this captures the fullwidth comma too).

import regex

text = 'ㄉㄢˋNCCㄗㄞˋ『ㄅㄠˇ ㄏㄨˋ』ㄍㄜ˙ ㄗ,ㄉㄜ˙「ㄑㄧㄢˊ ㄊㄧˊ」ㄒㄧㄚˋ。'
lst = regex.findall(r'\p{Bopomofo}+[ˋˇ˙ˊ]?|\p{Latin}+|\p{P}', text)
print(lst)
Results in:
['ㄉㄢˋ', 'NCC', 'ㄗㄞˋ', '『', 'ㄅㄠˇ', 'ㄏㄨˋ', '』', 'ㄍㄜ˙', 'ㄗ', ',', 'ㄉㄜ˙', '「', 'ㄑㄧㄢˊ', 'ㄊㄧˊ', '」', 'ㄒㄧㄚˋ', '。']
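If installing the third-party regex module is not an option, a rough stdlib equivalent is to spell out the basic Bopomofo block (U+3105 to U+312F) by hand. This is a sketch that assumes the input stays within that block (not Bopomofo Extended) and uses [^\w\s] as a stand-in for \p{P}:

```python
import re

# Basic Bopomofo block spelled out as a codepoint range, followed by an
# optional spacing tone mark; [A-Za-z]+ covers the English sequences, and
# [^\w\s] matches any remaining non-word, non-space character (punctuation).
pattern = r'[\u3105-\u312F]+[ˊˇˋ˙]?|[A-Za-z]+|[^\w\s]'

text = 'ㄉㄢˋNCCㄗㄞˋ『ㄅㄠˇ ㄏㄨˋ』ㄍㄜ˙ ㄗ,ㄉㄜ˙「ㄑㄧㄢˊ ㄊㄧˊ」ㄒㄧㄚˋ。'
print(re.findall(pattern, text))
```

Note that [^\w\s] is broader than \p{P} (it also matches symbol characters), which is harmless for input like the above but worth keeping in mind for other data.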