How do I tokenise a string containing a fixed set of symbols (zhuyin), punctuation, and English characters into zhuyin sequences (space-delimited, but sometimes joined by punctuation or English characters), individual punctuation marks, and English character sequences?
For a string such as
"ㄉㄢˋNCCㄗㄞˋ『ㄅㄠˇ ㄏㄨˋ』ㄍㄜ˙ ㄗ,ㄉㄜ˙「ㄑㄧㄢˊ ㄊㄧˊ」ㄒㄧㄚˋ。"
How do I tokenise it into
['ㄉㄢˋ', 'NCC', 'ㄗㄞˋ', '『', 'ㄅㄠˇ', 'ㄏㄨˋ', '』', 'ㄍㄜ˙', 'ㄗ', ',', 'ㄉㄜ˙', '「', 'ㄑㄧㄢˊ', 'ㄊㄧˊ', '」', 'ㄒㄧㄚˋ', '。']
I'm currently using a list comprehension with a regex pattern like this
[seq for seq in re.split("([^\w˙])", input_str) if seq and seq != " "]
but this fails to split off English character sequences and produces results like this
['ㄉㄢˋNCCㄗㄞˋ', '『', 'ㄅㄠˇ', 'ㄏㄨˋ', '』', 'ㄍㄜ˙', 'ㄗ', ',', 'ㄉㄜ˙', '「', 'ㄑㄧㄢˊ', 'ㄊㄧˊ', '」', 'ㄒㄧㄚˋ', '。']
You could use the regex module instead of re and use alternation between Zhuyin (or Bopomofo), Latin and punctuation marks. For example:
\p{Bopomofo}+[ˋˇ˙ˊ]?|\p{Latin}+|\p{P}
\p{Bopomofo}+ - One or more Zhuyin (Bopomofo) characters.
[ˋˇ˙ˊ]? - An optional tone mark from the given set.
| - Alternation (OR).
\p{Latin}+ - One or more Latin characters.
| - Alternation (OR).
\p{P} - Any kind of punctuation character (this captures the fullwidth comma too).

import regex

text = 'ㄉㄢˋNCCㄗㄞˋ『ㄅㄠˇ ㄏㄨˋ』ㄍㄜ˙ ㄗ,ㄉㄜ˙「ㄑㄧㄢˊ ㄊㄧˊ」ㄒㄧㄚˋ。'
lst = regex.findall(r'\p{Bopomofo}+[ˋˇ˙ˊ]?|\p{Latin}+|\p{P}', text)
print(lst)
Results in:
['ㄉㄢˋ', 'NCC', 'ㄗㄞˋ', '『', 'ㄅㄠˇ', 'ㄏㄨˋ', '』', 'ㄍㄜ˙', 'ㄗ', ',', 'ㄉㄜ˙', '「', 'ㄑㄧㄢˊ', 'ㄊㄧˊ', '」', 'ㄒㄧㄚˋ', '。']
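If installing the third-party regex module is not an option, a rough stdlib equivalent is to spell out the basic Bopomofo block (U+3105 to U+312F) by hand. This is a sketch that assumes the input stays within that block (not Bopomofo Extended) and uses [^\w\s] as a stand-in for \p{P}:

```python
import re

# Basic Bopomofo block spelled out as a codepoint range, followed by an
# optional spacing tone mark; [A-Za-z]+ covers the English sequences, and
# [^\w\s] matches any remaining non-word, non-space character (punctuation).
pattern = r'[\u3105-\u312F]+[ˊˇˋ˙]?|[A-Za-z]+|[^\w\s]'

text = 'ㄉㄢˋNCCㄗㄞˋ『ㄅㄠˇ ㄏㄨˋ』ㄍㄜ˙ ㄗ,ㄉㄜ˙「ㄑㄧㄢˊ ㄊㄧˊ」ㄒㄧㄚˋ。'
print(re.findall(pattern, text))
```

Note that [^\w\s] is broader than \p{P} (it also matches symbol characters), which is harmless for input like the above but worth keeping in mind for other data.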