Tags: python, regex, tokenize

Regex split at all punctuations and English character sequences and keep delimiters with zhuyin in Python


How do I tokenise a string containing a fixed set of symbols (zhuyin), punctuation marks, and English characters into zhuyin sequences (space-delimited, but sometimes joined by punctuation or English characters), individual punctuation marks, and English character sequences?

For a string such as

"ㄉㄢˋNCCㄗㄞˋ『ㄅㄠˇ ㄏㄨˋ』ㄍㄜ˙ ㄗ,ㄉㄜ˙「ㄑㄧㄢˊ ㄊㄧˊ」ㄒㄧㄚˋ。"

How do I tokenise it into

['ㄉㄢˋ', 'NCC', 'ㄗㄞˋ', '『', 'ㄅㄠˇ', 'ㄏㄨˋ', '』', 'ㄍㄜ˙', 'ㄗ', ',', 'ㄉㄜ˙', '「', 'ㄑㄧㄢˊ', 'ㄊㄧˊ', '」', 'ㄒㄧㄚˋ', '。']

I'm currently using a list comprehension with a regex pattern like this

[seq for seq in re.split(r"([^\w˙])", input_str) if seq and seq != " "]

but this fails to tokenise English character sequences and produces results like this

['ㄉㄢˋNCCㄗㄞˋ', '『', 'ㄅㄠˇ', 'ㄏㄨˋ', '』', 'ㄍㄜ˙', 'ㄗ', ',', 'ㄉㄜ˙', '「', 'ㄑㄧㄢˊ', 'ㄊㄧˊ', '」', 'ㄒㄧㄚˋ', '。']


Solution

  • You could use the regex module instead of re and alternate between Zhuyin (or Bopomofo), Latin, and punctuation characters. Your current pattern fails because \w matches both zhuyin symbols and ASCII letters, so a run like ㄉㄢˋNCCㄗㄞˋ is never split apart. For example:

    \p{Bopomofo}+[ˋˇ˙ˊ]?|\p{Latin}+|\p{P}
    


    • \p{Bopomofo}+ - One or more Zhuyin (Bopomofo) characters.
    • [ˋˇ˙ˊ]? - An optional tone mark from the given set (the tone marks sit outside the Bopomofo Unicode block, hence the explicit character class).
    • | - Alternation (OR).
    • \p{Latin}+ - One or more Latin characters (this matches NCC).
    • | - Alternation (OR).
    • \p{P} - Any kind of punctuation character (this also matches the fullwidth comma ,).

    import regex  # third-party: pip install regex

    text = 'ㄉㄢˋNCCㄗㄞˋ『ㄅㄠˇ ㄏㄨˋ』ㄍㄜ˙ ㄗ,ㄉㄜ˙「ㄑㄧㄢˊ ㄊㄧˊ」ㄒㄧㄚˋ。'
    # A zhuyin run with an optional tone mark, or a Latin run, or one punctuation mark
    lst = regex.findall(r'\p{Bopomofo}+[ˋˇ˙ˊ]?|\p{Latin}+|\p{P}', text)
    print(lst)
    

    Results in:

    ['ㄉㄢˋ', 'NCC', 'ㄗㄞˋ', '『', 'ㄅㄠˇ', 'ㄏㄨˋ', '』', 'ㄍㄜ˙', 'ㄗ', ',', 'ㄉㄜ˙', '「', 'ㄑㄧㄢˊ', 'ㄊㄧˊ', '」', 'ㄒㄧㄚˋ', '。']
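
    If pulling in the third-party regex module isn't an option, here is a stdlib-only sketch of the same idea using explicit Unicode ranges instead of \p{...} properties. It assumes the zhuyin symbols fall in the Bopomofo block (U+3105–U+312F), and the punctuation character class only lists a handful of common CJK marks, so extend it to cover whatever your real input contains:

    ```python
    import re

    text = 'ㄉㄢˋNCCㄗㄞˋ『ㄅㄠˇ ㄏㄨˋ』ㄍㄜ˙ ㄗ,ㄉㄜ˙「ㄑㄧㄢˊ ㄊㄧˊ」ㄒㄧㄚˋ。'

    # Bopomofo block plus an optional tone mark, or a Latin run,
    # or a single punctuation mark from an explicit (incomplete) set.
    pattern = r'[\u3105-\u312F]+[ˉˊˇˋ˙]?|[A-Za-z]+|[『』「」,。、?!;:]'

    print(re.findall(pattern, text))
    ```

    Spaces are simply never matched, so they drop out without a filtering step, just as with the regex-module version.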