Tags: python, string, unicode, utf-8, character

Getting the last character from a string that may or may not be Unicode


I'm parsing a file that contains both plain ASCII strings and Unicode (UTF-8) strings containing IPA pronunciations.

I want to obtain the last character of a string, but sometimes that character occupies two code points, e.g.

syl = 'tyl'  # plain ascii
last_char = syl[-1]
# last char is 'l'

syl = 'tl̩'  # contains IPA char
last_char = syl[-1]
# last char erroneously contains: '̩' which is a diacritical mark on the l
# want the whole character 'l̩'

If I try using .decode(), it fails with:

'str' object has no attribute 'decode'
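For context, that error suggests Python 3, where a str is already a sequence of Unicode code points, so there is nothing left to decode. The "two spaces" are two code points: a base letter followed by a combining mark. A minimal sketch for inspecting them with the standard-library unicodedata module (the helper name describe_code_points is made up for illustration):

```python
import unicodedata

def describe_code_points(s):
    """Return (code point, Unicode name, combining class) for each element of s."""
    return [
        (f'U+{ord(ch):04X}', unicodedata.name(ch, '<unnamed>'), unicodedata.combining(ch))
        for ch in s
    ]

# 't', 'l', then the diacritic U+0329 written as an escape for clarity
for row in describe_code_points('tl\u0329'):
    print(row)
```

A non-zero combining class in the last column marks a code point that attaches to the character before it, which is why syl[-1] returns only the diacritic.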

How can I obtain the last character of a Unicode/UTF-8 string when I don't know whether it is an ASCII or a Unicode string?

I guess I could use a lookup table of known characters and, if the lookup fails, go back and grab syl[-2:]. Is there an easier way?


In response to some comments, here is the complete list of IPA characters I've collected so far:

a, b, d, e, f, f̩, g, h, i, i̩, i̬,
j, k, l, l̩, m, n, n̩, o, p, r, s,
s̩, t, t̩, t̬, u, v, w, x, z, æ, ð,
ŋ, ɑ, ɑ̃, ɒ, ɔ, ə, ɚ, ɛ, ɜ, ɜ˞, ɝ,
ɡ, ɪ, ɵ, ɹ, ɾ, ʃ, ʃ̩, ʊ, ʌ, ʒ, ʤ,
θ, ∅

Solution

  • Here's a solution that works, though it includes a hack to handle the rhotic hook:

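    import regex  # third-party package (pip install regex)

    def get_last_character_and_length(s):
      # Each match: one base character, then any combining diacritics (U+0300-U+036F)
      # or spacing modifier letters (U+02B0-U+02FF), then an optional rhotic hook.
      matches = regex.findall(r'[\w\W][\u0300-\u036f\u02B0-\u02FF]*˞?', s)
      last_character = matches[-1] if matches else None
      return last_character, len(last_character) if last_character else 0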
    

    Examples:

        syl1 = 'tyl'  # plain ascii
        c, c_l = get_last_character_and_length(syl1)
        assert(c == 'l')
        assert(c_l == 1)
    
        syl2 = 'tl̩'  # contains IPA
        c, c_l = get_last_character_and_length(syl2)
        assert(c == 'l̩')
        assert(c_l == 2)
    
        syl3 = 'stɜ˞' # contains rhotic hook
        c, c_l = get_last_character_and_length(syl3)
        assert(c == 'ɜ˞')
        assert(c_l == 2)
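
    If installing a third-party package is not an option, the same idea can be sketched with only the standard library: walk backwards from the end of the string while the current code point is a combining mark or a spacing modifier letter. This is a rough alternative rather than the method above. The function name last_grapheme is made up for illustration, and the U+02B0-U+02FF range check is the same assumption the regex makes so that the rhotic hook stays attached to its vowel.

```python
import unicodedata

def last_grapheme(s):
    """Return the last base character of s together with any trailing marks."""
    if not s:
        return ''
    i = len(s) - 1
    # Step back over combining marks (non-zero combining class, e.g. U+0329)
    # and spacing modifier letters (U+02B0-U+02FF, e.g. the rhotic hook U+02DE).
    while i > 0 and (unicodedata.combining(s[i]) or '\u02b0' <= s[i] <= '\u02ff'):
        i -= 1
    return s[i:]

# The three cases from the examples, with the marks written as escapes
assert last_grapheme('tyl') == 'l'
assert last_grapheme('tl\u0329') == 'l\u0329'
assert last_grapheme('st\u025c\u02de') == '\u025c\u02de'
```

    The length is then just len(last_grapheme(s)). Like the regex, this treats every code point in the modifier-letter block as attaching to the preceding character, which suits the IPA list above but may be too broad for other text.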