r unicode character-encoding nlp linguistics

Handling count of characters with diacritics in R


I'm trying to count the characters in strings that contain characters with diacritics, but I can't get the right result.

> x <- "n̥ala"
> nchar(x)
[1] 5

What I want to get is 4, since n̥ should be considered one character (i.e. diacritics shouldn't be counted as characters on their own, even when more than one diacritic is stacked on a base character).

How can I get this kind of result?


Solution

  • Here is my solution. The idea is that the symbols of phonetic alphabets have a Unicode representation, so:

    Use the Unicode package; it provides the function Unicode_alphabetic_tokenizer, whose documentation says:

    Tokenization first replaces the elements of x by their Unicode character sequences. Then, the non-alphabetic characters (i.e., the ones which do not have the Alphabetic property) are replaced by blanks, and the corresponding strings are split according to the blanks.

    After this I applied nchar, but because the tokenizer splits the string into two substrings at the diacritic, I summed the results.

    sum(nchar(Unicode_alphabetic_tokenizer(x)))
    [1] 4
    

    I believe this package can be very useful in such cases, but I am not an expert and I do not know whether my solution works for all problems that involve phonetic alphabets. More examples would help to check its validity.
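
    One way to cross-check the result without the Unicode package is to strip the combining marks in base R and count what is left. This is only a sketch: count_base_chars is a hypothetical helper name, the strings are written with escapes on the assumption that the diacritics are U+0325 and U+032F, and removing everything in Unicode category M may not be appropriate for scripts where marks are counted as letters.

    ```r
    # count_base_chars is a hypothetical helper, not part of any package.
    count_base_chars <- function(x) {
      # Drop every combining mark (Unicode general category M) ...
      stripped <- gsub("\\p{M}", "", x, perl = TRUE)
      # ... and count the code points that remain.
      nchar(stripped, type = "chars")
    }

    x <- "n\u0325ala"            # n + COMBINING RING BELOW + a + l + a
    nchar(x)                     # 5: the diacritic is counted on its own
    count_base_chars(x)          # 4

    y <- "a\u0303\u0301"         # one base letter with two stacked diacritics
    count_base_chars(y)          # 1
    ```

    Unlike the tokenizer approach, this keeps spaces and other non-alphabetic characters in the count, so the two methods can disagree on strings that contain them.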

    It also works well on another example:

    > x <- "e̯ ʊ̯"
    > x
    [1] "e̯ ʊ̯"
    > nchar(x)
    [1] 5
    > sum(nchar(Unicode_alphabetic_tokenizer(x)))
    [1] 2
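
    The result is 2 rather than 3 because the space is not alphabetic and is dropped along with the diacritics. If spaces and other non-letters should count, one alternative is to count grapheme clusters (user-perceived characters) instead. The sketch below assumes the stringi package is installed; it is not part of the original answer, and the string is written with escapes on the assumption that the diacritic is U+032F.

    ```r
    library(stringi)  # assumption: stringi is installed

    x <- "e\u032F \u028A\u032F"   # e + mark, space, ʊ + mark

    # Code points: every combining mark is counted separately.
    stri_length(x)                                # 5

    # Grapheme clusters: a base character plus its marks counts once,
    # and the space is a cluster of its own.
    stri_count_boundaries(x, type = "character")  # 3

    # The string from the question gives the expected 4.
    stri_count_boundaries("n\u0325ala", type = "character")
    ```

    Which count is the right one depends on whether whitespace and punctuation should be treated as characters in the analysis.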
    

    P.S. There is only one closing " in the code, but a second one appears when I copy and paste it. I do not know why this happens.