I'm trying to get the number of characters in strings that contain characters with diacritics, but I can't manage to get the right result.
> x <- "n̥ala"
> nchar(x)
[1] 5
What I want to get is 4, since n̥ should be considered one character (i.e. diacritics shouldn't be considered characters on their own, even with more than one diacritic stacked on a base character).
How can I get this kind of result?
Here is my solution. The idea is that phonetic alphabets have a Unicode representation, so you can use the Unicode package. It provides the function Unicode_alphabetic_tokenizer, whose documentation says:
Tokenization first replaces the elements of x by their Unicode character sequences. Then, the non-alphabetic characters (i.e., the ones which do not have the Alphabetic property) are replaced by blanks, and the corresponding strings are split according to the blanks.
After this I used nchar, but because the previous function splits the string into two substrings, I took the sum of the results.
> library(Unicode)
> sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 4
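To see why the sum is needed, the tokenization step can be approximated in base R. This is only a rough sketch of the idea described in the quoted documentation, not the Unicode package's own code: the helper name approx_alphabetic_tokens is made up for illustration, and it assumes that the regex class \p{L} (Unicode letters) is a close enough stand-in for the Alphabetic property, which is not exactly the same set.

```r
# Base R approximation of the tokenization idea (NOT the Unicode package itself).
# Assumption: \p{L} (Unicode letters) stands in for the Alphabetic property.
# Hypothetical helper; expects a single string.
approx_alphabetic_tokens <- function(x) {
  # Replace every run of non-letters (combining marks, spaces, ...) with one blank
  blanked <- gsub("[^\\p{L}]+", " ", x, perl = TRUE)
  # Split on blanks and drop any empty pieces
  tokens <- strsplit(blanked, " ", fixed = TRUE)[[1]]
  tokens[nzchar(tokens)]
}

x <- "n\u0325ala"            # n + COMBINING RING BELOW + "ala"
tokens <- approx_alphabetic_tokens(x)
print(tokens)                # the string comes back in two pieces
print(nchar(tokens))         # one count per piece
print(sum(nchar(tokens)))    # total number of letters
```

The diacritic is turned into a blank, so the string is cut at that position and nchar returns one number per piece. Summing them gives the count of base letters.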
I believe this package can be very useful in such cases, but I am not an expert and I do not know whether my solution works for all problems that involve phonetic alphabets. Other examples would help to test its validity.
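One way to cross-check the result is to compare it with a different technique: delete the combining marks with a regular expression and count what is left. The sketch below uses only base R. The helper name count_base_chars is hypothetical, and it assumes that every diacritic in the input is a separate combining mark (Unicode category M) rather than part of a precomposed letter.

```r
# Count characters after deleting combining marks (Unicode category M).
# Hypothetical helper for cross-checking; uses base R only.
count_base_chars <- function(x) {
  nchar(gsub("\\p{M}", "", x, perl = TRUE))
}

count_base_chars("n\u0325ala")         # one diacritic on n
count_base_chars("n\u0325\u0303ala")   # two diacritics stacked on n
count_base_chars("n\u0325a-la")        # the hyphen is still counted
```

The two approaches are not equivalent. This one keeps spaces, digits and punctuation in the count, whereas the tokenizer approach, going by the documentation quoted above, drops everything that is not alphabetic. Which one is right depends on what should count as a character in your data.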
Here is another example:
> x <- "e̯ ʊ̯"
> x
[1] "e̯ ʊ̯"
> nchar(x)
[1] 5
> sum(nchar(Unicode_alphabetic_tokenizer(x)))
[1] 2
P.S. There is only one " in the code, but when I copy and paste it a second one appears. I do not know why this happens.