
Counting characters in golang string


I am trying to count "characters" in Go. That is, if a string contains one printable "glyph" or "composed character" (what someone would ordinarily think of as a character), I want it to count as 1. For example, the string "Hello, 世🖖🏿🖖界" should count as 11, since a human would look at it and see 11 glyphs.

utf8.RuneCountInString() works well in most cases, including ASCII, accents, Asian characters and even basic emojis. However, as I understand it, runes correspond to code points, not characters. When I use emojis that have different skin tones, I get the wrong count: https://play.golang.org/p/aFIGsB6MsO
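A minimal sketch of the mismatch, assuming the skin-toned emoji is U+1F596 followed by the modifier U+1F3FF (the playground code itself is not reproduced here, and `runeCount` is just an illustrative wrapper name):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCount is a hypothetical thin wrapper around utf8.RuneCountInString,
// which counts code points, not user-perceived characters.
func runeCount(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	plain := "\U0001F596"           // the emoji on its own: one code point
	toned := "\U0001F596\U0001F3FF" // the emoji plus a skin tone modifier: two code points

	fmt.Println(len(plain), runeCount(plain)) // 4 1 (bytes, code points)
	fmt.Println(len(toned), runeCount(toned)) // 8 2

	// The example string from above, written with escapes for clarity.
	fmt.Println(runeCount("Hello, 世\U0001F596\U0001F3FF\U0001F596界")) // 12, not 11
}
```

The extra count comes from the skin tone modifier, which is a separate code point even though it renders as part of the preceding emoji.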

From what I read here and here, the following should work, but I still don't get the right result (it over-counts):

func CountCharactersInString(str string) int {
    var ia norm.Iter
    ia.InitString(norm.NFC, str)
    nc := 0
    for !ia.Done() {
        nc = nc + 1
        ia.Next()
    }
    return nc
}

This doesn't work either:

func GraphemeCountInString(str string) int {
    re := regexp.MustCompile("\\PM\\pM*|.")
    return len(re.FindAllString(str, -1))
}
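A small check that may explain the regular expression's over-count, as far as I can tell: `\pM` only matches code points in the Unicode Mark categories, and the skin tone modifier (assumed here to be U+1F3FF) is in category Sk rather than M, so it is matched as its own "character". The helper name `isMark` is only for illustration:

```go
package main

import (
	"fmt"
	"regexp"
	"unicode"
)

// graphemeRe is the same pattern as in GraphemeCountInString above.
var graphemeRe = regexp.MustCompile(`\PM\pM*|.`)

// isMark (hypothetical helper) reports whether r is in a Unicode Mark category (Mn, Mc, Me).
func isMark(r rune) bool {
	return unicode.Is(unicode.M, r)
}

func main() {
	const combiningAcute = '\u0301' // COMBINING ACUTE ACCENT
	const skinTone = '\U0001F3FF'   // assumed skin tone modifier code point

	fmt.Println(isMark(combiningAcute), isMark(skinTone)) // true false
	fmt.Println(unicode.Is(unicode.Sk, skinTone))         // true

	// A base letter plus a combining mark is grouped into one match...
	fmt.Println(len(graphemeRe.FindAllString("e\u0301", -1))) // 1
	// ...but an emoji plus a skin tone modifier is split into two.
	fmt.Println(len(graphemeRe.FindAllString("\U0001F596\U0001F3FF", -1))) // 2
}
```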

I am looking for something similar to this in Objective C:

+ (NSInteger)countCharactersInString:(NSString *) string {
    // --- Calculate the number of characters entered by user and update character count label
    NSInteger count = 0;
    NSUInteger index = 0;
    while (index < string.length) {
        NSRange range = [string rangeOfComposedCharacterSequenceAtIndex:index];
        count++;
        index += range.length;
    }
    return count;
}

Solution

  • I wrote a package that allows you to do this: https://github.com/rivo/uniseg. It breaks strings according to the rules specified in Unicode Standard Annex #29, which is what you are looking for. Here is how you would use it in your case:

    package main
    
    import (
        "fmt"
    
        "github.com/rivo/uniseg"
    )
    
    func main() {
        fmt.Println(uniseg.GraphemeClusterCount("Hello, 世🖖🏿🖖界"))
    }
    

    This will print 11 as you expect.