
How to get the number of characters in a string


How can I get the number of characters of a string in Go?

For example, if I have a string "hello" the method should return 5. I saw that len(str) returns the number of bytes and not the number of characters so len("£") returns 2 instead of 1 because £ is encoded with two bytes in UTF-8.


Solution

  • You can try RuneCountInString from the utf8 package.

    returns the number of runes in p

    As this script illustrates, the byte length of "世界" ("World" in Chinese) is 6, but its rune count is 2:

    package main
        
    import "fmt"
    import "unicode/utf8"
        
    func main() {
        fmt.Println("Hello, 世界", len("世界"), utf8.RuneCountInString("世界"))
    }
    

    Phrozen adds in the comments:

    Actually you can do len() over runes by just type casting.
    len([]rune("世界")) will print 2. At least in Go 1.3.
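    As a quick sanity check (a minimal sketch using only the standard library), the byte length, the `utf8` rune count, and the `[]rune` conversion can be compared side by side:

    ```go
    package main

    import (
    	"fmt"
    	"unicode/utf8"
    )

    func main() {
    	s := "£"
    	fmt.Println(len(s))                    // 2: len counts bytes; £ is two bytes in UTF-8
    	fmt.Println(utf8.RuneCountInString(s)) // 1: runes (Unicode code points)
    	fmt.Println(len([]rune(s)))            // 1: same count via conversion to a rune slice
    }
    ```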


    And with CL 108985 (May 2018, for Go 1.11), len([]rune(string)) is now optimized. (Fixes issue 24923)

    The CL adds a new runtime function to count runes in a string, and modifies the compiler to detect the len([]rune(string)) pattern automatically and replace it with a call to that rune-counting function, avoiding the allocation of the intermediate []rune slice.

    RuneCount/lenruneslice/ASCII        27.8ns ± 2%  14.5ns ± 3%  -47.70%
    RuneCount/lenruneslice/Japanese      126ns ± 2%    60ns ± 2%  -52.03%
    RuneCount/lenruneslice/MixedLength   104ns ± 2%    50ns ± 1%  -51.71%
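    If you want to compare the two approaches on your own toolchain, here is a rough sketch using the standard testing.Benchmark helper (the sample string is arbitrary; the absolute numbers will differ from the CL's figures):

    ```go
    package main

    import (
    	"fmt"
    	"testing"
    	"unicode/utf8"
    )

    func main() {
    	s := "Hello, 世界"

    	// Benchmark the explicit utf8 call.
    	r1 := testing.Benchmark(func(b *testing.B) {
    		for i := 0; i < b.N; i++ {
    			_ = utf8.RuneCountInString(s)
    		}
    	})

    	// Benchmark the len([]rune(...)) pattern, which Go 1.11+ optimizes.
    	r2 := testing.Benchmark(func(b *testing.B) {
    		for i := 0; i < b.N; i++ {
    			_ = len([]rune(s))
    		}
    	})

    	fmt.Println("RuneCountInString:", r1)
    	fmt.Println("len([]rune):      ", r2)
    }
    ```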
    

    Stefan Steiger points to the blog post "Text normalization in Go"

    What is a character?

    As was mentioned in the strings blog post, characters can span multiple runes.
    For example, an 'e' and '◌́' (acute accent "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character.

    The definition of a character may vary depending on the application.
    For normalization we will define it as:

    • a sequence of runes that starts with a starter, a rune that does not modify or combine backwards with any other rune,
    • followed by a possibly empty sequence of non-starters, that is, runes that do (typically accents).

    The normalization algorithm processes one character at a time.
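    To make the rune-versus-character distinction concrete, here is a small sketch using only the standard library: the NFD form "e\u0301" is two runes but one user-perceived character, while the precomposed NFC form "\u00e9" is a single rune:

    ```go
    package main

    import (
    	"fmt"
    	"unicode/utf8"
    )

    func main() {
    	nfd := "e\u0301" // 'e' followed by COMBINING ACUTE ACCENT (NFD)
    	nfc := "\u00e9"  // precomposed 'é' (NFC)

    	fmt.Println(utf8.RuneCountInString(nfd)) // 2 runes, one user-perceived character
    	fmt.Println(utf8.RuneCountInString(nfc)) // 1 rune, same character
    }
    ```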

    Using that package and its Iter type, the actual number of "characters" would be:

    package main
        
    import "fmt"
    import "golang.org/x/text/unicode/norm"
        
    func main() {
        var ia norm.Iter
        ia.InitString(norm.NFKD, "école")
        nc := 0
        for !ia.Done() {
            nc = nc + 1
            ia.Next()
        }
        fmt.Printf("Number of chars: %d\n", nc)
    }
    

    Here, this uses the Unicode normalization form NFKD ("Compatibility Decomposition").


    Oliver's answer points to UNICODE TEXT SEGMENTATION as the only way to reliably determine default boundaries between certain significant text elements: user-perceived characters, words, and sentences.

    For that, you need an external library like rivo/uniseg, which does Unicode Text Segmentation.

    That will actually count "grapheme clusters", where multiple code points may be combined into one user-perceived character.

    package main
        
    import (
        "fmt"
        
        "github.com/rivo/uniseg"
    )
        
    func main() {
        gr := uniseg.NewGraphemes("👍🏼!")
        for gr.Next() {
            fmt.Printf("%x ", gr.Runes())
        }
        // Output: [1f44d 1f3fc] [21]
    }
    

    Two graphemes, even though there are three runes (Unicode code points).

    You can see other examples in "How to manipulate strings in GO to reverse them?"

    👩🏾‍🦰 alone is one grapheme but, as a Unicode code-point converter shows, 4 runes: U+1F469 (woman), U+1F3FE (medium-dark skin tone), U+200D (zero-width joiner), and U+1F9B0 (red hair).