
Questions about runes, strings & Unicode characters in Go


A string in Go is an immutable sequence of bytes. A byte is an alias for uint8. A rune is an alias for int32 and is used to store characters (Unicode code points).

Why do runes use int32 instead of uint32? There is no such thing as a negative character.

Strings are made of bytes, and a single byte is enough to store an ASCII character but not an arbitrary Unicode character. However, Go can store Unicode characters in strings, yet indexing such a character loses data. You can't implicitly convert a float64 to an int in Go, since that might lose information, but indexing a string that contains a Unicode character raises no error and silently loses data. How can I index a rune out of a string, instead of a byte?

Consider the following program and its output.

package main

import (
    "fmt"
)

func main() {
    x := "ඞ"
    y := x[0]

    z := 'ඞ'

    fmt.Printf("%s vs %c vs %c\n", x, y, z)
}

Output:

ඞ vs à vs ඞ

My impression is that a string stores a Unicode character by combining several bytes, since x[1] is also a valid index.


Solution

  • To take your questions in turn...

    Why is rune an int32 rather than uint32?

    I suspect this may have something to do with the native representation of ints at the machine level, which may be optimised for signed rather than unsigned ints.

    But ultimately it does not matter.

    First of all, Unicode code points (currently at least) only use the range 0x0000 to 0x10FFFF, i.e. you will never encounter a negative rune when dealing with legitimate Unicode.

    If there were such a thing as an int24, it would be sufficient: code points up to 0x10FFFF need only 21 bits. The upper 8 bits of an int32 (where the sign bit resides, obviously) are unused by Unicode code points.

    So it could be that this, rather than any "optimisation", is the reason for using int32.

    But even if the Unicode specification were to expand to the full 32-bit range, this still would not present a problem.

    Whether signed or unsigned, the internal representation would be consistent. So, for example, if some Go code were to exchange runes with other code that uses an unsigned type, there would be no problem, since fundamentally what is being exchanged are the 32 bits in each rune, not the interpretation overlaid on those 32 bits by any particular type.
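
    To illustrate the point about the 32 bits being the same either way, here is a small sketch. The function roundTrip is a name invented for this example; it converts a rune to uint32 and back, which in Go reinterprets the bits without changing them.

```go
package main

import "fmt"

// roundTrip converts a rune to uint32 and back again. Between int32 and
// uint32 the conversion keeps all 32 bits; only the interpretation changes.
// Hypothetical helper, for illustration only.
func roundTrip(r rune) rune {
	return rune(uint32(r))
}

func main() {
	r := 'ඞ'
	fmt.Println(roundTrip(r) == r) // true

	// Even a (non-Unicode) negative rune keeps its bit pattern.
	neg := rune(-1)
	fmt.Printf("%#x\n", uint32(neg)) // 0xffffffff
	fmt.Println(roundTrip(neg) == neg)
}
```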

    The sign might matter if you perform arithmetic on runes (for example, subtracting one rune from another can give a negative result), though if you are doing that I would expect you to understand runes well enough to manipulate them safely.

    Indexing a Byte in a String "loses data"

    No, indexing a string (which is effectively a read-only sequence of bytes) gives you precisely the data you asked for: the single byte at that index.

    Nothing is lost (or gained).

    If you want a rune represented by a sequence of bytes in a string then you need to ask for all of the bytes that represent that rune.
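
    As a sketch of what "asking for all of the bytes" can look like, the example below prints the raw bytes of the string from the question and then decodes the first rune with utf8.DecodeRuneInString from the standard library. The wrapper firstRune is a name chosen for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune decodes the rune that starts at byte offset 0 of s and
// reports how many bytes encode it. Hypothetical wrapper, for illustration.
func firstRune(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	x := "ඞ"
	fmt.Println(len(x))    // 3: len counts bytes, not characters
	fmt.Printf("% x\n", x) // e0 b6 9e: the UTF-8 encoding

	r, size := firstRune(x)
	fmt.Printf("%c uses %d bytes\n", r, size)
}
```

    The first byte, 0xe0, is what x[0] returns, and printing it with %c shows the unrelated character à (U+00E0).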

    Indexing a Rune in a String

    First convert the string to a []rune, then index as you would any other slice. Note that the conversion decodes and copies the whole string. So, given a string s and wishing to obtain the ith rune:

    r := []rune(s)[i]
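
    For completeness, here is a runnable sketch of the conversion above next to an alternative that avoids building the []rune. The function runeAt and its (rune, bool) signature are assumptions made for this example, not part of any library; it relies on the fact that ranging over a string yields runes rather than bytes.

```go
package main

import "fmt"

// runeAt returns the i-th rune (not byte) of s. It walks the string with
// range, which decodes UTF-8 one rune at a time, so no []rune copy is made.
// The second result is false when i is out of range.
// Hypothetical helper, for illustration only.
func runeAt(s string, i int) (rune, bool) {
	if i < 0 {
		return 0, false
	}
	n := 0
	for _, r := range s {
		if n == i {
			return r, true
		}
		n++
	}
	return 0, false
}

func main() {
	s := "aඞb"

	// Conversion, then index: simple, but decodes and copies all of s.
	fmt.Printf("%c\n", []rune(s)[1])

	// Walking the string: stops as soon as the i-th rune is found.
	if r, ok := runeAt(s, 1); ok {
		fmt.Printf("%c\n", r)
	}
}
```

    Both approaches take time proportional to the position of the rune, since UTF-8 is a variable-width encoding and rune boundaries have to be found by decoding.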