A string in Go is an immutable collection of bytes. A `byte` is an alias for `uint8`. A `rune` is an alias for `int32` that is used to store characters.

Why do runes use `int32` instead of `uint32`? There is no such thing as a negative character.

Strings use bytes, and a single byte is enough to store an ASCII character, but not a Unicode character. However, Go can store Unicode characters in strings, yet indexing a single character loses data. You can't convert a `float64` to an `int` implicitly in Go, since data might be lost, but indexing a string containing a Unicode character raises no error and just silently loses data. How can I index a `rune` out of a `string`, instead of a `byte`?
Consider the following program and its output.

```go
package main

import (
	"fmt"
)

func main() {
	x := "ඞ"
	y := x[0]
	z := 'ඞ'
	fmt.Printf("%s vs %c vs %c\n", x, y, z)
}
```

```
ඞ vs à vs ඞ
```
My impression is that, to store a Unicode character, a `string` combines several bytes, since it is also possible to index `x[1]`.
To take your questions in turn...
I suspect this may have something to do with native machine-level representations of integers, which may be optimised for signed rather than unsigned values. But ultimately it does not matter.
First of all, Unicode codepoints (currently, at least) only use the range 0x0000 to 0x10ffff, i.e. you will never encounter a negative rune when dealing with legitimate Unicode.

If there were such a thing as an `int24`, it would be sufficient: the upper 8 bits of an `int32` (where the sign bit resides) are unused by Unicode codepoints. So it could be that this is the reason for using `int32`, and it has nothing to do with "optimisation".
But even if the Unicode specification were to expand to the full 32-bit range, this still would not present a problem.
Whether signed or unsigned, the internal representation would be consistent. So, for example, if some Go code were to exchange runes with some other code, and that other code used an unsigned type, there would be no problem: fundamentally, what is being exchanged is the 32 bits in each rune, not the interpretation overlaid on those 32 bits by any particular type.
The sign might be important if performing arithmetic using runes, though if you were doing that I would expect you would have a deep understanding of runes and how to manipulate them safely (presumably for the purposes of some form of cryptography - I can't think of any other reason for doing rune arithmetic).
No, indexing a byte in a string (which is, in effect, a read-only sequence of bytes) gives you precisely the data you asked for: the single byte you specified. Nothing is lost (or gained).

If you want a rune represented by a sequence of bytes in a string, then you need to ask for all of the bytes that represent that `rune`.
First convert the string to a `[]rune`, then index as you would any other slice. So, given a string `s` and wishing to obtain the `i`th rune:

```go
r := []rune(s)[i]
```