I am going through Go By Example, and the strings and runes section is terribly confusing.
Running this:
sample := "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
fmt.Println(sample)
fmt.Printf("%%q: %q\n", sample)
fmt.Printf("%%+q: %+q\n", sample)
yields this:
��=� ⌘
%q: "\xbd\xb2=\xbc ⌘"
%+q: "\xbd\xb2=\xbc \u2318"
..which is fine. The 1st, 2nd and 4th rune seem to be non-printable, which I guess means that \xbd
, \xb2
and \xbc
are simply not supported by Unicode or something (correct me if im wrong), and so they show up as �. Both %q
and %+q
also correctly escape those 3 non-printable runes.
But now when I iterate over the string like so:
for _, runeValue := range sample {
fmt.Printf("% x, %q, %+q\n", runeValue, runeValue, runeValue)
}
suddenly the 3 non-printable runes are not escaped by %q
and remain as �, and %+q
attempts to reveal their underlying code point, which is obviously incorrect:
fffd, '�', '\ufffd'
fffd, '�', '\ufffd'
3d, '=' , '='
fffd, '�', '\ufffd'
20, ' ' , ' '
2318, '⌘', '\u2318'
Even stranger, if I iterate over the string as a byte slice:
for _, runeValue := range []byte(sample) {
fmt.Printf("% x, %q, %+q\n", runeValue, runeValue, runeValue)
}
suddenly, these runes are no longer non-printable, and their underlying code points are correct:
bd, '½', '\u00bd'
b2, '²', '\u00b2'
3d, '=', '='
bc, '¼', '\u00bc'
20, ' ', ' '
e2, 'â', '\u00e2'
8c, '\u008c', '\u008c'
98, '\u0098', '\u0098'
Can someone explain whats happening here?
fmt.Printf
will do lot of magic under the covers to render as much useful information via type inspection etc. If you want to verify if a string (or a byte slice) is valid UTF-8
use the standard library package encoding/utf8
.
For example:
import "unicode/utf8"
var sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
fmt.Printf("%q valid? %v\n", sample, utf8.ValidString(sample)) // reports "false"
Scanning the individual runes of the string we can identify what makes this string invalid (from a UTF-8
encoding perspective). Note: the hex value 0xfffd
indicates a bad rune was encounter. This error value is defined as a package constant utf8.RuneError:
for _, r := range sample {
validRune := r != utf8.RuneError // is 0xfffd? i.e. bad rune?
if validRune {
fmt.Printf("'%c' validRune: true hex: %4x\n", r, r)
} else {
fmt.Printf("'%c' validRune: false\n", r)
}
}
https://go.dev/play/p/9NO9xMvcxCp
produces:
'�' validRune: false
'�' validRune: false
'=' validRune: true hex: 3d
'�' validRune: false
' ' validRune: true hex: 20
'⌘' validRune: true hex: 2318