According to several best-practice documents, it is a good idea to check whether input data is valid UTF-8.
In my project, I use Gin and thus go-playground/validator for validation. There is an "ascii" validator, but no "utf-8" validator.
I found https://pkg.go.dev/unicode/utf8#ValidString and wondered whether it would help to check the inputs with it, or whether that is already a given, since Go itself uses Unicode internally.
Here is an example:
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

type User struct {
	Name string `json:"name" binding:"required,alphanum"`
}

func main() {
	r := gin.Default()
	r.POST("/user", createUserHandler)
	r.Run()
}

func createUserHandler(c *gin.Context) {
	var newUser User
	err := c.ShouldBindJSON(&newUser)
	if err != nil {
		c.AbortWithError(http.StatusBadRequest, err)
		return
	}
	c.Status(http.StatusCreated)
}
Is it guaranteed that, after calling c.ShouldBindJSON, Name in newUser is valid UTF-8? Is there any advantage in checking Name with utf8.ValidString?
Gin uses the standard encoding/json package to unmarshal JSON documents. The documentation for that package says:
When unmarshaling quoted strings, invalid UTF-8 or invalid UTF-16 surrogate pairs are not treated as an error. Instead, they are replaced by the Unicode replacement character U+FFFD.
It is ensured that the decoded string values are valid UTF-8. There is no advantage to checking string values with utf8.ValidString.
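To illustrate the documented behaviour, here is a small sketch using plain encoding/json (no Gin involved); the User type mirrors the one in the question, minus the binding tag:

package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

type User struct {
	Name string `json:"name"`
}

func main() {
	// The raw JSON contains an invalid UTF-8 byte (0xff) inside the quoted string.
	data := []byte("{\"name\":\"a\xffb\"}")

	var u User
	if err := json.Unmarshal(data, &u); err != nil {
		fmt.Println("unmarshal error:", err)
		return
	}

	// The invalid byte was replaced by U+FFFD, so the decoded value is valid UTF-8.
	fmt.Printf("%+q\n", u.Name)           // "a\ufffdb"
	fmt.Println(utf8.ValidString(u.Name)) // true
}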
Depending on the application requirements, you may want to check for and handle the Unicode replacement character, "�". Aside: As demonstrated by the � in this answer, SO handles the Unicode replacement character like any other character.
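A minimal sketch of such a check; containsReplacementChar is a made-up helper name. In the handler above you would call it on newUser.Name after ShouldBindJSON succeeds and reply with 400 Bad Request if it returns true.

package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// containsReplacementChar reports whether s contains U+FFFD. After JSON
// decoding, its presence usually means the client sent invalid UTF-8,
// although a client could also send U+FFFD deliberately. (Per the strings
// docs, ContainsRune with utf8.RuneError also matches any invalid UTF-8
// byte sequence, which cannot occur after JSON decoding.)
func containsReplacementChar(s string) bool {
	return strings.ContainsRune(s, utf8.RuneError) // utf8.RuneError == '\uFFFD'
}

func main() {
	fmt.Println(containsReplacementChar("a\uFFFDb")) // true
	fmt.Println(containsReplacementChar("abc"))      // false
}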
Regarding "Go itself uses Unicode internally":
Some language features use UTF-8 encoding (range on string, conversions between []rune and string), but those features do not restrict the bytes that can be stored in a string. Strings can contain any sequence of bytes, including invalid UTF-8.
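A short illustration of that, using only the standard library:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// A Go string is just an immutable sequence of bytes; it may hold invalid UTF-8.
	s := "\xff\xfe"

	fmt.Println(utf8.ValidString(s)) // false

	// range decodes UTF-8 and yields U+FFFD for each invalid byte,
	// but the underlying bytes are left untouched.
	for i, r := range s {
		fmt.Printf("index %d: %U\n", i, r)
	}
	// Output:
	// index 0: U+FFFD
	// index 1: U+FFFD
}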