Go web app security: should you check whether input is valid UTF-8?

According to several best practice documents, it is a good idea to check whether the input data is UTF-8 or not.

In my project, I use Gin and thus go-playground/validator for validation. There is an "ascii" validator, but no "utf-8" validator.

I found https://pkg.go.dev/unicode/utf8#ValidString and wondered whether checking the inputs with it would help, or whether valid UTF-8 is already guaranteed because Go itself uses Unicode internally.

Here is an example:

package main

import (
    "net/http"

    "github.com/gin-gonic/gin"
)

type User struct {
    Name string `json:"name" binding:"required,alphanum"`
}

func main() {
    r := gin.Default()
    r.POST("/user", createUserHandler)
    r.Run()
}

func createUserHandler(c *gin.Context) {
    var newUser User
    err := c.ShouldBindJSON(&newUser)

    if err != nil {
        c.AbortWithError(http.StatusBadRequest, err)
        return
    }

    c.Status(http.StatusCreated)
}

Is it guaranteed that, after calling c.ShouldBindJSON, the Name field in newUser is valid UTF-8? Is there any advantage in checking Name with utf8.ValidString?

Solution

Gin uses the standard encoding/json package to unmarshal JSON documents. The documentation for that package says:

When unmarshaling quoted strings, invalid UTF-8 or invalid UTF-16 surrogate pairs are not treated as an error. Instead, they are replaced by the Unicode replacement character U+FFFD.

The decoded string values are therefore guaranteed to be valid UTF-8, and there is no advantage to checking them with utf8.ValidString.
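A minimal sketch of that behavior, using encoding/json directly rather than going through Gin. The User struct here is a trimmed copy of the one in the question, and decodeName is a hypothetical helper introduced only for this illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

// User mirrors the struct from the question, without the Gin binding tag.
type User struct {
	Name string `json:"name"`
}

// decodeName (hypothetical helper) unmarshals a raw JSON document
// and returns the decoded name.
func decodeName(raw []byte) (string, error) {
	var u User
	if err := json.Unmarshal(raw, &u); err != nil {
		return "", err
	}
	return u.Name, nil
}

func main() {
	// The byte 0xff can never appear in well-formed UTF-8.
	raw := []byte("{\"name\":\"a\xffb\"}")
	fmt.Println(utf8.Valid(raw)) // false: the request body itself is invalid

	name, err := decodeName(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(utf8.ValidString(name)) // true: the decoded value is valid
	fmt.Printf("%+q\n", name)           // "a\ufffdb"
}
```

The invalid byte is not rejected. It is silently replaced, which is why a utf8.ValidString check after decoding cannot fail.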

Depending on the application requirements, you may want to check for and handle the Unicode replacement character, "�". Aside: As demonstrated by the � in this answer, Stack Overflow handles the Unicode replacement character like any other character.
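If you do want such a check, one possible sketch is below. hasReplacementChar is a hypothetical helper name, not something provided by Gin or go-playground/validator. Note that this cannot tell a replaced invalid byte apart from a U+FFFD that the client actually sent.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// hasReplacementChar (hypothetical helper) reports whether s contains U+FFFD.
// utf8.RuneError is the constant for U+FFFD. When given utf8.RuneError,
// strings.ContainsRune also matches invalid byte sequences, but a string
// decoded by encoding/json is always valid UTF-8, so only a real U+FFFD
// can match there.
func hasReplacementChar(s string) bool {
	return strings.ContainsRune(s, utf8.RuneError)
}

func main() {
	fmt.Println(hasReplacementChar("alice"))       // false
	fmt.Println(hasReplacementChar("al\ufffdice")) // true
}
```

In the handler from the question, you could call this on newUser.Name after ShouldBindJSON succeeds and respond with a 400 if it returns true.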

Regarding "Go itself uses Unicode internally?":

Some language features use UTF-8 encoding (range on string, conversions between []rune and string), but those features do not restrict the bytes that can be stored in a string. Strings can contain any sequence of bytes including invalid UTF-8.
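A short sketch of that distinction, with runesOf as a hypothetical helper that collects what a range loop yields:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesOf (hypothetical helper) collects the runes produced by ranging
// over a string.
func runesOf(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	// A string is just bytes: the invalid byte 0xff is stored as-is.
	s := "a\xffb"
	fmt.Println(len(s), utf8.ValidString(s)) // 3 false

	// range decodes UTF-8 and yields U+FFFD for the invalid byte.
	fmt.Printf("%U\n", runesOf(s)) // [U+0061 U+FFFD U+0062]

	// Converting through []rune rewrites the invalid byte as an encoded U+FFFD.
	fixed := string([]rune(s))
	fmt.Println(len(fixed), utf8.ValidString(fixed)) // 5 true
}
```

So a string that did not pass through a UTF-8-aware decoder (for example, one built from raw request bytes) can still be invalid, and that is the case where utf8.ValidString is useful.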