While working on my app I ran into a problem with Aeson apparently not decoding UTF-8 input. Digging deeper, I found that it relies on Attoparsec's Parser ByteString, which seems to me to be the source of the problem. But that's not actually what I'm asking about here.
The thing is, it's not the only place I've seen people use ByteString where, as seems obvious to me, only Text is appropriate: JSON is not some binary file, it is readable text, and it may very well contain UTF-8 characters.
So I am wondering whether I'm missing something and there are valid reasons to choose ByteString over Text, or whether this is simply a widespread case of bad library design, caused by most people caring little about any character set other than Latin.
I think your problem is just a misunderstanding.
Prelude> print "Ёжик лижет мёд."
"\1025\1078\1080\1082 \1083\1080\1078\1077\1090 \1084\1105\1076."
Prelude> putStrLn "\1025\1078\1080\1082 \1083\1080\1078\1077\1090 \1084\1105\1076."
Ёжик лижет мёд.
Prelude> "{\"a\": \"Ёжик лижет мёд.\"}"
"{\"a\": \"\1025\1078\1080\1082 \1083\1080\1078\1077\1090 \1084\1105\1076.\"}"
When you print a value containing a String, the Show instance for Char is used, and that escapes all characters with code points above 127. To get the glyphs you want, you need to putStr[Ln] the String.
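The difference between the two can be checked outside GHCi as well. The following is only a small sketch; the names sample and shown are made up for the illustration, and putStrLn is assumed to run in a UTF-8 locale:

```haskell
module Main where

-- A String holding four Cyrillic characters (hypothetical sample value).
sample :: String
sample = "Ёжик"

-- What print would write: the Show form, quoted and with escapes.
shown :: String
shown = show sample

main :: IO ()
main = do
  print sample           -- Show form: "\1025\1078\1080\1082"
  putStrLn sample        -- the characters themselves
  print (length sample)  -- 4: the String holds four Chars, not escape sequences
```

The escapes exist only in the text that show produces; the String itself is unchanged.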
So aeson properly decoded the UTF-8-encoded input, as should be expected, because it UTF-8-encodes the values itself:
encode = {-# SCC "encode" #-} encodeUtf8 . toLazyText . fromValue .
         {-# SCC "toJSON" #-} toJSON
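To see the decoding side for yourself, something along these lines should do. It is a sketch that assumes the aeson, text, bytestring and containers packages are available; input and decoded are names invented for the example, and the target type Map Text Text is just a convenient choice for a flat object of strings:

```haskell
module Main where

import Data.Aeson (decode)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Map as M
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- The JSON document as UTF-8 bytes, the way it would arrive from a file
-- or a socket. It is built via Text so the literal is encoded explicitly.
input :: BL.ByteString
input = TLE.encodeUtf8 (TL.pack "{\"a\": \"Ёжик лижет мёд.\"}")

-- Decode the bytes into a map from keys to Text values.
decoded :: Maybe (M.Map T.Text T.Text)
decoded = decode input

main :: IO ()
main = case decoded >>= M.lookup (T.pack "a") of
  Nothing -> putStrLn "decoding failed"
  Just t  -> TIO.putStrLn t  -- writes the text itself, not its Show form
```

If the value is printed with print instead of TIO.putStrLn, the escapes reappear, which is the misunderstanding described above.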
So, on to the question of why aeson uses ByteString and not Text as the final target of encoding and the starting point of decoding.
Because that is the appropriate type. The encoded values are intended to be transferred portably between machines. That happens as a stream of bytes (octets, if we're in a pedantic mood). That is exactly what a ByteString provides: a sequence of bytes that then has to be interpreted in an application-specific way. For the purposes of aeson, the stream of bytes must be encoded in UTF-8: aeson assumes the input of the decode function is valid UTF-8, and encodes its output as valid UTF-8.
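The split between "bytes on the wire" and "text in the program" can be sketched with the text and bytestring packages alone. The names message, wire, roundTrips and rejectsGarbage are invented for this illustration:

```haskell
module Main where

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Text inside the program: a sequence of characters.
message :: T.Text
message = T.pack "мёд"

-- The same text as it would travel between machines: UTF-8 bytes.
wire :: B.ByteString
wire = TE.encodeUtf8 message

-- Interpreting the bytes as UTF-8 gives the original text back.
roundTrips :: Bool
roundTrips = either (const False) (== message) (TE.decodeUtf8' wire)

-- Arbitrary bytes need not be valid UTF-8; 0xFF never occurs in it.
rejectsGarbage :: Bool
rejectsGarbage = either (const True) (const False) (TE.decodeUtf8' (B.pack [0xFF]))

main :: IO ()
main = do
  print (T.length message)  -- 3 characters
  print (B.length wire)     -- 6 bytes, two per Cyrillic letter
  print (B.unpack wire)     -- [208,188,209,145,208,180]
  print roundTrips
  print rejectsGarbage
```

The ByteString carries no encoding information of its own; it is the agreement on UTF-8 that lets the receiving side turn the bytes back into text.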
Transferring e.g. Text would run into portability problems, since a 16-bit encoding depends on endianness, so Text is not an appropriate format for interchanging data between machines. Note that aeson uses Text as an intermediate type when encoding (and presumably also when decoding), because that is an appropriate type to use at intermediate stages.
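The endianness point can be illustrated with the explicit UTF-16 encoders from Data.Text.Encoding. This is a sketch of 16-bit encodings in general, not a statement about how any particular version of text lays out its values in memory; letter is a made-up name:

```haskell
module Main where

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- A single character, U+0401.
letter :: T.Text
letter = T.pack "Ё"

main :: IO ()
main = do
  -- One 16-bit code unit, but two possible byte orders.
  print (B.unpack (TE.encodeUtf16LE letter))  -- [1,4]
  print (B.unpack (TE.encodeUtf16BE letter))  -- [4,1]
  -- UTF-8 is defined byte by byte, so there is only one possible result.
  print (B.unpack (TE.encodeUtf8 letter))     -- [208,129]
```

A receiver of 16-bit data has to know or guess the sender's byte order, whereas UTF-8 bytes mean the same thing on every machine.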