Why do library designers use ByteString where Text seems to be appropriate?


While working on my app I ran into a problem with Aeson not decoding UTF-8 input. Digging deeper, I found that it relies on Attoparsec's Parser ByteString, which seems to me to be the source of the problem. But that's not actually what I'm asking about here.

The thing is, this is not the only place I've seen people use ByteString where, as seems obvious to me, only Text is appropriate: JSON is not a binary format, it is readable text, and it may very well contain non-ASCII characters encoded as UTF-8.

So I am wondering whether I'm missing something and there are valid reasons to choose ByteString over Text, or whether this is simply a widespread case of bad library design, caused by most people caring little about any character set other than Latin.


Solution

  • I think your problem is just a misunderstanding.

    Prelude> print "Ёжик лижет мёд."
    "\1025\1078\1080\1082 \1083\1080\1078\1077\1090 \1084\1105\1076."
    Prelude> putStrLn "\1025\1078\1080\1082 \1083\1080\1078\1077\1090 \1084\1105\1076."
    Ёжик лижет мёд.
    Prelude> "{\"a\": \"Ёжик лижет мёд.\"}"
    "{\"a\": \"\1025\1078\1080\1082 \1083\1080\1078\1077\1090 \1084\1105\1076.\"}"
    

    When you print a value containing a String, the Show instance for Char is used, and that escapes all characters with code points above 127. GHCi does the same when it echoes the value of an expression, as in the last example. To get the glyphs you want, you need to putStr[Ln] the String.

    So aeson decoded the UTF-8-encoded input properly, as should be expected, because it UTF-8-encodes its own output:

    encode = {-# SCC "encode" #-} encodeUtf8 . toLazyText . fromValue .
             {-# SCC "toJSON" #-} toJSON
    

    Now to the question of why aeson uses ByteString and not Text as the final target of encoding and the starting point of decoding.

    Because that is the appropriate type. The encoded values are intended to be transferred portably between machines. That happens as a stream of bytes (octets, if we're in a pedantic mood). That is exactly what a ByteString provides: a sequence of bytes that then has to be interpreted in an application-specific way. For the purposes of aeson, the stream of bytes is UTF-8: aeson assumes the input of the decode function is valid UTF-8, and it encodes its output as valid UTF-8.
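
    As a rough sketch of that round trip (assuming the aeson, text, bytestring and containers packages are available; the names jsonText, jsonBytes and decoded are made up for illustration), the bytes on the wire are UTF-8 and the decoded values come back as Text:

    ```haskell
    {-# LANGUAGE OverloadedStrings #-}
    import           Data.Aeson           (decode)
    import qualified Data.ByteString.Lazy as BL
    import           Data.Map.Strict      (Map)
    import qualified Data.Map.Strict      as Map
    import           Data.Text            (Text)
    import qualified Data.Text.Encoding   as TE
    import qualified Data.Text.IO         as TIO

    -- Hypothetical input: the JSON document as text inside the program.
    jsonText :: Text
    jsonText = "{\"a\": \"Ёжик лижет мёд.\"}"

    -- What would travel between machines: the same document as UTF-8 bytes.
    jsonBytes :: BL.ByteString
    jsonBytes = BL.fromStrict (TE.encodeUtf8 jsonText)

    -- aeson's decode consumes the bytes and yields Text keys and values.
    decoded :: Maybe (Map Text Text)
    decoded = decode jsonBytes

    main :: IO ()
    main = case decoded of
      Nothing -> putStrLn "decoding failed"
      -- Text's putStrLn writes the characters themselves; print would
      -- go through show and escape them.
      Just m  -> mapM_ TIO.putStrLn (Map.elems m)
    ```

    Whether the terminal then renders the Cyrillic correctly depends on its locale settings, not on aeson.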

    Transferring Text itself, for example, would run into portability problems: its internal 16-bit encoding (UTF-16 in the versions of the text package current at the time of writing) depends on endianness, so Text is not an appropriate format for interchange of data between machines. Note that aeson uses Text as an intermediate type when encoding (and presumably also when decoding), because that is an appropriate type to use at intermediate stages.
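
    To illustrate the endianness point, here is a small sketch using the encoders from Data.Text.Encoding (the names sample, utf16le, utf16be, utf8 and misread are made up for illustration). The same character has two different 16-bit byte layouts but only one UTF-8 layout, and reading the 16-bit bytes with the wrong byte order silently produces a different character:

    ```haskell
    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.ByteString    as B
    import qualified Data.Text          as T
    import qualified Data.Text.Encoding as TE
    import           Data.Word          (Word8)

    -- A single character, U+0401 (CYRILLIC CAPITAL LETTER IO).
    sample :: T.Text
    sample = "Ё"

    utf16le, utf16be, utf8 :: [Word8]
    utf16le = B.unpack (TE.encodeUtf16LE sample)  -- low byte first
    utf16be = B.unpack (TE.encodeUtf16BE sample)  -- high byte first
    utf8    = B.unpack (TE.encodeUtf8 sample)     -- byte order is fixed

    -- Little-endian bytes read as big-endian: no error, wrong character.
    misread :: T.Text
    misread = TE.decodeUtf16BE (TE.encodeUtf16LE sample)

    main :: IO ()
    main = do
      print utf16le                           -- [1,4]
      print utf16be                           -- [4,1]
      print utf8                              -- [208,129]
      print (map fromEnum (T.unpack misread)) -- [260], i.e. U+0104
    ```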