
ByteString assumes ISO-8859-1?


The documentation for Data.ByteString.hGetContents says:

As with hGet, the string representation in the file is assumed to be ISO-8859-1.

Why should it have to "assume" anything about the "string representation in the file"? The data is not necessarily strings or encoded text at all. If I wanted something to deal with encoded text I'd use Data.Text or perhaps Data.ByteString.Char8. I thought the whole point of ByteString is that the data is handled as a list of 8-bit bytes, not as text characters. What is the impact of the assumption that it is ISO-8859-1?


Solution

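  • It's a roundabout way of saying that no decoding is performed. ISO-8859-1 is a single-byte encoding in which each byte value maps to the Unicode code point with the same number, so "assuming" it means nothing has to be converted. hGetContents simply gives you the raw bytes, each in the range 0x00 - 0xFF, whatever the file actually contains: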

    $ cat utf-8.txt
    ÇÈÄ
    $ iconv -f iso8859-1 iso8859-1.txt
    ÇÈÄ
    $ ghci
    > import System.IO
    > import qualified Data.ByteString as BS
    > openFile "iso8859-1.txt" ReadMode >>= (\h -> fmap BS.unpack $ BS.hGetContents h)
    [199,200,196,10]
    > openFile "utf-8.txt" ReadMode >>= (\h -> fmap BS.unpack $ BS.hGetContents h)
    [195,135,195,136,195,132,10]
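
The assumption only becomes visible once the bytes are viewed as characters, for example through Data.ByteString.Char8. The following is a minimal sketch of that, not part of the original answer: the byte values are taken from the session above (without the trailing newline), and `bytesAsLatin1` is a hypothetical helper name introduced only for illustration. It uses nothing beyond `base` and `bytestring`.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import Data.Char (chr, ord)

-- "ÇÈÄ" as stored in the ISO-8859-1 file and in the UTF-8 file.
latin1Bytes, utf8Bytes :: BS.ByteString
latin1Bytes = BS.pack [199, 200, 196]
utf8Bytes   = BS.pack [195, 135, 195, 136, 195, 132]

-- Hypothetical helper: read each byte as the Char with the same code point,
-- which is exactly the ISO-8859-1 interpretation.
bytesAsLatin1 :: BS.ByteString -> String
bytesAsLatin1 = map (chr . fromIntegral) . BS.unpack

main :: IO ()
main = do
  -- The Word8 view is untouched by any encoding assumption.
  print (BS.unpack latin1Bytes)
  -- Char8.unpack agrees with the byte-to-code-point mapping.
  print (bytesAsLatin1 latin1Bytes == BC.unpack latin1Bytes)
  -- UTF-8 data is not decoded: six bytes become six Chars, not three.
  print (map ord (BC.unpack utf8Bytes))
```

So as long as you stay with `Word8` values, the ISO-8859-1 remark has no effect. It matters only if you treat the bytes as `Char`s, in which case UTF-8 input shows up as one character per byte rather than being decoded.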