With the following code, I want to serialize a Data.Text value to a ByteString. Unfortunately my text is prepended with unnecessary NUL bytes and an EOT byte:
GHCi, version 9.4.4: https://www.haskell.org/ghc/ :? for help
ghci> import qualified Data.Text as T
ghci> import Data.Binary
ghci> import Data.Binary.Put
ghci> let txt = T.pack "Text"
ghci> runPut $ put txt
"\NUL\NUL\NUL\NUL\NUL\NUL\NUL\EOTText"
ghci>
Questions:
PS: In the real code I put the length in front of the text:
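foo :: Text -> ByteString
foo txt = runPut $ do
    -- T.length returns an Int, so convert it to the Word32 that putWord32host expects
    putWord32host (fromIntegral (T.length txt))
    put txt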
put already encodes the length in the binary string. Indeed, if we look at the source code for the Text instance of Binary, we see [src]:
instance Binary Text where
    put t = put (encodeUtf8 t)
    get = do
        bs <- get
        case decodeUtf8' bs of
            P.Left exn -> P.fail (P.show exn)
            P.Right a -> P.return a
That's not much of a surprise: we encode the text to UTF-8, which produces a ByteString, and then use put on that one. The length is added when we put the ByteString itself. Indeed, the ByteString instance of Binary looks like [src]:
instance Binary B.ByteString where
    put bs = put (B.length bs) <> putByteString bs
    get = get >>= getByteString
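The put for the ByteString produced by encodeUtf8 thus writes eight bytes to specify the size of the ByteString, because B.length returns an Int, which Binary serializes as a 64-bit big-endian integer. For "Text" that is seven NUL bytes followed by EOT (byte value 4). Note that this is the number of bytes in the UTF-8 encoding, which is not necessarily the same as the number of characters in the Text.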
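To make the difference between bytes and characters concrete, here is a minimal sketch that reads the header back. The helper names lengthPrefix, utf8Length and payload are made up for this illustration and are not part of any library; it also assumes a text version that ships the Binary Text instance, as in the session above.

```haskell
import Data.Binary (decode, encode)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)

-- Hypothetical helper: decode the first eight bytes that put writes
-- in front of a Text as a 64-bit big-endian integer.
lengthPrefix :: T.Text -> Int64
lengthPrefix = decode . BL.take 8 . encode

-- Hypothetical helper: number of bytes in the UTF-8 encoding of the text.
utf8Length :: T.Text -> Int
utf8Length = B.length . encodeUtf8

-- Hypothetical helper: everything after the eight-byte header.
payload :: T.Text -> BL.ByteString
payload = BL.drop 8 . encode
```

For T.pack "h\233llo" (the word "héllo"), T.length gives 5, whereas lengthPrefix and utf8Length both give 6, because the accented character takes two bytes in UTF-8.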
If you want the same effect, but without the length prefix, you can use:
import Data.Text.Encoding
runPut (putByteString (encodeUtf8 txt))
This omits the length header.
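If you do want your own, shorter length header as in the PS, one possible sketch is below. The names putTextRaw, putText32, getText32 and roundTrip are hypothetical, and the choice of putWord32be instead of putWord32host is an assumption made here so that the output does not depend on the machine's byte order. The important part is that the prefix counts UTF-8 bytes rather than characters, so the reader knows how many bytes to consume.

```haskell
import Data.Binary.Get (Get, getByteString, getWord32be, runGet)
import Data.Binary.Put (Put, putByteString, putWord32be, runPut)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', encodeUtf8)

-- Hypothetical helper: raw UTF-8 bytes, no header at all.
putTextRaw :: T.Text -> BL.ByteString
putTextRaw = runPut . putByteString . encodeUtf8

-- Hypothetical helper: 32-bit big-endian byte count, then the UTF-8 bytes.
-- Assumes the encoded text is shorter than 2^32 bytes.
putText32 :: T.Text -> Put
putText32 t = do
    let bs = encodeUtf8 t
    putWord32be (fromIntegral (B.length bs))
    putByteString bs

-- Hypothetical helper: the matching reader for putText32.
getText32 :: Get T.Text
getText32 = do
    n <- getWord32be
    bs <- getByteString (fromIntegral n)
    case decodeUtf8' bs of
        Left err -> fail (show err)
        Right t -> pure t

-- Hypothetical helper: serialize and immediately parse again.
roundTrip :: T.Text -> T.Text
roundTrip = runGet getText32 . runPut . putText32
```

With this layout, T.pack "Text" is written as four header bytes (three NUL bytes and EOT) followed by the four bytes of the text itself.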