Serializing a Data.Text value to a ByteString without unnecessary \NUL bytes


With the following code, I want to serialize a Data.Text value to a ByteString. Unfortunately my text is prepended with unnecessary NUL bytes and an EOT byte:

GHCi, version 9.4.4: https://www.haskell.org/ghc/  :? for help
ghci> import qualified Data.Text as T
ghci> import Data.Binary
ghci> import Data.Binary.Put
ghci> let txt = T.pack "Text"
ghci> runPut $ put txt
"\NUL\NUL\NUL\NUL\NUL\NUL\NUL\EOTText"
ghci>

Questions:

  • Why are these NUL and EOT bytes generated?
  • How can I avoid them in the resulting ByteString?

PS: In the real code I put the length in front of the text:

    foo :: Text -> ByteString
    foo txt = runPut $ do
        -- T.length returns an Int, so convert it for putWord32host
        putWord32host (fromIntegral (T.length txt))
        put txt

Solution

  • The Binary instance for Text already encodes the length in the output. If we look at the source code of the Text instance of Binary, we see [src]:

    instance Binary Text where
        put t = put (encodeUtf8 t)
        get   = do
          bs <- get
          case decodeUtf8' bs of
            P.Left exn -> P.fail (P.show exn)
            P.Right a -> P.return a

    That is not much of a surprise: the text is encoded as UTF-8, which produces a ByteString, and put is then called on that ByteString. The length is added when the ByteString itself is put. Indeed, the ByteString instance of Binary looks like [src]:

    instance Binary B.ByteString where
        put bs = put (B.length bs)
                 <> putByteString bs
        get    = get >>= getByteString

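    The put for the ByteString produced by encodeUtf8 thus first writes the length of the ByteString as an eight-byte big-endian integer, followed by the bytes themselves. For "Text" the length is 4, so the prefix consists of seven \NUL bytes (0x00) and one \EOT byte (0x04). Note that this is the number of bytes in the UTF-8 encoding, which is not necessarily the same as the number of characters in the Text.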
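
    To make the length prefix visible, the serialized output can be unpacked into individual bytes. The following is only a small sketch; the helper name serializedBytes is made up for illustration, and it assumes a version of text that ships the Binary Text instance (as in the GHCi session from the question):

```haskell
import Data.Binary (encode)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import Data.Word (Word8)

-- Hypothetical helper: the raw bytes produced by the Binary Text instance.
-- encode is the same as runPut . put.
serializedBytes :: T.Text -> [Word8]
serializedBytes = BL.unpack . encode

-- serializedBytes (T.pack "Text")
--   == [0,0,0,0,0,0,0,4,84,101,120,116]
-- The first eight bytes are the length 4; the rest are the UTF-8 bytes.

-- serializedBytes (T.pack "\233")   -- a single character, U+00E9
--   == [0,0,0,0,0,0,0,2,195,169]
-- One character, but the prefix says 2, because it counts UTF-8 bytes.
```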

    If you want the same effect, but without the length prefix, you can use:

    import Data.Text.Encoding
    
    runPut (putByteString (encodeUtf8 txt))
    

    This writes only the UTF-8 bytes and thus omits the length header.
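
    If you still want your own 32-bit length in front of the text, as in the foo function from the question, one possible approach is to write the byte length of the UTF-8 encoding yourself and then the raw bytes. This is a sketch rather than a definitive implementation: the names putTextWithLen and getTextWithLen are hypothetical, and it assumes the reader and writer share the same host endianness, since it keeps putWord32host from the question:

```haskell
import Data.Binary.Get (Get, getByteString, getWord32host, runGet)
import Data.Binary.Put (putByteString, putWord32host, runPut)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', encodeUtf8)

-- Hypothetical writer: a 4-byte length prefix (in bytes, not characters)
-- followed by the UTF-8 encoded text, with no extra 8-byte header.
putTextWithLen :: T.Text -> BL.ByteString
putTextWithLen txt = runPut $ do
    let bytes = encodeUtf8 txt
    putWord32host (fromIntegral (B.length bytes))
    putByteString bytes

-- Hypothetical reader matching the format above.
getTextWithLen :: Get T.Text
getTextWithLen = do
    n <- getWord32host
    bytes <- getByteString (fromIntegral n)
    case decodeUtf8' bytes of
        Left err -> fail (show err)
        Right t  -> return t
```

    Using B.length on the encoded bytes rather than T.length on the text matters here: the reader needs to know how many bytes to consume, and the two counts differ as soon as the text contains non-ASCII characters.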