Encoding and efficient IO in Haskell


Hello, I am a bit confused about all the Haskell modules needed to encode data from String to ByteString for efficient writing.

I do not understand how you convert a Data.ByteString.Lazy to a Data.ByteString.Char8 and vice versa.

What do I need to know? I can't make sense of all the possible combinations: Data.ByteString, Data.ByteString.Lazy, Data.ByteString.Char8, and then there's Data.Text. What do I need in order to write strings to files easily and efficiently, and read them back, with proper encoding?

P.S. I am currently reading Real World Haskell and got pretty confused by all these modules.


Solution

  • Here's a shot at a roadmap.

    Strings and Text

    As you are probably aware, the Haskell String type is just a type synonym for [Char], where Char is a data type that can represent a single Unicode code point. This makes String the perfect data type to represent textual data, except for the minor issue that -- as a linked list of boxed Char values -- it has the potential to be extremely inefficient.

    The Text data type from the text package addresses this issue. Text is also, like String, a representation of a list of Char values, but instead of using an actual Haskell list, it uses a time and space-efficient representation. It should be your go-to replacement for String whenever you need to work efficiently with textual (Unicode) data.

    Like many other data types in the standard Haskell libraries, it comes in lazy and strict variants. Both variants have the same name Text, but they are contained in separate modules, so you might do:

    import qualified Data.Text as TS
    import qualified Data.Text.Lazy as TL
    

    if you needed to use both TS.Text and TL.Text variants in the same program.

    The exact difference between variants is described in the documentation for Data.Text. In a nutshell, you should default to using the strict version. You only use the lazy version in two cases. First, if you're planning to work on a large Text value a little bit at a time, treating it more like a text "stream" than a "string", then the lazy version is a good choice. (For example, a program to read a huge CSV file of numbers might read the file as a long lazy Text stream and store the results in an efficient numeric type like a Vector of unboxed Double values to avoid keeping the whole input text in memory.) Second, if you're building a large Text string up from lots of little pieces, then you don't want to use the strict versions, because their immutability means they need to be copied whenever you add something. Instead, you'd want to use the lazy variant with functions from Data.Text.Lazy.Builder.
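
    As a rough illustration of the second case, here is a minimal sketch of building text from many small pieces with Data.Text.Lazy.Builder. The names line and numberedLines are made up for this example; the library calls (fromString, singleton, decimal, toLazyText) come from the text package.

```haskell
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as B
import Data.Text.Lazy.Builder.Int (decimal)

-- Hypothetical helper: render one line such as "item7\n" as a Builder.
line :: Int -> B.Builder
line n = B.fromString "item" <> decimal n <> B.singleton '\n'

-- Appending Builders is cheap; the lazy Text is produced once,
-- at the end, by toLazyText.
numberedLines :: Int -> TL.Text
numberedLines n = B.toLazyText (foldMap line [1 .. n])
```

    The point of the design is that no intermediate strict Text is copied on each append; the pieces are only laid out in memory when toLazyText runs.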

    ByteStrings

    The ByteString data type from the bytestring package, on the other hand, is an efficient representation of a list of bytes. Just like Text is an efficient version of [Char], you should think of ByteString as an efficient version of [Word8], where Word8 is the Haskell type representing a single unsigned byte of data with value 0-255. Equivalently, you can think of a ByteString as representing a chunk of memory or a chunk of data to be read from or written to a file, precisely as-is byte for byte. It also comes in lazy and strict flavors:

    import qualified Data.ByteString as BS
    import qualified Data.ByteString.Lazy as BL
    

    and the considerations for using the variants are similar to those for Text.
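
    To make the "[Word8]" view concrete, here is a small sketch using pack and unpack from Data.ByteString. The names rawBytes, packed and roundTrip are only for illustration.

```haskell
import qualified Data.ByteString as BS
import Data.Word (Word8)

-- The bytes of the ASCII string "ABC\n", written as raw Word8 values.
rawBytes :: [Word8]
rawBytes = [65, 66, 67, 10]

-- pack turns the list into the compact representation...
packed :: BS.ByteString
packed = BS.pack rawBytes

-- ...and unpack turns it back into a list of bytes.
roundTrip :: [Word8]
roundTrip = BS.unpack packed
```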

    Reading and Writing to Files

    In a Haskell program, it's usual to represent Unicode strings internally as either String or Text values. However, to read them in from or write them out to files, they need to be encoded into and decoded from sequences of bytes.

    The simplest way of dealing with this is to use Haskell functions that handle the encoding and decoding automatically. As you are probably aware, there are already two functions in the Prelude that read and write strings directly:

    readFile :: FilePath -> IO String
    writeFile :: FilePath -> String -> IO ()
    

    In addition, there are readFile and writeFile functions in text that do this. You can find versions in both Data.Text.IO and Data.Text.Lazy.IO. They appear to have the same signatures, but one is operating on the strict Text type and the other is operating on the lazy Text type:

    readFile :: FilePath -> IO Text
    writeFile :: FilePath -> Text -> IO ()
    

    You can tell these functions are doing the encoding and decoding automatically because they return and accept Text values, not ByteString values. The default encoding used will depend on the operating system and its configuration. On a typical modern Linux distribution, it'll be UTF-8.
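
    For instance, a minimal sketch of this "automatic" style with the strict Text functions might look like the following. The function name upperCaseFile is hypothetical, and the file paths are whatever the caller supplies.

```haskell
import qualified Data.Text as TS
import qualified Data.Text.IO as TIO

-- Read a file as strict Text (decoded using the locale's encoding),
-- upper-case it, and write the result to another file.
upperCaseFile :: FilePath -> FilePath -> IO ()
upperCaseFile src dst = do
  contents <- TIO.readFile src
  TIO.writeFile dst (TS.toUpper contents)
```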

    Alternatively, you can read or write the raw bytes from the file using functions from the bytestring package (again, either lazy or strict versions, depending on the module):

    readFile :: FilePath -> IO ByteString
    writeFile :: FilePath -> ByteString -> IO ()
    

    These have the same names as the text versions, but you can see they are dealing with raw bytes because they return and accept ByteString values. In this case, if you want to use these ByteStrings as text data, you'll need to decode or encode them yourself. If the ByteString represents a UTF-8 encoded version of the text, for example, then these functions from Data.Text.Encoding (for strict versions) or Data.Text.Lazy.Encoding (for lazy versions) are what you're looking for:

    decodeUtf8 :: ByteString -> Text
    encodeUtf8 :: Text -> ByteString
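
    One caveat worth knowing: decodeUtf8 throws an exception when the bytes are not valid UTF-8. Data.Text.Encoding also exports decodeUtf8', which returns an Either instead. The sketch below uses the strict variants; the names encoded and safeDecode are made up for this example.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as TS
import qualified Data.Text.Encoding as TE

-- "café": the final character (U+00E9) takes two bytes in UTF-8,
-- so four characters become five bytes.
encoded :: BS.ByteString
encoded = TE.encodeUtf8 (TS.pack "caf\233")

-- decodeUtf8' reports malformed input as a Left instead of throwing;
-- here the error detail is discarded and Nothing is returned.
safeDecode :: BS.ByteString -> Maybe TS.Text
safeDecode bytes = either (const Nothing) Just (TE.decodeUtf8' bytes)
```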
    

    The Char8 Modules

    Now, the modules Data.ByteString.Char8 and Data.ByteString.Lazy.Char8 are a special case. When plain ASCII text has been encoded using one of several "ASCII-preserving" encoding schemes (including ASCII itself, Latin-1 and other Latin-x encodings, and UTF-8), it turns out that the encoded ByteString is just a simple one-byte-per-character encoding of Unicode code points 0 to 127. Slightly more generally, when text has been encoded in Latin-1, then the encoded ByteString is just a simple one-byte-per-character encoding of Unicode code points 0 to 255. In these cases, and in these cases only, the functions in these modules can be safely used to bypass the explicit encoding and decoding steps and just treat the byte string as ASCII and/or Latin-1 text directly by automatically converting single bytes to Unicode Char values and back.

    Because these functions only work in that special case, you should generally avoid using them except in specialized applications.

    Also, as was mentioned in a comment, the ByteString variants in these Char8 modules are not any different than the plain strict and lazy ByteString variants; they are just treated as if they are strings of Char values instead of Word8 values by the functions in those modules -- the data types are the same, just the function interface is different.
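
    A small sketch of what goes wrong outside that special case. Data.ByteString.Char8.pack truncates each Char to 8 bits, so a character above code point 255 is silently mangled, whereas encodeUtf8 keeps it intact. The names asciiBytes, lossy and faithful are just for this illustration.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as TS
import qualified Data.Text.Encoding as TE

-- BC.pack returns the ordinary strict ByteString type,
-- so this type-checks against BS.ByteString.
asciiBytes :: BS.ByteString
asciiBytes = BC.pack "hello"

-- U+03BB (Greek small lambda) does not fit in one byte:
-- Char8 keeps only the low 8 bits, giving the single byte 0xBB.
lossy :: BS.ByteString
lossy = BC.pack "\955"

-- Proper UTF-8 encoding yields the two bytes 0xCE 0xBB.
faithful :: BS.ByteString
faithful = TE.encodeUtf8 (TS.pack "\955")
```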

    General Strategy

    So, if you're working with plain text and your operating system's default encoding, just use the strict Text data type from Data.Text and the (highly efficient) IO functions from Data.Text.IO. You can use the lazy variants for stream processing or building big strings from tiny pieces, and you can use Data.Text.Read for some simple parsing.

    You should be able to avoid using String at all in most situations, but if you find you need to convert back and forth, then these conversion functions in Data.Text (or Data.Text.Lazy) will be useful:

    pack :: String -> Text
    unpack :: Text -> String
    

    If you need more control over the encoding, you still want to use Text throughout your program, except at the "edges" where you're reading or writing files. At those edges, use the I/O functions from Data.ByteString (or Data.ByteString.Lazy), and the encoding/decoding functions from Data.Text.Encoding or Data.Text.Lazy.Encoding.
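
    A minimal sketch of that "edges" pattern, assuming the files are meant to be UTF-8 regardless of locale. The helper names readUtf8 and writeUtf8 are hypothetical; decodeUtf8With and lenientDecode come from the text package.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as TS
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (lenientDecode)

-- Read raw bytes and decode them as UTF-8, independent of the locale.
-- lenientDecode substitutes U+FFFD for malformed sequences
-- instead of throwing an exception.
readUtf8 :: FilePath -> IO TS.Text
readUtf8 path = TE.decodeUtf8With lenientDecode <$> BS.readFile path

-- Encode Text as UTF-8 and write the raw bytes.
writeUtf8 :: FilePath -> TS.Text -> IO ()
writeUtf8 path = BS.writeFile path . TE.encodeUtf8
```

    Whether lenient decoding is appropriate depends on the application; a stricter program might prefer decodeUtf8' and handle the error explicitly.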

    If you find you need to mix strict and lazy variants, note that Data.Text.Lazy contains:

    toStrict :: TL.Text -> TS.Text     -- convert lazy to strict
    fromStrict :: TS.Text -> TL.Text   -- vice versa
    

    and Data.ByteString.Lazy contains the corresponding functions for ByteString values:

    toStrict :: BL.ByteString -> BS.ByteString
    fromStrict :: BS.ByteString -> BL.ByteString
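
    This also answers the original question about converting between Data.ByteString.Lazy and Data.ByteString.Char8: because the Char8 module works on the ordinary strict ByteString type, the conversion is just toStrict or fromStrict. A short sketch, with made-up function names:

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL

-- Lazy to "Char8": the Char8 ByteString is the plain strict type,
-- so toStrict is all that is needed (it copies the chunks together).
lazyToChar8 :: BL.ByteString -> BC.ByteString
lazyToChar8 = BL.toStrict

-- "Char8" to lazy: wraps the strict value as a lazy ByteString.
char8ToLazy :: BC.ByteString -> BL.ByteString
char8ToLazy = BL.fromStrict
```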