Hello, I am a bit confused about all the Haskell modules needed for encoding data from String
to ByteString
for efficient writing.
I do not understand how you convert a Data.ByteString.Lazy
to a Data.ByteString.Char8
and vice versa.
What do I need to know? I can't make sense of all the possible combinations of these modules:
Data.ByteString
,Data.ByteString.Lazy
,Data.ByteString.Char8
, then there's Data.Text
... What do I need in order to write strings to files easily and efficiently, and vice versa (with proper encoding)?
P.S. I am currently reading Real World Haskell and got pretty confused by all these modules.
Here's a shot at a roadmap.
As you are probably aware, the Haskell String
type is just a type synonym for [Char]
, where Char
is a data type that can represent a single Unicode code point. This makes String
the perfect data type to represent textual data, except for the minor issue that -- as a linked list of boxed Char
values -- it has the potential to be extremely inefficient.
The Text
data type from the text
package addresses this issue. Text
is also, like String
, a representation of a list of Char
values, but instead of using an actual Haskell list, it uses a time and space-efficient representation. It should be your go-to replacement for String
whenever you need to work efficiently with textual (Unicode) data.
Like many other data types in the standard Haskell libraries, it comes in lazy and strict variants. Both variants have the same name Text
, but they are contained in separate modules, so you might do:
import qualified Data.Text as TS
import qualified Data.Text.Lazy as TL
if you needed to use both TS.Text
and TL.Text
variants in the same program.
The exact difference between variants is described in the documentation for Data.Text. In a nutshell, you should default to using the strict version. You only use the lazy version in two cases. First, if you're planning to work on a large Text
value a little bit at a time, treating it more like a text "stream" than a "string", then the lazy version is a good choice. (For example, a program to read a huge CSV file of numbers might read the file as a long lazy Text
stream and store the results in an efficient numeric type like a Vector
of unboxed Double
values to avoid keeping the whole input text in memory.) Second, if you're building a large Text
string up from lots of little pieces, then you don't want to use the strict versions, because their immutability means they need to be copied whenever you add something. Instead, you'd want to use the lazy variant with functions from Data.Text.Lazy.Builder
.
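As a minimal sketch of the "lots of little pieces" case, the following assembles a lazy Text value with a Builder. The function name numberLines is made up for illustration; decimal comes from Data.Text.Lazy.Builder.Int in the text package.

```haskell
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as B
import qualified Data.Text.Lazy.Builder.Int as B (decimal)
import qualified Data.Text.Lazy.IO as TLIO

-- Hypothetical helper: one decimal number per line, from 1 to n.
-- Each piece is appended to a Builder, which avoids re-copying the
-- accumulated text on every append.
numberLines :: Int -> TL.Text
numberLines n = B.toLazyText (foldMap line [1 .. n])
  where
    line i = B.decimal i <> B.singleton '\n'

main :: IO ()
main = TLIO.putStr (numberLines 5)
```

If the rest of the program wants a strict value, TL.toStrict converts the result in one final copy.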
The ByteString
data type from the bytestring
package, on the other hand, is an efficient representation of a list of bytes. Just like Text
is an efficient version of [Char]
, you should think of ByteString
as an efficient version of [Word8]
, where Word8
is the Haskell type representing a single unsigned byte of data with value 0-255. Equivalently, you can think of a ByteString
as representing a chunk of memory or a chunk of data to be read from or written to a file, precisely as-is byte for byte. It also comes in lazy and strict flavors:
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
and the considerations for using the variants are similar to those for Text
.
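To make the "efficient [Word8]" view concrete, here is a small sketch that moves between a list of bytes and both ByteString variants. The value name pngMagic is only illustrative.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Four raw bytes, stored exactly as given.
magicBytes :: [Word8]
magicBytes = [0x89, 0x50, 0x4E, 0x47]

pngMagic :: BS.ByteString
pngMagic = BS.pack magicBytes

main :: IO ()
main = do
  print (BS.unpack pngMagic)               -- the same bytes back, as a [Word8]
  print (BS.length pngMagic)               -- length is counted in bytes
  print (BL.unpack (BL.pack magicBytes))   -- the lazy variant has the same interface
```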
In a Haskell program, it's usual to represent Unicode strings internally as either String
or Text
values. However, to read them in from or write them out to files, they need to be encoded into and decoded from sequences of bytes.
The simplest way of dealing with this is to use Haskell functions that handle the encoding and decoding automatically. As you are probably aware, there are already two functions in the Prelude
that read and write strings directly:
readFile :: FilePath -> IO String
writeFile :: FilePath -> String -> IO ()
In addition, there are readFile
and writeFile
functions in text
that do this. You can find versions in both Data.Text.IO
and Data.Text.Lazy.IO
. They appear to have the same signatures, but one is operating on the strict Text
type and the other is operating on the lazy Text
type:
readFile :: FilePath -> IO Text
writeFile :: FilePath -> Text -> IO ()
You can tell these functions are doing the encoding and decoding automatically because they return and accept Text
values, not ByteString
values. The default encoding used will depend on the operating system and its configuration. On a typical modern Linux distribution, it'll be UTF-8.
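A short sketch of the automatic route, using the strict Text versions. The file name greeting.txt and the helper shout are placeholders, not anything from the question.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Hypothetical processing step that works purely on Text.
shout :: T.Text -> T.Text
shout = T.toUpper

main :: IO ()
main = do
  -- Encoding and decoding happen inside writeFile/readFile,
  -- using the default encoding described above.
  TIO.writeFile "greeting.txt" (T.pack "hello, world\n")
  contents <- TIO.readFile "greeting.txt"
  TIO.putStr (shout contents)
```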
Alternatively, you can read or write the raw bytes from the file using functions from the bytestring
package (again, either lazy or strict versions, depending on the module):
readFile :: FilePath -> IO ByteString
writeFile :: FilePath -> ByteString -> IO ()
These have the same names as the text
versions, but you can see they are dealing with raw bytes because they return and accept ByteString
values. In this case, if you want to use these ByteString
s as text data, you'll need to decode or encode them yourself. If the ByteString
represents a UTF-8 encoded version of the text for example, then these functions from Data.Text.Encoding
(for strict versions) or Data.Text.Lazy.Encoding
(for lazy versions) are what you're looking for:
decodeUtf8 :: ByteString -> Text
encodeUtf8 :: Text -> ByteString
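Here is a sketch of doing the UTF-8 step by hand with strict values. One caveat worth knowing: decodeUtf8 throws an exception on malformed input, so this sketch uses decodeUtf8' from the same module, which returns an Either instead. The names writeUtf8, readUtf8, decodeBytes and the file name utf8-demo.txt are made up for the example.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Decode UTF-8 bytes, reporting malformed input as a Left
-- instead of throwing an exception.
decodeBytes :: BS.ByteString -> Either String T.Text
decodeBytes = either (Left . show) Right . TE.decodeUtf8'

-- Encode explicitly as UTF-8, then write the raw bytes.
writeUtf8 :: FilePath -> T.Text -> IO ()
writeUtf8 path = BS.writeFile path . TE.encodeUtf8

-- Read the raw bytes, then decode them explicitly.
readUtf8 :: FilePath -> IO (Either String T.Text)
readUtf8 path = decodeBytes <$> BS.readFile path

main :: IO ()
main = do
  writeUtf8 "utf8-demo.txt" (T.pack "caf\233\n")  -- '\233' is U+00E9
  readUtf8 "utf8-demo.txt" >>= print
```

Unlike the Text I/O functions, this reads and writes UTF-8 regardless of how the operating system is configured.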
Now, the modules Data.ByteString.Char8
and Data.ByteString.Lazy.Char8
are a special case. When plain ASCII text has been encoded using one of several "ASCII-preserving" encoding schemes (including ASCII itself, Latin-1 and other Latin-x encodings, and UTF-8), it turns out that the encoded ByteString
is just a simple one-byte-per-character encoding of Unicode code points 0 to 127. Slightly more generally, when text has been encoded in Latin-1, then the encoded ByteString
is just a simple one-byte-per-character encoding of Unicode code points 0 to 255. In these cases, and in these cases only, the functions in these modules can be safely used to bypass the explicit encoding and decoding steps and just treat the byte string as ASCII and/or Latin-1 text directly by automatically converting single bytes to Unicode Char
values and back.
Because these functions only work in that special case, you should generally avoid using them except in specialized applications.
Also, as was mentioned in a comment, the ByteString
variants in these Char8
modules are not any different than the plain strict and lazy ByteString
variants; they are just treated as if they were strings of Char
values instead of Word8
values by the functions in those modules -- the data types are the same, just the function interface is different.
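The following sketch illustrates both points: the Char8 pack function produces an ordinary strict ByteString, and it silently keeps only the low 8 bits of any character outside the Latin-1 range. The value names are illustrative.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- BC.pack returns the same type as BS.pack; only the interface differs.
ascii :: BS.ByteString
ascii = BC.pack "abc"

-- '\955' is the Greek letter lambda, U+03BB. Char8 truncates it to
-- its low byte, 0xBB, losing information.
viaChar8 :: [Word8]
viaChar8 = BS.unpack (BC.pack "\955")

-- Going through Text and an explicit encoder yields the correct
-- two-byte UTF-8 sequence 0xCE 0xBB.
viaUtf8 :: [Word8]
viaUtf8 = BS.unpack (TE.encodeUtf8 (T.pack "\955"))

main :: IO ()
main = do
  print (BS.unpack ascii)  -- [97,98,99]
  print viaChar8           -- [187]
  print viaUtf8            -- [206,187]
```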
So, if you're working with plain text and your operating system's default encoding, just use the strict Text
data type from Data.Text
and the (highly efficient) IO functions from Data.Text.IO
. You can use the lazy variants for stream processing or building big strings from tiny pieces, and you can use Data.Text.Read
for some simple parsing.
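As an illustration of the kind of simple parsing Data.Text.Read supports, this sketch parses whitespace-separated integers. The function name parseInts and its error message are made up here; signed and decimal are readers from that module.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Read as TR

-- Parse every whitespace-separated word as a (possibly negative) Int.
-- A reader returns the parsed value plus the unconsumed rest of the
-- input, so we also insist that nothing is left over.
parseInts :: T.Text -> Either String [Int]
parseInts = traverse parseOne . T.words
  where
    parseOne w = case TR.signed TR.decimal w of
      Right (n, rest)
        | T.null rest -> Right n
        | otherwise   -> Left ("trailing characters in " ++ T.unpack w)
      Left err        -> Left err

main :: IO ()
main = do
  print (parseInts (T.pack "1 -2 30"))
  print (parseInts (T.pack "4x"))
```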
You should be able to avoid using String
at all in most situations, but if you find you need to convert back and forth, then these conversion functions in Data.Text
(or Data.Text.Lazy
) will be useful:
pack :: String -> Text
unpack :: Text -> String
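For instance, a String-based function can be wrapped so the rest of the program only sees Text. In this sketch, legacyGreet is a stand-in for whatever String API you are stuck with.

```haskell
import qualified Data.Text as T

-- Hypothetical String-based function, standing in for a legacy API.
legacyGreet :: String -> String
legacyGreet name = "Hello, " ++ name ++ "!"

-- Text wrapper: convert at the boundary only, and do any other
-- processing (here, trimming whitespace) on the Text side.
greet :: T.Text -> T.Text
greet = T.pack . legacyGreet . T.unpack . T.strip

main :: IO ()
main = print (greet (T.pack "  Ada "))
```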
If you need more control over the encoding, you still want to use Text
throughout your program, except at the "edges" where you're reading or writing files. At those edges, use the I/O functions from Data.ByteString
(or Data.ByteString.Lazy
), and the encoding/decoding functions from Data.Text.Encoding
or Data.Text.Lazy.Encoding
.
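A sketch of that edge pattern, assuming the input happens to be Latin-1 and the output should be UTF-8. The file names and the transcode helper are placeholders; decodeLatin1 lives in Data.Text.Encoding alongside the UTF-8 functions.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Decode at the input edge, work on Text in the middle,
-- encode at the output edge.
transcode :: BS.ByteString -> BS.ByteString
transcode = TE.encodeUtf8 . process . TE.decodeLatin1
  where
    -- Any pure Text processing would go here; this sketch does nothing.
    process :: T.Text -> T.Text
    process = id

main :: IO ()
main = do
  -- "café" in Latin-1: the é is the single byte 0xE9.
  BS.writeFile "latin1-demo.txt" (BS.pack [0x63, 0x61, 0x66, 0xE9])
  bytes <- BS.readFile "latin1-demo.txt"
  BS.writeFile "utf8-out.txt" (transcode bytes)
```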
If you find you need to mix strict and lazy variants, note that Data.Text.Lazy
contains:
toStrict :: TL.Text -> TS.Text -- convert lazy to strict
fromStrict :: TS.Text -> TL.Text -- vice versa
and Data.ByteString.Lazy
contains the corresponding functions for ByteString
values:
toStrict :: BL.ByteString -> BS.ByteString
fromStrict :: BS.ByteString -> BL.ByteString
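Coming back to the original question of converting between Data.ByteString.Lazy and Data.ByteString.Char8: since the Char8 type is the plain strict ByteString, the conversion is just toStrict and fromStrict. A small sketch, with helper names invented for illustration:

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as TS
import qualified Data.Text.Lazy as TL

-- BC.ByteString and BS.ByteString are the same type, so a lazy
-- ByteString becomes a "Char8" ByteString via toStrict.
lazyToChar8 :: BL.ByteString -> BC.ByteString
lazyToChar8 = BL.toStrict

char8ToLazy :: BC.ByteString -> BL.ByteString
char8ToLazy = BL.fromStrict

-- The result can be handed to functions from either module.
sameType :: BS.ByteString
sameType = lazyToChar8 (BL.pack [104, 105])

main :: IO ()
main = do
  BC.putStrLn sameType
  print (TL.toStrict (TL.fromStrict (TS.pack "text round trip")))
```

Note that toStrict has to copy the whole lazy value into one contiguous chunk, so it is best avoided on very large inputs.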