haskell text utf-8 ghc bytestring

Get a `Char` from a `ByteString`


Is there a way to get the first UTF-8 Char in a ByteString in O(1) time? I'm looking for something like

headUtf8 :: ByteString -> Char
tailUtf8 :: ByteString -> ByteString

I'm not yet constrained to strict or lazy ByteString, but I'd prefer strict. For lazy ByteString, I can cobble something together via Text, but I'm not sure how efficient this is, especially in terms of space.

import Data.ByteString.Lazy (ByteString)
import qualified Data.Text.Lazy as T
import Data.Text.Lazy.Encoding (decodeUtf8With, encodeUtf8)
import Data.Text.Encoding.Error (lenientDecode)

headUtf8 :: ByteString -> Char
headUtf8 = T.head . decodeUtf8With lenientDecode

tailUtf8 :: ByteString -> ByteString
tailUtf8 = encodeUtf8 . T.tail . decodeUtf8With lenientDecode

In case anyone is interested, this problem arises when using Alex to make a lexer that supports UTF-8 characters. [1]

[1] I am aware that since Alex 3.0 you only need to provide alexGetByte (and that is great!), but I still need to be able to get characters in other code in the lexer.


Solution

  • You want the Data.ByteString.UTF8 module in the utf8-string package (Data.ByteString.Lazy.UTF8 is the lazy counterpart). It contains an uncons function with the following signature:

    uncons :: ByteString -> Maybe (Char, ByteString)
    

    You can then define:

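    import Data.ByteString (ByteString)
    import Data.ByteString.UTF8 (uncons)
    import Data.Maybe (fromJust)

    -- Partial: both fail on an empty ByteString, because uncons returns Nothing.
    headUtf8 :: ByteString -> Char
    headUtf8 = fst . fromJust . uncons

    tailUtf8 :: ByteString -> ByteString
    tailUtf8 = snd . fromJust . uncons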
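
Since fromJust crashes on an empty ByteString, a total variant may be more comfortable inside a lexer. The following is a minimal sketch, assuming the strict Data.ByteString.UTF8 module from utf8-string; the names headUtf8May, tailUtf8May and charsUtf8 are made up for illustration and are not part of the package. As far as I can tell, uncons only inspects the leading bytes of the first character, and dropping a prefix of a strict ByteString is a constant-time slice, so each step should be O(1) with no copying.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.UTF8 as U  -- from the utf8-string package
import Data.List (unfoldr)

-- Total variants (hypothetical names): Nothing on empty input
-- instead of a fromJust exception.
headUtf8May :: B.ByteString -> Maybe Char
headUtf8May = fmap fst . U.uncons

tailUtf8May :: B.ByteString -> Maybe B.ByteString
tailUtf8May = fmap snd . U.uncons

-- Peel characters off one at a time until the input is exhausted.
-- Illustrative only: U.toString already does this job.
charsUtf8 :: B.ByteString -> String
charsUtf8 = unfoldr U.uncons
```

In lexer code it is usually cheaper to pattern match on the result of uncons once than to call a head and a tail function separately, since each call decodes the first character again.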