
Simplest way to remove BOM from Haskell ByteString


I have a lazy ByteString that may start with a BOM. What is the simplest, and preferably efficient, way to remove the BOM from it?


Solution

  • I feel like I must be misunderstanding the problem. Doesn't this boil down to checking the first three bytes of a bytestring and conditionally dropping those bytes?

    • To get the first 3 bytes use take.
    • To check bytestring equality use (==).
    • To drop the first 3 bytes use drop.

    Putting these together we get:

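    import qualified Data.ByteString.Lazy as BS

    -- Drop a leading UTF-8 BOM (EF BB BF) if present; otherwise return the input unchanged.
    dropBOM :: BS.ByteString -> BS.ByteString
    dropBOM bs | BS.take 3 bs == BS.pack [0xEF,0xBB,0xBF] = BS.drop 3 bs
               | otherwise = bs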
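
As a possible alternative, the same check can be sketched with stripPrefix, which I believe Data.ByteString.Lazy has exported since bytestring 0.10.8. The names utf8BOM and stripBOM are made up for this illustration, and like the version above it only handles the UTF-8 BOM, not the UTF-16 or UTF-32 ones.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Maybe (fromMaybe)

-- The three bytes that encode U+FEFF in UTF-8.
-- (utf8BOM and stripBOM are hypothetical names, not library functions.)
utf8BOM :: BL.ByteString
utf8BOM = BL.pack [0xEF, 0xBB, 0xBF]

-- stripPrefix returns Nothing when the prefix is absent,
-- in which case we fall back to the original input.
stripBOM :: BL.ByteString -> BL.ByteString
stripBOM bs = fromMaybe bs (BL.stripPrefix utf8BOM bs)
```

Either way only the start of the input needs to be inspected, so the rest of the lazy ByteString should not be forced by the check.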
    

    However, even after dealing with lots of UTF-8, I never felt I needed to deal with the BOM explicitly, thanks to packages like text that provide most of the desired operations. Perhaps you can solve your problem some other way than by manually munging the bytestring.
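
If the data ends up as Text anyway, one way to avoid touching the bytes is to decode first and then drop a leading U+FEFF character. This is a minimal sketch under assumptions: decodeWithoutBOM is a hypothetical name, the input is assumed to be valid UTF-8 (decodeUtf8 throws an exception otherwise), and as far as I know decodeUtf8 leaves the BOM in the decoded text rather than removing it.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE
import Data.Maybe (fromMaybe)

-- Decode lazily as UTF-8, then drop a leading U+FEFF if there is one.
-- (decodeWithoutBOM is a hypothetical helper, not part of the text package.)
decodeWithoutBOM :: BL.ByteString -> TL.Text
decodeWithoutBOM bs = fromMaybe t (TL.stripPrefix (TL.singleton '\xFEFF') t)
  where
    -- Assumes valid UTF-8; invalid input raises an exception when forced.
    t = TLE.decodeUtf8 bs
```

For input that may be malformed, decodeUtf8With from the same module with an error handler such as lenientDecode might be a better fit.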