performance haskell binary bytestring

How to improve this very slow and inefficient Haskell program to process binary files byte by byte?


I am trying to write a hexdump-like program in Haskell. I wrote the following program; it works and gives the desired output, but it is very slow and inefficient. It was adapted from the program given in this answer.

I ran the program on a sample file, and it takes about 1 minute to process that file, which is smaller than 1 MB. The standard Linux hexdump program does the job in about a second. All I want the program to do is read, process, and write every individual byte of a ByteString.

Here is the question: how can I efficiently read, process, and write a ByteString byte by byte (i.e. without using functions like getWord32le, if that's what is needed)? I want to do arithmetic and logical operations on each individual byte, not necessarily on a Word32le or a similar group of bytes. I didn't find any data type like Byte.

Anyway, here is the code I wrote, which runs successfully in ghci (version 7.4):

module Main where

import Data.Time.Clock
import Data.Char
import qualified Data.ByteString.Lazy as BIN
import Data.ByteString.Lazy.Char8
import Data.Binary.Get
import Data.Binary.Put
import System.IO
import Numeric (showHex, showIntAtBase)

main = do
  let infile = "rose_rosebud_flower.jpg"
  let outfile = "rose_rosebud_flower.hex"
  h_in  <- openFile infile ReadMode
  System.IO.putStrLn "before time: "
  t1 <- getCurrentTime >>= return . utctDayTime
  System.IO.putStrLn $ (show t1)
  process_file h_in outfile
  System.IO.putStrLn "after time: "
  t2 <- getCurrentTime >>= return . utctDayTime
  System.IO.putStrLn $ (show t2)
  hClose h_in

process_file h_in outfile = do 
  eof <- hIsEOF h_in
  if eof 
      then return ()
      else do  bin1 <- BIN.hGet h_in 1
               let str = (Data.ByteString.Lazy.Char8.unpack) bin1
               let hexchar = getHex str
               System.IO.appendFile outfile hexchar
               process_file h_in outfile

getHex (b:[]) = (tohex $ ord b) ++ " " 
getHex _ = "ERR "

tohex d = showHex d ""

When I run it in ghci I get:

*Main> main
before time: 
23254.13701s
after time: 
23313.381806s

Please provide modified (but complete and working) code as an answer, not just a list of function names. Also, don't provide solutions that use JPEG or other image processing libraries, as I am not interested in image processing; I used the JPEG image only as an example of a non-text file. I just want to process data byte by byte. Also, don't provide links to other sites (especially to the documentation, or the lack of it, on the Haskell site). I cannot understand the documentation for bytestring and for many other packages on the Haskell site; their documentation (which is, in most cases, just type signatures collected on a page) seems meant only for experts who already understand most of the material. If I could figure out the solution by reading their documentation or even the much-advertised Real World Haskell (RWH) book, I would not have asked this question in the first place.

Sorry for the seeming rant, but the experience with Haskell is frustrating as compared to many other languages, especially when it comes to doing even simple IO as the Haskell IO related documentation with small complete working examples is almost absent.


Solution

  • Your example code reads one byte at a time. That's pretty much guaranteed to be slow. Better still, it reads a 1-byte ByteString and then immediately converts it to a list, negating all the benefits of ByteString. Best of all, it writes to the output file by the slightly strange method of opening the file, appending a single character, and then closing the file again. So for every individual hex character written out, the file has to be completely opened, wound to the end, have a character appended, and then flushed to disk and closed again.
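
    The three costs described above (one-byte reads, conversion to a list, and reopening the output file for every write) can all be avoided with plain handles. What follows is a minimal sketch, not a finished hexdump: it assumes strict Data.ByteString, an arbitrary 64 KiB chunk size, and hypothetical names (copyChunks, sample_in.bin, sample_out.bin). It only copies bytes, so a real program would transform each chunk before writing it.

    ```haskell
    import qualified Data.ByteString as B
    import System.IO

    -- Hypothetical chunk size; any reasonably large value should do.
    chunkSize :: Int
    chunkSize = 64 * 1024

    -- Copy everything from one open handle to another, one chunk at a time.
    -- Both handles stay open for the whole loop.
    copyChunks :: Handle -> Handle -> IO ()
    copyChunks hIn hOut = do
      chunk <- B.hGet hIn chunkSize        -- returns an empty ByteString at EOF
      if B.null chunk
        then return ()
        else do
          B.hPut hOut chunk                -- a real program would transform chunk here
          copyChunks hIn hOut

    main :: IO ()
    main = do
      -- Create a small test input so the sketch runs on its own.
      B.writeFile "sample_in.bin" (B.replicate 200000 0x41)
      withBinaryFile "sample_in.bin" ReadMode $ \hIn ->
        withBinaryFile "sample_out.bin" WriteMode $ \hOut ->
          copyChunks hIn hOut
    ```

    Each file is opened exactly once, and every read or write moves up to 64 KiB instead of a single byte.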

    I'm not 100% sure what you're trying to achieve here (i.e., trying to learn how stuff works vs trying to make a specific program work), so I'm not sure exactly how best to answer your question.

    If this is your very first foray into Haskell, starting with something I/O-centric is probably a bad idea. You would be better off learning the rest of the language before worrying about how to do high-performance I/O. That said, let me try to answer your actual question...

    First, there is no type named "byte". The type you're looking for is called Word8 (if you want an unsigned 8-bit integer) or Int8 (if you want a signed 8-bit integer — which you probably don't). There are also types like Word16, Word32, Word64; you need to import Data.Word to get them. Similarly, Int16, Int32 and Int64 live in Data.Int. The Int and Integer types are automatically imported, so you don't need to do anything special for those.
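
    Since the question asks about arithmetic and logical operations on single bytes, here is a small sketch of what Word8 supports. The helper names (nibbles, addWrapping) are made up for illustration; the operators come from Data.Bits.

    ```haskell
    import Data.Bits (complement, shiftR, xor, (.&.))
    import Data.Word (Word8)

    -- Split a byte into its high and low 4-bit halves.
    nibbles :: Word8 -> (Word8, Word8)
    nibbles b = (b `shiftR` 4, b .&. 0x0F)

    -- Word8 arithmetic wraps around modulo 256.
    addWrapping :: Word8 -> Word8 -> Word8
    addWrapping x y = x + y

    main :: IO ()
    main = do
      print (nibbles 0xA7)                  -- prints (10,7)
      print (addWrapping 250 10)            -- prints 4
      print (0xFF `xor` (0x0F :: Word8))    -- prints 240
      print (complement (0 :: Word8))       -- prints 255
    ```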

    A ByteString is basically an array of bytes. A [Word8], on the other hand, is a singly linked list of individual bytes which may or may not be computed yet. It is much less efficient, but far more flexible.

    If literally all you want to do is apply a transformation to every single byte, independent of any other byte, then the ByteString package provides a map function which will do exactly that:

    map :: (Word8 -> Word8) -> ByteString -> ByteString
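
    As a quick illustration of that signature, the sketch below applies a per-byte function to a small in-memory ByteString. The function invert is a hypothetical example, and the lazy module is imported qualified so that its map does not clash with the Prelude's.

    ```haskell
    import Data.Bits (xor)
    import qualified Data.ByteString.Lazy as BIN
    import Data.Word (Word8)

    -- Hypothetical per-byte transformation: flip every bit.
    invert :: Word8 -> Word8
    invert b = b `xor` 0xFF

    main :: IO ()
    main = print (BIN.unpack (BIN.map invert (BIN.pack [0x00, 0x0F, 0xFF])))
    -- prints [255,240,0]
    ```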
    

    If you just want to read from one file and write to another, you can do that using so-called "lazy I/O". This is a neat dodge where the library handles all the I/O chunking for you. It has a few nasty gotchas though; basically revolving around it being hard to know exactly when the input file will get closed. For simple cases, that doesn't matter. For more complicated cases, it does.

    So how does it work? Well, the lazy half of the ByteString library (the Data.ByteString.Lazy module) has a function

    readFile :: FilePath -> IO ByteString
    

    It looks like it reads the entire file into a giant ByteString in memory. But it doesn't. It's a trick. Actually it just checks that the file exists, and opens it for reading. When you try to use the ByteString, in the background the file invisibly gets read into memory as you process it. That means you can do something like this:

    -- Assumes: import qualified Data.ByteString.Lazy as BIN
    main = do
      bin <- BIN.readFile "in_file"
      BIN.writeFile "out_file" (BIN.map my_function bin)
    

    This will read in_file, apply my_function to every individual byte of the file, and save the result into out_file, automatically doing I/O in large enough chunks to give good performance, but never holding more than one chunk in RAM at once. (The my_function part must have type Word8 -> Word8.) So this is both very simple to write, and should be extremely fast.

    Things get fun if you don't want to read the entire file, or want to access the file in random order, or anything complicated like that. I am told that the pipes library is the thing to look at, but personally I've never used it.
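
    For the simplest kind of random access, ordinary handles may already be enough. The sketch below does not use pipes; it assumes strict Data.ByteString together with hSeek from System.IO, and readRange and sample.bin are hypothetical names.

    ```haskell
    import qualified Data.ByteString as B
    import System.IO

    -- Read len bytes starting at byte offset off (hypothetical helper).
    -- B.hGet is strict, so the data is fully read before the handle closes.
    readRange :: FilePath -> Integer -> Int -> IO B.ByteString
    readRange path off len =
      withBinaryFile path ReadMode $ \h -> do
        hSeek h AbsoluteSeek off
        B.hGet h len

    main :: IO ()
    main = do
      B.writeFile "sample.bin" (B.pack [0 .. 9])   -- create a small test input
      bytes <- readRange "sample.bin" 4 3
      print (B.unpack bytes)                       -- prints [4,5,6]
    ```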

    In the interests of a complete working example:

    module Main where
    
    import Data.Word
    import qualified Data.ByteString.Lazy as BIN
    import Numeric
    
    main = do
      bin <- BIN.readFile "in_file"
      BIN.writeFile "out_file" (BIN.concatMap my_function bin)
    
    my_function :: Word8 -> BIN.ByteString
    my_function b =
      case showHex b "" of
        c1:c2:_ -> BIN.pack [toByte c1, toByte c2]    -- Two hex digits: convert each Char to Word8.
        c2:_    -> BIN.pack [toByte '0', toByte c2]   -- Only one digit: pad with a leading zero.
        []      -> BIN.empty                          -- Not expected: showHex yields at least one digit.
      where
        toByte :: Char -> Word8
        toByte = fromIntegral . fromEnum
    

    Note that because one byte becomes two hex digits, I've used the ByteString version of concatMap, which allows my_function to return a whole ByteString rather than just a single byte.
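
    A variant that keeps the space after each byte, as in the question's output, can be sketched with Data.ByteString.Builder. This assumes bytestring 0.10 or later, which is newer than what ships with GHC 7.4; hexByte, hexDump and the sample file names are hypothetical.

    ```haskell
    import qualified Data.ByteString.Lazy as BIN
    import Data.ByteString.Builder (Builder, char7, hPutBuilder, toLazyByteString, word8HexFixed)
    import Data.Word (Word8)
    import System.IO

    -- Two lowercase hex digits followed by a space, e.g. 0x0a becomes "0a ".
    hexByte :: Word8 -> Builder
    hexByte b = word8HexFixed b `mappend` char7 ' '

    -- Render a whole lazy ByteString as space-separated hex.
    hexDump :: BIN.ByteString -> Builder
    hexDump = BIN.foldr (\b rest -> hexByte b `mappend` rest) mempty

    main :: IO ()
    main = do
      -- Create a small test input so the sketch runs on its own.
      BIN.writeFile "sample.bin" (BIN.pack [0x00, 0x0a, 0xff])
      bin <- BIN.readFile "sample.bin"
      withBinaryFile "sample.hex" WriteMode $ \h ->
        hPutBuilder h (hexDump bin)
    ```

    The Builder writes into the handle's buffer directly, so no intermediate two-byte ByteString is allocated per input byte.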