Search code examples
performancehaskellvectorbytestring

haskell reading a vector of pairs from bytestring very slow, how to make it faster?


I am trying to read a large vector of custom data type from a binary file. I tried to use the example given here.

The trouble with the example code is, it uses lists and I want to use vectors. So I adapted that code as below, but it takes very long time (more than a minute, I gave up after that) to read even 1 MB file.

module Main where

import Data.Word
import qualified Data.ByteString.Lazy as BIN
import Data.Binary.Get
import qualified Data.Vector.Unboxed as Vec

main = do
  b <- BIN.readFile "dat.bin" -- about 1 MB size file
  let v = runGet getPairs (BIN.tail b) -- skip the first byte
  putStrLn $ show $ Vec.length v

getPair :: Get (Word8, Word8)
getPair = do
  price <- getWord8
  qty <- getWord8
  return (price, qty)

getPairs :: Get (Vec.Vector (Word8, Word8))
getPairs = do
 empty <- isEmpty
 if empty
   then return Vec.empty
   else do pair  <- getPair
           pairs <- getPairs
           return (Vec.cons pair pairs) -- is it slow because V.cons is O(n)?

When I tried to run it with ghc --make -O2 pairs.hs I got the error Stack space overflow: current size ...

How to efficiently read pairs of values from bytestring into vector?

Again, I wish to get complete working code not just only pointers to Haskell site or RWH nor a just list of function/module names.


Solution

  • Here are a couple of examples of creating Vectors from files. They are not the most efficient, but both run in just a couple of seconds in ghci.

    module Main where
    
    import qualified Data.ByteString.Lazy as BIN
    import qualified Data.ByteString as BS
    import qualified Data.Vector.Unboxed as Vec
    import System.IO
    import System.Posix
    
    getFileSize :: String -> IO Int
    getFileSize path = do
        stat <- getFileStatus path
        return (fromEnum $ fileSize stat)
    
    readVector1 path = do
      size <- getFileSize path
      withBinaryFile path ReadMode $ \h -> do
        -- can also use: size <- hFileSize h
        let go _ = do bs <- BS.hGet h 2 
                      return (BS.index bs 0, BS.index bs 1)
        Vec.generateM (div size 2) go
    
    pairs (a:b:rest) = (a,b) : pairs rest
    pairs _          = []
    
    readVector2 path = do
      contents <- BIN.readFile path
      -- unfoldr :: Unbox a => (b -> Maybe (a, b)) -> b -> Vector a
      let v = Vec.unfoldr go (pairs $ BIN.unpack contents)
            where go [] = Nothing
                  go (p:ps) = Just (p, ps)
      return v
    
    main = do
      v <- readVector1 "rand" -- large file
      print $ Vec.length v
      v <- readVector2 "rand"
      print $ Vec.length v
    

    A third alternative:

    readVector3 path = do
      contents <- BS.readFile path
      let size = BS.length contents
          v = Vec.generate (div (fromIntegral size) 2) go
                where go i = let a = BS.index contents (2*i)
                                 b = BS.index contents (2*i+1)
                             in (a,b)
      return v
    

    This one turns out to be the fastest of the three.