I am trying to read a large vector of custom data type from a binary file. I tried to use the example given here.
The trouble with the example code is, it uses lists and I want to use vectors. So I adapted that code as below, but it takes very long time (more than a minute, I gave up after that) to read even 1 MB file.
module Main where
import Data.Word
import qualified Data.ByteString.Lazy as BIN
import Data.Binary.Get
import qualified Data.Vector.Unboxed as Vec
main = do
b <- BIN.readFile "dat.bin" -- about 1 MB size file
let v = runGet getPairs (BIN.tail b) -- skip the first byte
putStrLn $ show $ Vec.length v
getPair :: Get (Word8, Word8)
getPair = do
price <- getWord8
qty <- getWord8
return (price, qty)
getPairs :: Get (Vec.Vector (Word8, Word8))
getPairs = do
empty <- isEmpty
if empty
then return Vec.empty
else do pair <- getPair
pairs <- getPairs
return (Vec.cons pair pairs) -- is it slow because V.cons is O(n)?
When I tried to run it with ghc --make -O2 pairs.hs
I got the error Stack space overflow: current size ...
How to efficiently read pairs of values from bytestring into vector?
Again, I wish to get complete working code not just only pointers to Haskell site or RWH nor a just list of function/module names.
Here are a couple of examples of creating Vectors from files. They are not the most efficient, but both run in just a couple of seconds in ghci.
module Main where
import qualified Data.ByteString.Lazy as BIN
import qualified Data.ByteString as BS
import qualified Data.Vector.Unboxed as Vec
import System.IO
import System.Posix
getFileSize :: String -> IO Int
getFileSize path = do
stat <- getFileStatus path
return (fromEnum $ fileSize stat)
readVector1 path = do
size <- getFileSize path
withBinaryFile path ReadMode $ \h -> do
-- can also use: size <- hFileSize h
let go _ = do bs <- BS.hGet h 2
return (BS.index bs 0, BS.index bs 1)
Vec.generateM (div size 2) go
pairs (a:b:rest) = (a,b) : pairs rest
pairs _ = []
readVector2 path = do
contents <- BIN.readFile path
-- unfoldr :: Unbox a => (b -> Maybe (a, b)) -> b -> Vector a
let v = Vec.unfoldr go (pairs $ BIN.unpack contents)
where go [] = Nothing
go (p:ps) = Just (p, ps)
return v
main = do
v <- readVector1 "rand" -- large file
print $ Vec.length v
v <- readVector2 "rand"
print $ Vec.length v
A third alternative:
readVector3 path = do
contents <- BS.readFile path
let size = BS.length contents
v = Vec.generate (div (fromIntegral size) 2) go
where go i = let a = BS.index contents (2*i)
b = BS.index contents (2*i+1)
in (a,b)
return v
This one turns out to be the fastest of the three.