I am new to Haskell, and I've stuck with efficiency issues..
Task is: Build CSV file from 4GB text file where columns have constant size
column sizes are known, for example [col1: 4 chars wide, col2: 2 chars wide, etc...
file can only contain [A-Z0-9] ASCII chars so there is no point in escaping cells
I have:
$ cat example.txt
AAAABBCCCC...
AAA1B1CCC1...
... (72 chars per line, usually 50 mln lines)
I need:
$ cat done.csv
AAAA,BB,CCCC, ...
AAA1,B1,CCC1, ...
...
this is my fastest code in Haskell,
takes about 2 minutes to process entire 4GB file.
I need at maximum 30 seconds
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as U
import Data.ByteString.Lazy.Builder
import Data.Monoid
import Data.List
col_sizes = intercalate [1] $ map (`replicate` 0) cs
where
cs = [4, 4, 4, 3, 5, 1, 1, 3, 3, 3, 3, 3, 3, 10, 3, 1, 1, 1, 2, 3, 10]
sp = char8 ',' -- column separator
nl = char8 '\n'
separator !cs !cl !xs !xl !ci !xi
| c == 1 = ps
| xi == xl = mempty -- at the end of bytestring, end recursion
| cl == ci = pr
| otherwise = pc
where
c = U.unsafeIndex cs ci -- get column separation indicator
w = word8 . U.unsafeIndex xs -- get char from BS at position
p = separator cs cl xs xl -- partial recursion call
pr = nl <> p 0 (xi + 1) -- end of row, put '\n', reset counter, recur
ps = sp <> p (ci + 1) xi -- end of column, put column separator, recur
pc = w xi <> p (ci + 1) (xi + 1) -- in the middle of column, copy byte, recur
main = do
contents <- B.getContents
BL.putStr . toLazyByteString $ init_sep sp_after_char contents
init_sep cs xs = separator cs (l cs) xs (l xs) 0 0
where l = fromIntegral . B.length
sp_after_char = B.pack col_sizes
And this is my implementation in C http://pastebin.com/Kjz3Mugs
(to long to paste it here...)
takes about 5 seconds to process the same file
So my Haskell code is approx. 20x slower.
Because Haskell ByteString filter and map are faster than my implementations in C,
(both take less than 2s to process same file doing some simple modifications)
I hope there is something wrong in my Haskell code and I won't be forced to use C.
time cat test.txt > /dev/null
cat test.txt > /dev/null 0,00s user 0,35s system 99% cpu 0,353 total
Vanilla generator:
time ./data_builder | head -50000000 > /dev/null
./data_builder 0,02s user 1,09s system 30% cpu 3,709 total
head -50000000 > /dev/null 2,95s user 0,76s system 99% cpu 3,708 total
time ./tocsvc < test.txt > /dev/null
./tocsvc < test.txt > /dev/null 5,35s user 0,35s system 100% cpu 5,689 total
with generator
time ./data_builder | head -50000000 | ./tocsvc > /dev/null
./data_builder 0,02s user 1,18s system 18% cpu 6,460 total
head -50000000 3,15s user 1,19s system 67% cpu 6,459 total
./tocsvc > /dev/null 5,81s user 0,55s system 98% cpu 6,459 total
time ./tocsvh1 < test.txt > /dev/null
./tocsv < test.txt > /dev/null 19,56s user 0,41s system 100% cpu 19,950 total
with generator
time ./data_builder | head -50000000 | ./tocsvh1 > /dev/null
./data_builder 0,11s user 3,04s system 7% cpu 41,320 total
head -50000000 7,29s user 3,56s system 26% cpu 41,319 total
./tocsvh2 > /dev/null 33,01s user 2,42s system 85% cpu 41,327 total
time ./tocsvh2 < test.txt > /dev/null
./tocsvh2 < test.txt > /dev/null 128,63s user 2,95s system 100% cpu 2:11,45 total
with generator
time ./data_builder | head -50000000 | ./tocsvh2 > /dev/null
./data_builder 0,02s user 1,26s system 28% cpu 4,526 total
head -50000000 3,17s user 1,33s system 99% cpu 4,524 total
./tocsvh2 > /dev/null 129,95s user 3,33s system 98% cpu 2:14,75 total
time ./tocsvh3 < test.txt > /dev/null
./tocsv < test.txt > /dev/null 324,38s user 4,13s system 100% cpu 5:28,18 total
with generator
time ./data_builder | head -50000000 | ./tocsvh3 > /dev/null
./data_builder 0,43s user 4,46s system 1% cpu 5:30,34 total
head -50000000 5,20s user 2,82s system 2% cpu 5:30,34 total
./tocsv > /dev/null 329,08s user 4,21s system 100% cpu 5:32,96 total
I was able to get within a factor of 3 of C just by using raw pointer operations:
import Control.Monad (unless, when, void)
import Foreign.Safe hiding (void)
import System.IO
import Foreign.C.Types
bufInSize :: Int
bufInSize = n * (1024 * 1024 `div` n) where n = sum sizes0 + 1
bufOutSize :: Int
bufOutSize = n * (1024 * 1024 `div` n) where n = sum sizes0 + length sizes0
sizes0 :: [Int]
sizes0 = [4, 4, 4, 3, 5, 1, 1, 3, 3, 3, 3, 3, 3, 10, 3, 1, 1, 1, 2, 3, 10]
-- I also tried using the C memset using the FFI, but got the same speed
memcpy :: Ptr Word8 -> Ptr Word8 -> Int -> IO ()
memcpy dst src n = when (n > 0) $ do
x <- peek src
poke dst x
memcpy (dst `plusPtr` 1) (src `plusPtr` 1) (n - 1)
main = do
allocaArray bufInSize $ \bufIn0 -> do
allocaArray bufOutSize $ \bufOut0 -> do
with (44 :: Word8) $ \cm -> do
let loop bufIn bufOut sizes suffixIn suffixOut = do
let (bytesIn, bytesOut, sizes', copy) = case sizes of
[] -> (1, 1 , sizes0, memcpy bufOut bufIn 1)
[s] -> (s, s , [] , memcpy bufOut bufIn s)
s:izes -> (s, s + 1, izes , do
memcpy bufOut bufIn s
memcpy (bufOut `plusPtr` s) cm 1 )
if suffixIn < bytesIn
then do
eof <- hIsEOF stdin
if eof
then hPutBuf stdout bufOut0 (bufOut `minusPtr` bufOut0)
else do
suffixIn' <- hGetBuf stdin bufIn0 bufInSize
loop bufIn0 bufOut sizes suffixIn' suffixOut
else if suffixOut < bytesOut
then do
hPutBuf stdout bufOut0 (bufOut `minusPtr` bufOut0)
loop bufIn bufOut0 sizes suffixIn bufOutSize
else do
copy
loop (bufIn `plusPtr` bytesIn )
(bufOut `plusPtr` bytesOut)
sizes'
(suffixIn - bytesIn )
(suffixOut - bytesOut)
loop bufIn0 bufOut0 sizes0 0 bufOutSize
Here are some rough time
-based measurements using an input file of 1000000 lines:
$ # The C Version
$ time ./a.out < in.dat > out.dat
real 0m0.189s
user 0m0.116s
sys 0m0.068s
$ # The Haskell version
$ time ./csv < in.dat > out2.dat
real 0m0.536s
user 0m0.428s
sys 0m0.104s
$ diff out.dat out2.dat
$ # No difference