Search code examples
cperformancehaskellbytestring

Faster ByteString construction tips


I am new to Haskell, and I've stuck with efficiency issues..

Task is: Build CSV file from 4GB text file where columns have constant size

column sizes are known, for example [col1: 4 chars wide, col2: 2 chars wide, etc...
file can only contain [A-Z0-9] ASCII chars so there is no point in escaping cells

I have: 

$ cat example.txt 
AAAABBCCCC...
AAA1B1CCC1...
... (72 chars per line, usually 50 mln lines)


I need: 
$ cat done.csv
AAAA,BB,CCCC, ...
AAA1,B1,CCC1, ...
...

this is my fastest code in Haskell, takes about 2 minutes to process entire 4GB file.
I need at maximum 30 seconds

import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as U
import Data.ByteString.Lazy.Builder
import Data.Monoid
import Data.List

col_sizes = intercalate [1] $ map (`replicate` 0) cs
  where
    cs = [4, 4, 4, 3, 5, 1, 1, 3, 3, 3, 3, 3, 3, 10, 3, 1, 1, 1, 2, 3, 10]

sp = char8 ',' -- column separator
nl = char8 '\n'

separator !cs !cl !xs !xl !ci !xi
  | c  == 1   = ps
  | xi == xl  = mempty -- at the end of bytestring, end recursion
  | cl == ci  = pr
  | otherwise = pc
  where
    c  = U.unsafeIndex cs ci         -- get column separation indicator
    w  = word8 . U.unsafeIndex xs    -- get char from BS at position
    p  = separator cs cl xs xl       -- partial recursion call
    pr = nl   <> p  0       (xi + 1) -- end of row, put '\n', reset counter, recur
    ps = sp   <> p (ci + 1)  xi      -- end of column, put column separator, recur
    pc = w xi <> p (ci + 1) (xi + 1) -- in the middle of column, copy byte, recur


main = do
  contents <- B.getContents
  BL.putStr . toLazyByteString $ init_sep sp_after_char contents


init_sep cs xs = separator cs (l cs) xs (l xs) 0 0
  where l = fromIntegral . B.length

sp_after_char = B.pack col_sizes

And this is my implementation in C http://pastebin.com/Kjz3Mugs (to long to paste it here...)
takes about 5 seconds to process the same file

So my Haskell code is approx. 20x slower.

Because Haskell ByteString filter and map are faster than my implementations in C,
(both take less than 2s to process same file doing some simple modifications)
I hope there is something wrong in my Haskell code and I won't be forced to use C.

UPDATE: test data generator is available here http://pastebin.com/aJ3RW3jG

On production, data is piped from one binary to other so there is no hard drive IO

to test solutions I used SSD drive but I think that Ext4 cached that file anyway in RAM

time cat test.txt > /dev/null
cat test.txt > /dev/null  0,00s user 0,35s system 99% cpu 0,353 total

Vanilla generator:

time ./data_builder | head -50000000 > /dev/null
./data_builder  0,02s user 1,09s system 30% cpu 3,709 total
head -50000000 > /dev/null  2,95s user 0,76s system 99% cpu 3,708 total

my C solution:

time ./tocsvc < test.txt > /dev/null 
./tocsvc < test.txt > /dev/null  5,35s user 0,35s system 100% cpu 5,689 total

with generator

time ./data_builder | head -50000000 | ./tocsvc > /dev/null
./data_builder  0,02s user 1,18s system 18% cpu 6,460 total
head -50000000  3,15s user 1,19s system 67% cpu 6,459 total
./tocsvc > /dev/null  5,81s user 0,55s system 98% cpu 6,459 total

@GabrielGonzalez Haskell solution

time ./tocsvh1 < test.txt > /dev/null 
./tocsv < test.txt > /dev/null  19,56s user 0,41s system 100% cpu 19,950 total

with generator

time ./data_builder | head -50000000 | ./tocsvh1 > /dev/null 
./data_builder  0,11s user 3,04s system 7% cpu 41,320 total
head -50000000  7,29s user 3,56s system 26% cpu 41,319 total
./tocsvh2 > /dev/null  33,01s user 2,42s system 85% cpu 41,327 total

my Haskell solution

time ./tocsvh2 < test.txt > /dev/null 
./tocsvh2 < test.txt > /dev/null  128,63s user 2,95s system 100% cpu 2:11,45 total

with generator

time ./data_builder | head -50000000 | ./tocsvh2 > /dev/null 
./data_builder  0,02s user 1,26s system 28% cpu 4,526 total
head -50000000  3,17s user 1,33s system 99% cpu 4,524 total
./tocsvh2 > /dev/null  129,95s user 3,33s system 98% cpu 2:14,75 total

@LukeTaylor solution

time ./tocsvh3 < test.txt > /dev/null 
./tocsv < test.txt > /dev/null  324,38s user 4,13s system 100% cpu 5:28,18 total

with generator

time ./data_builder | head -50000000 | ./tocsvh3 > /dev/null 
./data_builder  0,43s user 4,46s system 1% cpu 5:30,34 total
head -50000000  5,20s user 2,82s system 2% cpu 5:30,34 total
./tocsv > /dev/null  329,08s user 4,21s system 100% cpu 5:32,96 total

Solution

  • I was able to get within a factor of 3 of C just by using raw pointer operations:

    import Control.Monad (unless, when, void)
    import Foreign.Safe hiding (void)
    import System.IO
    import Foreign.C.Types
    
    bufInSize :: Int
    bufInSize = n * (1024 * 1024 `div` n) where n = sum sizes0 + 1
    
    bufOutSize :: Int
    bufOutSize = n * (1024 * 1024 `div` n) where n = sum sizes0 + length sizes0
    
    sizes0 :: [Int]
    sizes0 = [4, 4, 4, 3, 5, 1, 1, 3, 3, 3, 3, 3, 3, 10, 3, 1, 1, 1, 2, 3, 10]
    
    -- I also tried using the C memset using the FFI, but got the same speed
    memcpy :: Ptr Word8 -> Ptr Word8 -> Int -> IO ()
    memcpy dst src n = when (n > 0) $ do
        x <- peek src
        poke dst x
        memcpy (dst `plusPtr` 1) (src `plusPtr` 1) (n - 1)
    
    main = do
        allocaArray bufInSize  $ \bufIn0  -> do
        allocaArray bufOutSize $ \bufOut0 -> do
        with (44 :: Word8)  $ \cm -> do
            let loop bufIn bufOut sizes suffixIn suffixOut = do
                    let (bytesIn, bytesOut, sizes', copy) = case sizes of
                            []     -> (1, 1    , sizes0,    memcpy bufOut bufIn 1)
                            [s]    -> (s, s    , []    ,    memcpy bufOut bufIn s)
                            s:izes -> (s, s + 1, izes  , do
                                memcpy  bufOut              bufIn s
                                memcpy (bufOut `plusPtr` s) cm    1 )
    
                    if suffixIn < bytesIn
                    then do
                        eof <- hIsEOF stdin
                        if eof
                            then hPutBuf stdout bufOut0 (bufOut `minusPtr` bufOut0)
                            else do
                                suffixIn' <- hGetBuf stdin bufIn0 bufInSize
                                loop bufIn0 bufOut sizes suffixIn' suffixOut
                    else if suffixOut < bytesOut
                    then do
                        hPutBuf stdout bufOut0 (bufOut `minusPtr` bufOut0)
                        loop bufIn bufOut0 sizes suffixIn bufOutSize
                    else do
                        copy
                        loop (bufIn  `plusPtr` bytesIn )
                             (bufOut `plusPtr` bytesOut)
                             sizes'
                             (suffixIn  - bytesIn )
                             (suffixOut - bytesOut)
            loop bufIn0 bufOut0 sizes0 0 bufOutSize
    

    Here are some rough time-based measurements using an input file of 1000000 lines:

    $ # The C Version
    $ time ./a.out < in.dat > out.dat
    
    real    0m0.189s
    user    0m0.116s
    sys 0m0.068s
    $ # The Haskell version
    $ time ./csv < in.dat > out2.dat
    
    real    0m0.536s
    user    0m0.428s
    sys 0m0.104s
    $ diff out.dat out2.dat
    $ # No difference