Search code examples
haskellbytestring

Frequency of characters


I am trying to find frequency of characters in file using Haskell. I want to be able to handle files ~500MB size.

What I've tried till now

  1. It does the job but is a bit slow as it parses the file 256 times

    calculateFrequency :: L.ByteString -> [(Word8, Int64)]
    calculateFrequency f = foldl (\acc x -> (x, L.count x f):acc) [] [255, 254.. 0]
    
  2. I have also tried using Data.Map but the program runs out of memory (in ghc interpreter).

    import qualified Data.ByteString.Lazy as L
    import qualified Data.Map as M
    
    calculateFrequency' :: L.ByteString -> [(Word8, Int64)]
    calculateFrequency' xs = M.toList $ L.foldl' (\m word -> M.insertWith (+) word 1 m) (M.empty) xs
    

Solution

  • Here's an implementation using mutable, unboxed vectors instead of higher level constructs. It also uses conduit for reading the file to avoid lazy I/O.

    import           Control.Monad.IO.Class
    import qualified Data.ByteString             as S
    import           Data.Conduit
    import           Data.Conduit.Binary         as CB
    import qualified Data.Conduit.List           as CL
    import qualified Data.Vector.Unboxed.Mutable as VM
    import           Data.Word                   (Word8)
    
    type Freq = VM.IOVector Int
    
    newFreq :: MonadIO m => m Freq
    newFreq = liftIO $ VM.replicate 256 0
    
    printFreq :: MonadIO m => Freq -> m ()
    printFreq freq =
        liftIO $ mapM_ go [0..255]
      where
        go i = do
            x <- VM.read freq i
            putStrLn $ show i ++ ": " ++ show x
    
    addFreqWord8 :: MonadIO m => Freq -> Word8 -> m ()
    addFreqWord8 f w = liftIO $ do
        let index = fromIntegral w
        oldCount <- VM.read f index
        VM.write f index (oldCount + 1)
    
    addFreqBS :: MonadIO m => Freq -> S.ByteString -> m ()
    addFreqBS f bs =
        loop (S.length bs - 1)
      where
        loop (-1) = return ()
        loop i = do
            addFreqWord8 f (S.index bs i)
            loop (i - 1)
    
    -- | The main entry point.
    main :: IO ()
    main = do
        freq <- newFreq
        runResourceT
            $  sourceFile "random"
            $$ CL.mapM_ (addFreqBS freq)
        printFreq freq
    

    I ran this on 500MB of random data and compared with @josejuan's UArray-based answer:

    • conduit based/mutable vectors: 1.006s
    • UArray: 17.962s

    I think it should be possible to keep much of the elegance of josejuan's high-level approach yet keep the speed of the mutable vector implementation, but I haven't had a chance to try implementing something like that yet. Also, note that with some general purpose helper functions (like Data.ByteString.mapM or Data.Conduit.Binary.mapM) the implementation could be significantly simpler without affecting performance.

    You can play with this implementation on FP Haskell Center as well.

    EDIT: I added one of those missing functions to conduit and cleaned up the code a bit; it now looks like the following:

    import           Control.Monad.Trans.Class   (lift)
    import           Data.ByteString             (ByteString)
    import           Data.Conduit                (Consumer, ($$))
    import qualified Data.Conduit.Binary         as CB
    import qualified Data.Vector.Unboxed         as V
    import qualified Data.Vector.Unboxed.Mutable as VM
    import           System.IO                   (stdin)
    
    freqSink :: Consumer ByteString IO (V.Vector Int)
    freqSink = do
        freq <- lift $ VM.replicate 256 0
        CB.mapM_ $ \w -> do
            let index = fromIntegral w
            oldCount <- VM.read freq index
            VM.write freq index (oldCount + 1)
        lift $ V.freeze freq
    
    main :: IO ()
    main = (CB.sourceHandle stdin $$ freqSink) >>= print
    

    The only difference in functionality is how the frequency is printed.