Search code examples
haskellbytestring

ByteString histogram


Given a (strict) ByteString, what is the most efficient way to count how many of each possible byte it contains?

I see that sort is supposed to implemented as a counting sort - but there doesn't appear to be a way to access the associated counts. I also see that there's a count function, which counts the number of times a given byte appears. This gives me the following options:

  • map (\ b -> count b str) [0x00 .. 0xFF]
  • map length . group . sort
  • Something with fold* and an IntMap of byte frequencies.

Which is likely to give me best performance?


Solution

  • The basic idea of dflemstr is of course right, but since you want the best performance, you need to use unchecked access to the ByteString as well as to the counting array, like

    import Control.Monad.ST
    import Data.Array.ST
    import Data.Array.Base (unsafeRead, unsafeWrite)
    import Data.Array.Unboxed
    
    import Data.Word
    
    import Data.ByteString (ByteString)
    import qualified Data.ByteString as BS
    import Data.ByteString.Unsafe
    
    histogram :: ByteString -> UArray Word8 Int
    histogram bs = runSTUArray $ do
        hist <- newArray (0, 255) 0
        let l = BS.length bs
            update b = do
                o <- unsafeRead hist b
                unsafeWrite hist b (o+1)
            loop i
                | i < 0     = return hist
                | otherwise = do
                    update $ fromIntegral (bs `unsafeIndex` i)
                    loop (i-1)
        loop (l-1)
    

    That makes a considerable difference, according to criterion (building a histogram of a 200000 long ByteString):

    warming up
    estimating clock resolution...
    mean is 1.667687 us (320001 iterations)
    found 3078 outliers among 319999 samples (1.0%)
      1947 (0.6%) high severe
    estimating cost of a clock call...
    mean is 40.43765 ns (14 iterations)
    
    benchmarking dflemstr
    mean: 21.42852 ms, lb 21.05213 ms, ub 21.77954 ms, ci 0.950
    std dev: 1.873897 ms, lb 1.719565 ms, ub 2.038779 ms, ci 0.950
    variance introduced by outliers: 74.820%
    variance is severely inflated by outliers
    
    benchmarking unsafeIndex
    mean: 312.6447 us, lb 304.3425 us, ub 321.0795 us, ci 0.950
    std dev: 42.86886 us, lb 39.64363 us, ub 46.52899 us, ci 0.950
    variance introduced by outliers: 88.342%
    variance is severely inflated by outliers
    

    (I changed dflemstr's code to also use runSTUArray and return a UArray Word8 Int to have uiform return values, that doesn't make a big difference in running time, though.)