ByteString histogram

Given a (strict) ByteString, what is the most efficient way to count how many of each possible byte it contains?

I see that sort is supposed to implemented as a counting sort - but there doesn't appear to be a way to access the associated counts. I also see that there's a count function, which counts the number of times a given byte appears. This gives me the following options:

map (\ b -> count b str) [0x00 .. 0xFF]
map length . group . sort
Something with fold* and an IntMap of byte frequencies.

Which is likely to give me best performance?

Solution

The basic idea of dflemstr is of course right, but since you want the best performance, you need to use unchecked access to the ByteString as well as to the counting array, like

import Control.Monad.ST
import Data.Array.ST
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.Unboxed

import Data.Word

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.ByteString.Unsafe

histogram :: ByteString -> UArray Word8 Int
histogram bs = runSTUArray $ do
    hist <- newArray (0, 255) 0
    let l = BS.length bs
        update b = do
            o <- unsafeRead hist b
            unsafeWrite hist b (o+1)
        loop i
            | i < 0     = return hist
            | otherwise = do
                update $ fromIntegral (bs `unsafeIndex` i)
                loop (i-1)
    loop (l-1)

That makes a considerable difference, according to criterion (building a histogram of a 200000 long ByteString):

warming up
estimating clock resolution...
mean is 1.667687 us (320001 iterations)
found 3078 outliers among 319999 samples (1.0%)
  1947 (0.6%) high severe
estimating cost of a clock call...
mean is 40.43765 ns (14 iterations)

benchmarking dflemstr
mean: 21.42852 ms, lb 21.05213 ms, ub 21.77954 ms, ci 0.950
std dev: 1.873897 ms, lb 1.719565 ms, ub 2.038779 ms, ci 0.950
variance introduced by outliers: 74.820%
variance is severely inflated by outliers

benchmarking unsafeIndex
mean: 312.6447 us, lb 304.3425 us, ub 321.0795 us, ci 0.950
std dev: 42.86886 us, lb 39.64363 us, ub 46.52899 us, ci 0.950
variance introduced by outliers: 88.342%
variance is severely inflated by outliers

(I changed dflemstr's code to also use runSTUArray and return a UArray Word8 Int to have uiform return values, that doesn't make a big difference in running time, though.)