haskell bit-manipulation bit-fields lfsr

Efficient bit-fiddling in a LFSR implementation

Although I have a good LSFR C implementation I thought I'd try the same in Haskell - just to see how it goes. What I came up with, so far, is two orders of magnitude slower than the C implementation, which begs the question: How can the performance be improved? Clearly, the bit-fiddling operations are the bottleneck, and the profiler confirms this.

Here's the baseline Haskell code using lists and Data.Bits:

import           Control.Monad      (when)
import           Data.Bits          (Bits, shift, testBit, xor, (.&.), (.|.))
import           System.Environment (getArgs)
import           System.Exit        (exitFailure, exitSuccess)

tap :: [[Int]]
tap = [
    [],            [],            [],            [3, 2],
    [4, 3],        [5, 3],        [6, 5],        [7, 6],
    [8, 6, 5, 4],  [9, 5],        [10, 7],       [11, 9],
    [12, 6, 4, 1], [13, 4, 3, 1], [14, 5, 3, 1], [15, 14],
    [16,15,13,4],  [17, 14],      [18, 11],      [19, 6, 2, 1],
    [20, 17],      [21, 19],      [22, 21],      [23, 18],
    [24,23,22,17], [25, 22],      [26, 6, 2, 1], [27, 5, 2, 1],
    [28, 25],      [29, 27],      [30, 6, 4, 1], [31, 28],
    [32,22,2,1],   [33,20],       [34,27,2,1],   [35,33],
    [36,25],       [37,5,4,3,2,1],[38,6,5,1],    [39,35],
    [40,38,21,19], [41,38],       [42,41,20,19], [43,42,38,37],
    [44,43,18,17], [45,44,42,41], [46,45,26,25], [47,42],
    [48,47,21,20], [49,40],       [50,49,24,23], [51,50,36,35],
    [52,49],       [53,52,38,37], [54,53,18,17], [55,31],
    [56,55,35,34], [57,50],       [58,39],       [59,58,38,37],
    [60,59],       [61,60,46,45], [62,61,6,5],   [63,62]        ]

xor' :: [Bool] -> Bool
xor' = foldr xor False

mask ::  (Num a, Bits a) => Int -> a
mask len = shift 1 len - 1

advance :: Int -> [Int] -> Int -> Int
advance len tap lfsr
    | d0        = shifted
    | otherwise = shifted .|. 1
    where
        shifted = shift lfsr 1 .&. mask len
        d0 = xor' $ map (testBit lfsr) tap'
        tap' = map (subtract 1) tap

main :: IO ()
main = do
    args <- getArgs
    when (null args) $ fail "Usage: lsfr <number-of-bits>"
    let len = read $ head args
    when (len < 8) $ fail "No need for LFSR"
    let out = last $ take (shift 1 len) $ iterate (advance len (tap!!len)) 0
    if out == 0 then do
        putStr "OK\n"
        exitSuccess
    else do
        putStr "FAIL\n"
        exitFailure

Basically it tests whether the LSFR defined in tap :: [[Int]] for any given bit-length is of maximum-length. (More precisely, it just checks whether the LSFR reaches the initial state (zero) after 2ⁿ iterations.)

According to the profiler the most costly line is the feedback bit d0 = xor' $ map (testBit lfsr) tap'.

What I've tried so far:

use Data.Array: Attempt abandoned because there's no foldl/r
use Data.Vector: Slightly faster than the baseline

The compiler options I use are: -O2, LTS Haskell 8.12 (GHC-8.0.2).

The reference C++ program can be found on gist.github.com.

The Haskell code can't be expected (?) to run as fast as the C code, but two orders of magnitude is too much, there must be a better way to do the bit-fiddling.

Update: Results of applying the optimisations suggested in the answers

The reference C++ program with input 28, compiled with LLVM 8.0.0, runs in 0.67s on my machine (the same with clang 3.7 is marginally slower, 0.68s)
The baseline Haskell code runs about 100x slower (because of the space inefficiency don't try it with inputs larger than 25)
With the rewrite of @Thomas M. DuBuisson, still using the default GHC backend, the execution time goes down to 5.2s
With the rewrite of @Thomas M. DuBuisson, now using the LLVM backend (GHC option -O2 -fllvm), the execution time goes down to 1.7s
- Using GHC option -O2 -fllvm -optlc -mcpu=native brings this to 0.73s
Replacing iterate with iterate' of @cirdec makes no difference when Thomas' code is used (both with the default 'native' backend and LLVM). However, it does make a difference when the baseline code is used.

So, we've come from 100x to 8x to 1.09x, i.e. only 9% slower than C!

Note The LLVM backend to GHC 8.0.2 requires LLVM 3.7. On Mac OS X this means installing this version with brew and then symlinking opt and llc. See 7.10. GHC Backends.

Solution

Up Front Matters

For starters, I'm using GHC 8.0.1 on an Intel I5 ~2.5GHz, linux x86-64.

First Draft: Oh No! The slows!

Your starting code with parameter 25 runs:

% ghc -O2 orig.hs && time ./orig 25
[1 of 1] Compiling Main             ( orig.hs, orig.o )
Linking orig ...
OK
./orig 25  7.25s user 0.50s system 99% cpu 7.748 total

So the time to beat is 77ms - two orders of magnitude better than this Haskell code. Lets dive in.

Issue 1: Shifty Code

I found a couple of oddities with the code. First was the use of shift in high performance code. Shift supports both left and right shift and to do so it requires a branch. Lets kill that with more readable powers of two and such (shift 1 x ~> 2^x and shift x 1 ~> 2*x):

% ghc -O2 noShift.hs && time ./noShift 25
[1 of 1] Compiling Main             ( noShift.hs, noShift.o )
Linking noShift ...
OK
./noShift 25  0.64s user 0.00s system 99% cpu 0.637 total

(As you noted in the comments: Yes, this bears investigation. It might be that some oddity of the prior code was preventing a rewrite rule from firing and, as a result, much worse code resulted)

Issue 2: Lists Of Bits? Int operations save the day!

One change, one order of magnitude. Yay. What else? Well you have this awkward list of bit locations you're tapping that just seems like its begging for inefficiency and/or leans on fragile optimizations. At this point I'll note that hard-coding any one selection from that list results in really good performance (such as testBit lsfr 24 `xor` testBit lsfr 21) but we want a more general fast solution.

I propose we compute the mask of all the tap locations then do a one-instruction pop count. To do this we only need a single Int passed in to advance instead of a whole list. The popcount instruction requires good assembly generation which requires llvm and probably -optlc-mcpu=native or another instruction set selection that is non-pessimistic.

This step gives us pc below. I've folded in the guard-removal of advance that was mentioned in the comments:

let tp = sum $ map ((2^) . subtract 1) (tap !! len)
    pc lfsr = fromEnum (even (popCount (lfsr .&. tp)))
    mask = 2^len - 1
    advance' :: Int -> Int
    advance' lfsr = (2*lfsr .&. mask) .|. pc lfsr 
    out :: Int
    out = last $ take (2^len) $ iterate advance' 0

Our resulting performance is:

% ghc -O2 so.hs -fforce-recomp -fllvm -optlc-mcpu=native && time ./so 25      
[1 of 1] Compiling Main             ( so.hs, so.o )
Linking so ...
OK
./so 25  0.06s user 0.00s system 96% cpu 0.067 total

That's over two orders of magnitude from start to finish, so hopefully it matches your C. Finally, in deployed code it is actually really common to have Haskell packages with C bindings but this is often an educational exercise so I hope you had fun.

Edit: The now-available C++ code takes my system 0.10 (g++ -O3) and 0.12 (clang++ -O3 -march=native) seconds, so it seems we've beat our mark by a fair bit.