Tags: performance, haskell, random, montecarlo

Why is this Monte Carlo Haskell program so slow?


update:

My compilation command is ghc -O2 Montecarlo.hs. My random version is random-1.1, my GHC version is 8.6.4, and my system is macOS Big Sur 11.1 (Intel chip). The command I used to test the speed is time ./Montecarlo 10000000, and it reports real 0m17.463s, user 0m17.176s, sys 0m0.162s.


The following is a Haskell program that uses the Monte Carlo method to calculate pi. However, with an input of 10 million, the program takes about 20 seconds to run, while a C program written with the same logic takes only 0.206 seconds. Why is this, and how can I speed it up? Thank you all.

This is the Haskell version:

import System.Random
import Data.List
import System.Environment

montecarloCircle :: Int -> Double
montecarloCircle x
   = 4*fromIntegral
        (foldl' (\x y -> if y <= 1 then x+1 else x) 0
           $ zipWith (\x y -> (x**2 + y**2))
                     (take x $ randomRs (-1.0,1) (mkStdGen 1) :: [Double])
                     (take x $ randomRs (-1.0,1) (mkStdGen 2) :: [Double]) )
        / fromIntegral x

main = do
    num <- getArgs
    let n = read (num !! 0)
    print $ montecarloCircle n

This is the C version:

#include <stdio.h>
#include <math.h>
#include <time.h>
#include <stdlib.h>

#define N       10000000

#define int_t   long // the type of N and M

// Rand from 0.0 to 1.0
double rand01()
{
    return rand()*1.0/RAND_MAX;
}

int main()
{
        srand((unsigned)time(NULL));

        double x,y;
        int_t M = 0;

        for (int_t i = 0;i < N;i++)
        {
                x = rand01();
                y = rand01();
                if (x*x+y*y<1) M++;
        }
        double pi = (double)4*M/N;
        printf("%lf\n", pi);
}

Solution

  • It is sort of expected that on numerical applications, Haskell code tends to be, say, 2X to 5X slower than equivalent code written in classic imperative languages such as C/C++/Fortran. However, Haskell code being 100 times slower is very unexpected!

    Sanity check: reproducing the result

    Using a somewhat oldish notebook with an Intel Celeron N2840 64-bit CPU, running Linux kernel 5.3, GLIBC 2.28, GCC 8.3, GHC 8.2.2, and random-1.1, we get the timings:

    C code,        gcc -O3:    950 msec
    Haskell code,  ghc -O3:    104 sec
    

    So indeed, with these configurations, the Haskell code runs about 100 times slower than the C code.

    Why is that ?

    A first remark is that the π-computing arithmetic atop the random number generation looks quite simple, hence the runtimes are probably dominated by the generation of the 2×10 million pseudo-random numbers. Furthermore, we have every reason to expect that Haskell and C are using completely unrelated random number generation algorithms.

    So, rather than comparing Haskell with C, we could be comparing the random number generation algorithms these two languages happen to be using, as well as their respective implementations.

    In C, the language standard does not specify which algorithm rand() is supposed to use, but implementations tend to pick old, simple, and fast ones.

    On the other hand, in Haskell, we have traditionally seen a lot of complaints about the poor efficiency of StdGen, like here back in 2006:

    https://gitlab.haskell.org/ghc/ghc/-/issues/427

    where one of the leading Haskell luminaries mentions that StdGen could possibly be made 30 times faster.

    Fortunately, help is already on the way. This recent blog post explains how the new Haskell random-1.2 package solves StdGen's lack of speed by using a completely different algorithm known as splitmix.
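
    For intuition, here is a base-only sketch of the SplitMix64 step (the published algorithm that splitmix builds on, not the actual random-1.2 source): the state is a single Word64 advanced by a fixed odd increment, and each output is a cheap bit-mix of the state.

```haskell
import Data.Bits (xor, shiftR)
import Data.Word (Word64)

-- Fixed increment ("golden gamma") added to the state at every step.
goldenGamma :: Word64
goldenGamma = 0x9E3779B97F4A7C15

-- The output mixing function: xor-shifts and two multiplications.
-- Word64 arithmetic wraps modulo 2^64, exactly as the algorithm requires.
mix64 :: Word64 -> Word64
mix64 z0 =
    let z1 = (z0 `xor` (z0 `shiftR` 30)) * 0xBF58476D1CE4E5B9
        z2 = (z1 `xor` (z1 `shiftR` 27)) * 0x94D049BB133111EB
    in  z2 `xor` (z2 `shiftR` 31)

-- One generator step: returns a 64-bit output and the new state.
nextWord64 :: Word64 -> (Word64, Word64)
nextWord64 s = let s' = s + goldenGamma in (mix64 s', s')
```

    No division, no multi-word state: just an add, a few shifts, and two multiplies per output, which is why it is so much faster than random-1.1's StdGen.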

    The field of pseudo-random number generation (PRNG) is a rather active one. Algorithms routinely get obsoleted by newer and better ones. For the sake of perspective, here is a relatively recent (2018) review paper on the subject.

    Moving to more recent, better Haskell components:

    Using another, slightly more powerful machine with an Intel Core i5-4440 3GHz 64-bit CPU, running Linux kernel 5.10, GLIBC 2.32, GCC 10.2, and critically Haskell package random-1.2:

    C code,        gcc -O3:    164 msec
    Haskell code,  ghc -O3:    985 msec
    

    So Haskell is now “just” 6 times slower instead of 100 times.

    And we still have to address the injustice of Haskell having to use x**2 + y**2 versus C getting x*x + y*y, a detail which did not really matter before with random-1.1. This gives us 379 msec! So we are back into the usual 2X-5X ballpark for Haskell-to-C speed comparisons.
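
    The change itself is one line; on Double, (**) goes through the floating-point pow routine, while (*) is a single machine multiply:

```haskell
-- Squaring by multiplication versus (**): on Double, x**2 compiles to a
-- call to the pow routine, while x*x is a single machine multiply.
normPow, normMul :: Double -> Double -> Double
normPow x y = x**2 + y**2   -- two pow calls per sample
normMul x y = x*x  + y*y    -- two multiplies per sample
```

    Both compute the same value (up to floating-point rounding); only the generated code differs.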

    Note that if we ask the Haskell executable to run with statistics on, we get the following output:

    $ time q66441802.x +RTS -s -RTS 10000000
    3.1415616
         925,771,512 bytes allocated in the heap
         ...
      Alloc rate    2,488,684,937 bytes per MUT second
    
      Productivity  99.3% of total user, 99.3% of total elapsed
    
    real   0m0,379s
    user   0m0,373s
    sys    0m0,004s
    

    so Haskell is found to allocate close to one gigabyte of memory along the way, which helps explain the speed difference.

    A code fix:

    We note that the C code uses a single random series while the Haskell code uses two, with two calls to mkStdGen seeded with 1 and 2. This is not only unfair but also incorrect. Applied mathematicians who design PRNG algorithms take great care to ensure that any single series has the right statistical properties, but give essentially no guarantees about possible correlations between different series. It is not unheard of to even use the seed as an offset into a single global sequence.

    This is fixed in the following code, which does not alter performance significantly:

    computeNorms :: [Double] -> [Double]
    computeNorms []  = []
    computeNorms [_] = []
    computeNorms (x:y:xs2) = (x*x + y*y) : (computeNorms xs2)
    
    monteCarloCircle :: Int -> Double
    monteCarloCircle nn =
        let
             randomSeed   =  1
             gen0         =  mkStdGen randomSeed
                          -- use a single random series:
             uxys         =  (randomRs  (-1.0, 1.0)  gen0) :: [Double]
             norms        =  take nn (computeNorms uxys)
             insiderCount =  length  $  filter (<= 1.0) norms
        in
             (4.0::Double) * ((fromIntegral insiderCount) / (fromIntegral nn))
    

    Side note:

    The new Haskell random-1.2 package has been mentioned in this recent SO question, though in the context of the new monadic interface.

    A word of conclusion:

    Assuming the goal of the exercise is to gauge the relative runtime speeds of the C and Haskell languages, there are essentially two possibilities:

    One is to avoid using PRNG altogether, because of the different algorithms in use.

    The other one is to control random number generation by manually supplying one and the same algorithm to both languages. For example, one could use the publicly available MRG32k3a algorithm designed by Pierre L'Ecuyer. Candidate Haskell MRG32k3a implementation here (as of March 2021). Left as an exercise for the reader.
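
    To make that last option concrete, here is a sketch of the MRG32k3a recurrence itself (constants from L'Ecuyer's published description; Integer arithmetic keeps the sketch obviously correct at the cost of speed, and a full implementation would add seeding checks and stream jumping):

```haskell
-- State: two triples holding the last three values of each component
-- recurrence, oldest value first.
type S3 = (Integer, Integer, Integer)

m1, m2 :: Integer
m1 = 4294967087   -- 2^32 - 209
m2 = 4294944443   -- 2^32 - 22853

-- One step of the combined generator: advance both recurrences,
-- combine them modulo m1, and map the result into (0, 1).
mrgStep :: (S3, S3) -> (Double, (S3, S3))
mrgStep ((x10, x11, x12), (x20, x21, x22)) =
    let p1 = (1403580 * x11 -  810728 * x10) `mod` m1
        p2 = ( 527612 * x22 - 1370589 * x20) `mod` m2
        z  = (p1 - p2) `mod` m1
        u  = if z > 0
               then fromIntegral z  / fromIntegral (m1 + 1)
               else fromIntegral m1 / fromIntegral (m1 + 1)
    in (u, ((x11, x12, p1), (x21, x22, p2)))
```

    Seeding every state element with 12345, the first draw comes out near 0.1270111, which can be checked against published MRG32k3a test output; porting the same constants to C then gives both languages an identical random stream to time against each other.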