Search code examples
haskelllarge-data

Process large lists in Haskell into a single value


I have 2 lists of Int of equal size (roughly 10,000 elements): say x and y. I need to compute a product of the following expression for each corresponding pair of elements from the lists: x_i/(x_i+y_i), i.e. x_i and y_i are the first elements of the lists, then second, etc.

My approaches work fine on small test cases, but ghci hangs for the larger lists. Any insight as to the cause and solution would be appreciated.

I tried to do this with fold, zipping the lists first:

getP:: [Int] -> [Int] -> Double
getP zippedCounts = foldr (\(x,y) acc -> let intX = fromIntegral x
                                             intY = fromIntegral y
                                             intSum = intX + intY in
                                             (acc*(intX/intSum))) 
                          1.0
                          zippedCounts

I also tried recusion:

getP lst [] = 1.0
getP [] lst = 1.0
getP (h1:t1) (h2:t2) = ((fromIntegral h1) / ((fromIntegral h1) + (fromIntegral h2))) * (getP t1 t2)

As well as list comprehension:

getP lst1 lst2 = (product [((fromIntegral x) / ((fromIntegral x) + (fromIntegral y)))|x <- lst1, y <- lst2])

Solution

  • All three solutions have space leaks, maybe that's what causes the unresponsiveness.

    In Haskell, when reducing a big list to a single summary value, it is very easy to inadvertently cause space leaks if we never "look into" the intermediate values of the computation. We can end up with a gigantic tree of unevaluated thunks hiding behind a seemingly inoffensive single Double value.


    The foldr example leaks because foldr never forces its accumulator to weak head normal form. Use the strict left foldl' instead (you will need to reorder some function arguments). foldl' should ensure that the intermediate Double values remain "small" and thunks don't accumulate.


    The explicit recursion example is dangerous because it is not tail-recursive and for large lists can cause a stack overflow (we repeatedly put values in the stack, waiting for the next recursive call to complete). A solution would involve making the function tail-recursive by passing the intermediate result as an extra parameter, and putting a bang pattern on that parameter to ensure thunks don't accumulate.


    The product example leaks because, unfortunately, neither the sum nor the product functions are strict. For large lists, it's better to use foldl' instead. (There's also a bug, as it has been mentioned in the comments.)