
Fusing multiple foldl' in Haskell


I'm trying to read and analyse a huge CSV file. I used Data.Csv.Streaming from cassava, and functions are applied in the following order:

Data.ByteString.Lazy.readFile -- Gives lazy stream
Data.Csv.Streaming.decodeByName -- Gives Either String (Header, Records t)
\(Right (_, v)) -> v -- Gives right side of either (Records t)
Data.Foldable.toList -- Gives [t]

After this the program enters the analysis stage and runs four separate instances of the following pipeline, each with a different filter (the fact that there are four is very important):

filter -- Result of toList is applied through a filter
map
Data.Foldable.foldl' -- Does bin counting using a map. The map has at most 60 keys.

However, the program uses a huge amount of memory and appears to load the entire CSV file at once.

If I only have one instance of foldl' executing, the program does a nice single pass through the CSV data and doesn't consume as much memory. Is there a way to fuse the foldl's together? That is, having

x = foldl' f Map.empty $ filter cx li
y = foldl' f Map.empty $ filter cy li
...

and forcing them to execute in a single pass.

Edit: The following function is used in foldl', with Data.Map.Strict imported as Map:

bincollect :: Ord a => Num b => Map.Map a b -> a -> Map.Map a b
bincollect !m !key = Map.insertWith (+) key 1 m

and the foldl' begins with an empty map.

The memory usage grows with the number of elements taken, with or without optimization turned on.


Solution

  • Yes, you can indeed fuse the four folds together, but you'll have to do it manually. You could write out the logic yourself, or you could use a library (like foldl) to help. For instance, using the Fold type from that library's Control.Foldl module, you can turn your bincollect into a fold:

    bincollect :: (Ord a, Num b) => Fold a (Map.Map a b)
    bincollect = Fold (\m key -> Map.insertWith (+) key 1 m) Map.empty id
    

    Then, you can filter using prefilter:

    x = prefilter cx bincollect
    

    Finally, you can combine them together using the Applicative instance:

    (w,x,y,z) = fold ((,,,) <$> prefilter cw bincollect
                            <*> prefilter cx bincollect
                            <*> prefilter cy bincollect
                            <*> prefilter cz bincollect)
                     input
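
If you would rather not pull in a library, the "write out the logic yourself" route can be sketched roughly as follows. This is only an illustration under some assumptions: the names Acc, stepIf and fused4 are made up for this example, the counts are fixed to Int, and a key-extraction function stands in for the map stage of the question. The idea is a single foldl' whose accumulator is a strict 4-tuple of maps, so every element is examined once and all four maps are forced at each step.

```haskell
module Main where

import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- Hypothetical accumulator: four bin-count maps with strict fields,
-- so no chain of thunks builds up while the input is consumed.
data Acc k = Acc !(Map.Map k Int) !(Map.Map k Int) !(Map.Map k Int) !(Map.Map k Int)

-- Same bin counting as in the question, specialised to Int counts.
bincollect :: Ord k => Map.Map k Int -> k -> Map.Map k Int
bincollect m key = Map.insertWith (+) key 1 m

-- Count the element only if it passes the given filter.
stepIf :: Ord k => (a -> Bool) -> (a -> k) -> Map.Map k Int -> a -> Map.Map k Int
stepIf p toKey m a
  | p a       = bincollect m (toKey a)
  | otherwise = m

-- Four filtered bin counts computed in one pass over the list.
fused4
  :: Ord k
  => (a -> Bool) -> (a -> Bool) -> (a -> Bool) -> (a -> Bool)
  -> (a -> k)
  -> [a]
  -> (Map.Map k Int, Map.Map k Int, Map.Map k Int, Map.Map k Int)
fused4 cw cx cy cz toKey xs =
  case foldl' step (Acc Map.empty Map.empty Map.empty Map.empty) xs of
    Acc w x y z -> (w, x, y, z)
  where
    step (Acc w x y z) a =
      Acc (stepIf cw toKey w a)
          (stepIf cx toKey x a)
          (stepIf cy toKey y a)
          (stepIf cz toKey z a)

main :: IO ()
main = do
  let (w, x, y, z) = fused4 even odd (> 5) (< 3) (`mod` 3) [1 .. 10 :: Int]
  print (Map.toList w) -- [(0,1),(1,2),(2,2)]
  print (Map.toList x) -- [(0,2),(1,2),(2,1)]
  print (Map.toList y) -- [(0,2),(1,2),(2,1)]
  print (Map.toList z) -- [(1,1),(2,1)]
```

Because the list is consumed by exactly one fold, nothing else holds a reference to its head, so the lazily produced records can be garbage collected as the fold advances. The Applicative combination above achieves the same effect with less boilerplate.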