I'm trying to read and analyse a huge CSV file. I used Data.Csv.Streaming
from cassava, and functions are applied in the following order:
Data.ByteString.Lazy.readFile -- Gives lazy stream
Data.Csv.Streaming.decodeByName -- Gives Either String (Header, Records t)
\(Right (_, v)) -> v -- Gives right side of either (Records t)
Data.Foldable.toList -- Gives [t]
After this, the program enters the analysis stage and executes four (this is very important) different instances, i.e. with different filters, of the following pipeline:
filter -- Result of toList is passed through a filter
map
Data.Foldable.foldl' -- Does bin counting using a map. The map has at most 60 keys.
However, it appears that the program takes up a huge amount of memory while attempting to load the entire CSV file.
If I only have one instance of foldl' executing, the program does a nice single pass through the CSV data and doesn't consume as much memory. Is there a way to fuse the foldl's together? That is, having
x = foldl' f Map.empty $ filter cx li
y = foldl' f Map.empty $ filter cy li
...
and forcing them to execute in a single pass.
Edit: The following function is used in foldl, with Data.Map.Strict as Map:
bincollect :: Ord a => Num b => Map.Map a b -> a -> Map.Map a b
bincollect !m !key = Map.insertWith (+) key 1 m
and the foldl begins with an empty map.
The memory usage grows with the number of elements taken, with or without optimization on.
Yes, you can indeed fuse the four folds together, but you'll have to do it manually. You could try and write out the logic yourself, or you could use a library (like foldl) to help. For instance, you can turn your bincollect into a fold:
bincollect :: (Ord a, Num b) => Fold a (Map.Map a b)
bincollect = Fold (\m key -> Map.insertWith (+) key 1 m) Map.empty id
Then, you can filter using prefilter:
x = prefilter cx bincollect
Finally, you can combine them together using the Applicative instance:
(w,x,y,z) = fold ((,,,) <$> prefilter cw bincollect
                        <*> prefilter cx bincollect
                        <*> prefilter cy bincollect
                        <*> prefilter cz bincollect)
                 input
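If you would rather write out the logic yourself, here is a minimal sketch that needs only base and containers. The names Bins, countIf, and fourBins are made up for illustration, the counts are fixed to Int for simplicity, and the predicates are applied directly to the keys (any map step between the filter and the fold is left out):

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- Hypothetical helper type: four accumulators in one record.
-- The strict fields force each map at every step, so no thunks pile up.
data Bins a = Bins !(Map.Map a Int) !(Map.Map a Int) !(Map.Map a Int) !(Map.Map a Int)

-- Count the key in the map only when the predicate accepts it.
countIf :: Ord a => (a -> Bool) -> Map.Map a Int -> a -> Map.Map a Int
countIf p m key
  | p key     = Map.insertWith (+) key 1 m
  | otherwise = m

-- A single foldl' over the input that updates all four maps per element.
fourBins
  :: Ord a
  => (a -> Bool) -> (a -> Bool) -> (a -> Bool) -> (a -> Bool)
  -> [a] -> Bins a
fourBins cw cx cy cz = foldl' step (Bins Map.empty Map.empty Map.empty Map.empty)
  where
    step (Bins w x y z) k =
      Bins (countIf cw w k) (countIf cx x k) (countIf cy y k) (countIf cz z k)
```

Because the list is consumed by exactly one fold, nothing else holds on to its head, so the already-processed prefix can be garbage collected as the pass proceeds. This is the same idea the Applicative version expresses, just spelled out by hand.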