Given an input file of words separated by spaces, file size ~64 MB:
import qualified Data.Set as Set

main :: IO ()
main = do
    content <- getData "data_1.txt" -- opens the file, reads its contents, closes the handle, returns the contents
    let tokens = words content
        mappedData = map (\token -> (token, 1)) tokens
        keySet = Set.fromList tokens
        intermediateData = map (\key -> (key, map snd (filter (\kv -> fst kv == key) mappedData))) (Set.toList keySet)
        final = map (\pair -> (fst pair, foldr (+) 0 (snd pair))) intermediateData
    print final
vs
content = ""
with open("data_1.txt", "r") as file:
    content = file.read()

tokens = content.split()

intermediate_data = []
for token in tokens:
    intermediate_data.append((token, 1))

keys = set()
for pair in intermediate_data:
    keys.add(pair[0])

grouped_values = []
for key in keys:
    values = [y for x, y in intermediate_data if x == key]
    grouped_values.append((key, values))

final = []
for elem in grouped_values:
    reduced = sum(elem[1])
    final.append((elem[0], reduced))

print(final)
The Haskell program uses 4.1 GB of RAM versus the Python program's 1.7 GB. They both do nearly the same thing, and while this example is 100% lazily evaluated, making the evaluation mostly strict barely improves the usage at all. Is there something obvious that I am doing wrong?
I could parallelize the Haskell program or use more efficient data structures, but there seems to be an underlying issue, since the RAM usage is ~2.5x greater than Python's. I imagine that if I used a faster compiled language, the RAM usage would be even lower.
- String is a linked list of characters. That's horrible in terms of memory footprint (≈128 bits per character, when 8 would be enough for English text). This type should only be used for strings that either need to be processed lazily (specifically, character-by-character lazy), or are so few and short that efficiency doesn't matter (and you are confident that it won't be more during runtime in an unusual use case). Use ByteString (for exact encoding control and binary data), or Text if you are using it as, well, text, which seems to be the case here.
- Don't use foldr (+) for computing a sum. Use simply sum, which has better strictness built in to make this efficient. (Doesn't actually matter here.)
- Computing the Set already involves merging equal words together. If you switch to the similar Map, you can tap into that merging for accumulating the counters, and thus never need to store all those redundant 1-s. (See the small illustration after the code below.)

Try this:
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Map as Map

main :: IO ()
main = do
    content <- T.readFile "data_1.txt"
    let tokens = T.words content
        mappedData = Map.fromListWith (+)
                       $ map (\token -> (token, 1)) tokens
        final = Map.toList mappedData
    print final
... or, shorter (with the same imports):

{-# LANGUAGE TupleSections #-}

main :: IO ()
main = do
    content <- T.readFile "data_1.txt"
    print . Map.toList . Map.fromListWith (+)
      $ (,1) <$> T.words content
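For reference, this is what Map.fromListWith (+) does with the duplicate keys (a tiny GHCi-style sketch, with example words of my own choosing):

-- Duplicate keys are combined with (+) as they are inserted,
-- so the redundant 1-s are never stored:
Map.fromListWith (+) [("the", 1), ("cat", 1), ("the", 1), ("sat", 1)]
-- == Map.fromList [("cat", 1), ("sat", 1), ("the", 2)]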
Using readFile actually still makes this far from ideal, because that uses old-fashioned Chars under the hood. In this case, the char list should get fused away / incrementally garbage-collected though, so that not all of the characters reside in memory simultaneously; only the resulting Text does.
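If you want to avoid that Char detour entirely, you can read the raw bytes and decode them yourself. A minimal sketch, assuming the file is UTF-8 encoded (decodeUtf8 throws on invalid input):

import qualified Data.ByteString as BS
import qualified Data.Map as Map
import qualified Data.Text as T
import qualified Data.Text.Encoding as T

main :: IO ()
main = do
    -- read the file as raw bytes and decode straight to Text,
    -- without ever building a [Char]
    content <- T.decodeUtf8 <$> BS.readFile "data_1.txt"
    print . Map.toList . Map.fromListWith (+)
      $ map (\token -> (token, 1 :: Int)) (T.words content)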
You could take this even further by not reading in the complete text up front, but only consuming it lazily. That should make it a lot more memory-slim still.
{-# LANGUAGE TupleSections #-}

import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TL
import qualified Data.Map as Map

main :: IO ()
main = do
    content <- TL.decodeUtf8 <$> BL.readFile "data_1.txt"
    print . Map.toList . Map.fromListWith (+)
      $ (,1) . TL.toStrict <$> TL.words content
In this version, the runtime would be able to garbage-collect the first words before most of the others have even been read into memory, and only the duplicate-reduced form needs to be stored permanently.
But do note that lazy IO is fiddly: it's easy to accidentally e.g. close a file before you're really finished with it. If you use this in a real application, you should ensure that the list of results is fully evaluated / in memory before handing on control (see the sketch below).
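A minimal sketch of how you could force that, using evaluate from Control.Exception together with force from the deepseq package (same file name and imports assumed as above):

import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TL
import qualified Data.Map as Map
import Control.DeepSeq (force)      -- from the deepseq package
import Control.Exception (evaluate)

main :: IO ()
main = do
    content <- TL.decodeUtf8 <$> BL.readFile "data_1.txt"
    -- evaluate (force ...) brings the whole association list into normal
    -- form, so the lazily-read file has been consumed completely before
    -- control moves on and nothing still depends on the open handle.
    counts <- evaluate . force . Map.toList . Map.fromListWith (+)
                $ map (\w -> (TL.toStrict w, 1 :: Int)) (TL.words content)
    print counts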