Given an input file of words separated by spaces, file size ~64 MB:
import qualified Data.Set as Set

main :: IO ()
main = do
    content <- getData "data_1.txt" -- opens the file, reads its contents, closes the handle, returns the contents
    let tokens = words content
        mappedData = map (\token -> (token, 1)) tokens
        keySet = Set.fromList tokens
        intermediateData = map (\key -> (key, map snd (filter (\kv -> fst kv == key) mappedData))) (Set.toList keySet)
        final = map (\pair -> (fst pair, foldr (+) 0 (snd pair))) intermediateData
    print final
vs
content = ""
with open("data_1.txt", "r") as file:
    content = file.read()

tokens = content.split()

intermediate_data = []
for token in tokens:
    intermediate_data.append((token, 1))

keys = set()
for pair in intermediate_data:
    keys.add(pair[0])

grouped_values = []
for key in keys:
    values = [y for x, y in intermediate_data if x == key]
    grouped_values.append((key, values))

final = []
for elem in grouped_values:
    reduced = sum(elem[1])
    final.append((elem[0], reduced))

print(final)
The Haskell program uses 4.1 GB of RAM versus the Python program's 1.7 GB. They both do nearly the same thing, and while this example is 100% lazily evaluated, making the evaluation mostly strict barely improves the usage at all. Is there something obvious that I am doing wrong?
I could parallelize the Haskell program or use more efficient data structures, but there seems to be an underlying issue, since the RAM usage is ~2.5x greater than Python's. I imagine that if I used a faster compiled language, the RAM usage would be even lower.
- String is a linked list of characters. That's horrible in terms of memory footprint (≈128 bits per character, when 8 would be enough for English text). This type should only be used for strings that either need to be processed lazily (specifically, character-by-character lazy), or are so few and short that efficiency doesn't matter (and you are confident that it won't be more during runtime in an unusual use case). Use ByteString (for exact encoding control and binary data), or Text if you are using it as, well, text, which seems to be the case here.
- Don't use foldr (+) for computing a sum. Use simply sum, which has better strictness built in to make this efficient. (Doesn't actually matter here.)
- Computing the Set already involves merging equal words together. If you switch to the similar Map, you can tap into that merging for accumulating the counters, and thus never need to store all those redundant 1-s. (See the small illustration after the code below.)

Try this:
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Map as Map

main :: IO ()
main = do
    content <- T.readFile "data_1.txt"
    let tokens = T.words content
        mappedData = Map.fromListWith (+)
                       $ map (\token -> (token, 1)) tokens
        final = Map.toList mappedData
    print final
... or, shorter (with the same imports):

{-# LANGUAGE TupleSections #-}

main :: IO ()
main = do
    content <- T.readFile "data_1.txt"
    print . Map.toList . Map.fromListWith (+)
      $ (,1) <$> T.words content
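For reference, this is what Map.fromListWith (+) does with the duplicate keys (a tiny GHCi-style sketch, with example words of my own choosing):

-- Duplicate keys are combined with (+) as they are inserted,
-- so the redundant 1-s are never stored:
Map.fromListWith (+) [("the", 1), ("cat", 1), ("the", 1), ("sat", 1)]
-- == Map.fromList [("cat", 1), ("sat", 1), ("the", 2)]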
Using readFile actually still makes this far from ideal, because that uses old-fashioned Chars under the hood. In this case, the char list should get fused away / incrementally garbage-collected though, so that not all of the characters reside in memory simultaneously; only the resulting Text does.
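If you want to avoid that Char detour entirely, you can read the raw bytes and decode them yourself. A minimal sketch, assuming the file is UTF-8 encoded (decodeUtf8 throws on invalid input):

import qualified Data.ByteString as BS
import qualified Data.Map as Map
import qualified Data.Text as T
import qualified Data.Text.Encoding as T

main :: IO ()
main = do
    -- read the file as raw bytes and decode straight to Text,
    -- without ever building a [Char]
    content <- T.decodeUtf8 <$> BS.readFile "data_1.txt"
    print . Map.toList . Map.fromListWith (+)
      $ map (\token -> (token, 1 :: Int)) (T.words content)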
You could take this even further by not reading in the complete text up front, but only consuming it lazily. That should make it a lot more memory-slim still.
{-# LANGUAGE TupleSections #-}

import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TL
import qualified Data.Map as Map

main :: IO ()
main = do
    content <- TL.decodeUtf8 <$> BL.readFile "data_1.txt"
    print . Map.toList . Map.fromListWith (+)
      $ (,1) . TL.toStrict <$> TL.words content
In this version, the runtime would be able to garbage-collect the first words before most of the others have even been read into memory, and only the duplicate-reduced form needs to be stored permanently.
But do note that lazy IO is fiddly: it's easy to accidentally e.g. close a file before you're really finished with it. If you use this in a real application, you should ensure that the list of results is fully evaluated / in memory before handing on control (see the sketch below).
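A minimal sketch of how you could force that, using evaluate from Control.Exception together with force from the deepseq package (same file name and imports assumed as above):

import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TL
import qualified Data.Map as Map
import Control.DeepSeq (force)      -- from the deepseq package
import Control.Exception (evaluate)

main :: IO ()
main = do
    content <- TL.decodeUtf8 <$> BL.readFile "data_1.txt"
    -- evaluate (force ...) brings the whole association list into normal
    -- form, so the lazily-read file has been consumed completely before
    -- control moves on and nothing still depends on the open handle.
    counts <- evaluate . force . Map.toList . Map.fromListWith (+)
                $ map (\w -> (TL.toStrict w, 1 :: Int)) (TL.words content)
    print counts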