haskell, template-haskell

TemplateHaskell memory usage when compiling


I have a memory consumption problem when using TemplateHaskell in RuzzSolver, one of my Haskell projects. The sources of RuzzSolver are available on GitHub.

To achieve good performance, I load a ~380,000-word dictionary into a Tree structure (from the containers package). This greatly speeds up solving grids, but the loading itself takes some time (between 1 and 2 seconds, depending on the CPU).

I would like to create the structure directly during compile-time with TemplateHaskell.

Therefore I transformed this dictionary-loading function:

-- Dictionary.hs, line 155
getDictionary :: String -> IO Dictionary
getDictionary dictionaryFilePath = do
    content <- readFile dictionaryFilePath
    return $ foldl (+++) [] (createTree <$> lines content)

into this function:

-- Dictionary.hs, line 164
getDictionaryQ :: String -> Q Exp
getDictionaryQ dictionaryFilePath = do
    content <- runIO $ readFile dictionaryFilePath
    lift $ foldl (+++) [] (createTree <$> lines content)

view Dictionary.hs
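
For lift to accept the tree, its node type needs a Lift instance. A minimal sketch of what that looks like, assuming a simple trie-like node type (the Trie name below is only illustrative; the real type lives in Dictionary.hs), on a GHC recent enough to support DeriveLift:

{-# LANGUAGE DeriveLift #-}
import Language.Haskell.TH.Syntax (Lift)

-- Illustrative stand-in for the dictionary's node type; with DeriveLift,
-- GHC generates the Lift instance that `lift` needs to build the splice.
data Trie = Node Char Bool [Trie]
    deriving Lift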

It allowed me to go from:

-- ruzzSolver.hs, line 68
dictionary <- getDictionary "dictionary/ruzzdictionary.txt"

to:

-- ruzzSolver.hs, line 68
let dictionary = $(getDictionaryQ "dictionary/ruzzdictionary.txt")

view ruzzSolver.hs

It should work, but it requires too much memory to compile! On my 8 GB PC, I had to stop GHC when its memory consumption reached 12 GB. Reducing the dictionary to 38,000 words lets it compile, but it still needs 3 to 4 GB.

Is there a way to make GHC use less memory when compiling this TemplateHaskell code? Or is there another way to embed this structure in the executable?


Solution

  • Perhaps you can "embed" the trie into the executable to save on loading and creation time, but one problem I foresee is that conventional Haskell data structures are quite bloated compared to data structures in other languages.

    Also, most containers allow for insertion and deletion, but it looks like your data is constant, so you'll only need the final data structure. Moreover, you'll only be using it for queries like:

    • Does this word exist in the dictionary?
    • And, is this string a prefix of some word in the dictionary?

    You want a compact representation of the dictionary with some sort of pre-computed index to make lookups fast.

    Some options:

    Option 1: Create a BerkeleyDB database.

    Such a database allows greater-than and less-than queries.

    Pros: No database load time.

    Cons: Queries require disk accesses. Although, once pages are read by the OS they should be cached and subsequent reads should be fast.

    Note: I've written a Boggle solver in Perl using a Berkeley DB, so this approach is very viable.

    Similar to a BerkeleyDB is a CDB (Constant Database) for which there is also a Haskell package. However, a CDB only supports equality queries, so it's probably not usable for your application.
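
    To see why greater-than queries are enough for the prefix test: seek to the first key that is >= the prefix and check whether that key starts with the prefix. Here is a small sketch of the idea using Data.Set's lookupGE as a stand-in; the same trick works with a B-tree cursor positioned at the smallest key >= the prefix:

    import qualified Data.Set as Set
    import Data.List (isPrefixOf)

    -- True if some word in the set starts with the given prefix.  Because the
    -- set is ordered, the first word >= the prefix is the only candidate worth
    -- checking: any word carrying that prefix would sort at or after it.
    hasPrefix :: String -> Set.Set String -> Bool
    hasPrefix p ws = case Set.lookupGE p ws of
        Just w  -> p `isPrefixOf` w
        Nothing -> False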

    Option 2: Represent the dictionary simply as a sorted file of words. Create a custom index to make queries efficient.

    A simple index could just be a 26*26*26 element array indicating the offset into the file of each three-letter prefix. Such a small index can be compiled into the program. Load the dictionary as a single (strict) ByteString.
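
    A sketch of how such an index could be built, assuming the dictionary is already a sorted file of lowercase a-z words, one per line (all of the names below are illustrative, not from the project):

    import qualified Data.ByteString.Char8 as B
    import Data.Array (Array, listArray)

    -- Offset table: for every three-letter prefix (in alphabetical order), the
    -- byte offset of the first word in the sorted file whose first three
    -- letters compare >= that prefix.
    buildIndex :: B.ByteString -> Array Int Int
    buildIndex dict = listArray (0, 26 * 26 * 26 - 1) (go prefixes entries)
      where
        prefixes = [ B.pack [a, b, c] | a <- abc, b <- abc, c <- abc ]
        abc      = ['a' .. 'z']
        ws       = B.lines dict
        offsets  = scanl (\off w -> off + B.length w + 1) 0 ws
        entries  = zip offsets ws
        -- Both lists are sorted, so a single merge-like pass suffices.
        go [] _  = []
        go ps [] = map (const (B.length dict)) ps
        go pps@(p : ps) ees@((off, w) : es)
            | B.take 3 w >= p = off : go ps ees
            | otherwise       = go pps es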

    Use the index and binary search within the byte string to resolve queries. Perhaps the ByteString functions will work well here, but as a last resort you can always use an Int offset into the loaded dictionary as a "pointer" which you can move around to find the beginning of the next word.
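
    And a sketch of a membership query against that index. For simplicity it scans the (small) byte range belonging to the word's three-letter prefix linearly rather than binary-searching it; words are assumed to be lowercase and at least three letters long:

    import qualified Data.ByteString.Char8 as B
    import Data.Array (Array, bounds, (!))
    import Data.Char (ord)

    -- Map a three-letter prefix to its slot in the 26*26*26 offset table.
    slot :: B.ByteString -> Int
    slot p = i (B.index p 0) * (26 * 26) + i (B.index p 1) * 26 + i (B.index p 2)
      where i c = ord c - ord 'a'

    -- Is the word present in the sorted dictionary?  `idx` is the offset table
    -- built by the previous sketch; `dict` is the whole file as a ByteString.
    member :: B.ByteString -> Array Int Int -> B.ByteString -> Bool
    member dict idx w = w `elem` B.lines bucket
      where
        s      = slot (B.take 3 w)
        start  = idx ! s
        end    | s == snd (bounds idx) = B.length dict
               | otherwise             = idx ! (s + 1)
        bucket = B.take (end - start) (B.drop start dict)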

    You might be able to compile the dictionary ByteString into the executable, but loading 4 MB of data shouldn't take too long - especially if it is already in the OS cache.
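
    If you do decide to bake the bytes into the binary, one possibility (not something covered above, just a common approach) is the file-embed package: its embedFile splice turns a file into a strict ByteString literal at compile time, and since it only lifts raw bytes rather than a tree of constructors, it shouldn't stress GHC the way lifting the whole structure does. A sketch, reusing the path from the question:

    {-# LANGUAGE TemplateHaskell #-}
    import Data.ByteString (ByteString)
    import Data.FileEmbed (embedFile)

    -- The sorted word list, baked into the executable as a plain ByteString.
    dictionaryBytes :: ByteString
    dictionaryBytes = $(embedFile "dictionary/ruzzdictionary.txt")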

    Update: An example of this second idea may be found here.