Search code examples
haskelllazy-evaluationstate-monadio-monad

Where is the memory leak in using StateT s IO a?


Intention: Small application to learn Haskell: Downloads a wikipedia-article, then downloads all articles linked from it, then downloads all articles linked from them, and so on... until a specified recursion depth is reached. The result is saved to a file.

Approach: Use a StateT to keep track of the download queue, to download an article and to update the queue. I build a list IO [WArticle] recursively and then print it.

Problem: While profiling I find that total memory in use is proportional to number of articles downloaded.

Analysis: By literature I'm lead to believe this is a laziness and/or strictness issue. BangPatterns reduced the memory consumed but didn't solve proportionality. Furthermore, I know all articles are downloaded before the file output is started.

Possible solutions:

1) The function getNextNode :: StateT CrawlState IO WArticle (below) already has IO. One solution would be to just do the file writing in it and only return the state. It would mean the file is written to in very small chunks though. Doesn't feel very Haskell..

2) Have the function buildHelper :: CrawlState -> IO [WArticle] (below) return [IO WArticle]. Though I wouldn't know how to rewrite that code and have been advised against it in the comments.

Are any of these proposed solutions better than I think they are or are there better alternatives?

import GetArticle (WArticle, getArticle, wa_links, wiki2File) -- my own
type URL = Text

data CrawlState =
     CrawlState  ![URL]       ![(URL, Int)]
          --    [Completed]    [(Queue, depth)]
-- Called by user
buildDB :: URL -> Int -> IO [WArticle]
buildDB startURL recursionDepth = buildHelper cs
    where cs = CrawlState [] [(startURL, recursionDepth)]

-- Builds list recursively
buildHelper :: CrawlState -> IO [WArticle]
buildHelper !cs@(CrawlState _ queue) = {-# SCC "buildHelper" #-}
  if null queue
    then return []
    else do
      (!article, !cs') <- runStateT getNextNode cs
      rest <- buildHelper cs'
      return (article:rest)

-- State manipulation
getNextNode :: StateT CrawlState IO WArticle
getNextNode = {-# SCC "getNextNode" #-} do
  CrawlState !parsed !queue@( (url, depth):queueTail ) <- get
  article <- liftIO $ getArticle url
  put $ CrawlState (url:parsed) (queueTail++ ( if depth > 1
          then let  !newUrls  = wa_links article \\ parsed
                    !newUrls' = newUrls          \\ map fst queue
                    in zip newUrls' (repeat (depth-1))
          else []))
  return article

startUrl = pack "https://en.wikipedia.org/wiki/Haskell_(programming_language)"
recursionDepth = 3

main :: IO ()
main =  {-# SCC "DbMain" #-}
  buildDB startUrl recursionDepth
   >>= return . wiki2File
   >>= writeFile "savedArticles.txt"

Full code at https://gitlab.com/mattias.br/sillyWikipediaSpider. Current version limited to only download the first eight links from each page to save time. Without changing it download 55 pages at ~600 MB heap usage.

Thanks for any help!


Solution

  • 2) Is [IO WArticle] want I want in this case?

    Not quite. The problem is that some of the IO WArticle actions depend on the result of a previous action: the links to future pages reside in previously obtained pages. [IO Warticle] can't provide that: it is pure in the sense that you can always find an action in the list without executing the previous actions.

    What we need is a kind of "effectful list" that lets us extract articles one by one, progressively performing the neccessary effects, but not forcing us to completely generate the list in one go.

    There are several libraries that provide these kinds of "effectful lists": streaming, pipes, conduit. They define monad transformers that extend a base monad with the ability to yield intermediate values before returning a final result. Usually the final result is of a type different from the values that are yielded; it might be simply unit ().

    Note: The Functor, Applicative and Monad instances for these libraries differ from the corresponding instances for pure lists. The Functor instances map over the resulting final value, not over the intermediate values which are yielded. To map over the yielded values, they provide separate functions. And The Monad instances sequence effectful lists, instead of trying all combinations. To try all combinations, they provide separate functions.

    Using the streaming library, we could modify buildHelper to something like this:

    import Streaming
    import qualified Streaming.Prelude as S
    
    buildHelper :: CrawlState -> Stream (Of WArticle) IO ()
    buildHelper !cs@(CrawlState _ queue) = 
      if null queue
        then return []
        else do (article, cs') <- liftIO (runStateT getNextNode cs)
                S.yield article
                buildHelper cs'
    

    And then we could use functions like mapM_ (from Streaming.Prelude, not the one from Control.Monad!) to process the articles one by one, as they are generated.