Intention: Small application to learn Haskell: Downloads a wikipedia-article, then downloads all articles linked from it, then downloads all articles linked from them, and so on... until a specified recursion depth is reached. The result is saved to a file.
Approach: Use a StateT to keep track of the download queue, to download an article and to update the queue. I build a list IO [WArticle] recursively and then print it.
Problem: While profiling, I find that total memory in use is proportional to the number of articles downloaded.
Analysis: From what I've read, I'm led to believe this is a laziness and/or strictness issue. BangPatterns reduced the memory consumed but didn't solve the proportionality. Furthermore, I know that all articles are downloaded before the file output starts.
Possible solutions:
1) The function getNextNode :: StateT CrawlState IO WArticle (below) already has IO. One solution would be to do the file writing in it and only return the state. It would mean the file is written to in very small chunks, though. Doesn't feel very Haskell.
2) Have the function buildHelper :: CrawlState -> IO [WArticle] (below) return [IO WArticle]. Though I wouldn't know how to rewrite that code, and I have been advised against it in the comments.
Are any of these proposed solutions better than I think they are or are there better alternatives?
{-# LANGUAGE BangPatterns #-}

-- Library imports below are assumed (mtl, text); see the repository for the originals.
import Control.Monad.IO.Class (liftIO)
import Control.Monad.State (StateT, get, put, runStateT)
import Data.List ((\\))
import Data.Text (Text, pack)
import GetArticle (WArticle, getArticle, wa_links, wiki2File) -- my own

type URL = Text

data CrawlState =
    CrawlState ![URL] ![(URL, Int)]
    --         [Completed] [(Queue, depth)]

-- Called by user
buildDB :: URL -> Int -> IO [WArticle]
buildDB startURL recursionDepth = buildHelper cs
    where cs = CrawlState [] [(startURL, recursionDepth)]

-- Builds list recursively
buildHelper :: CrawlState -> IO [WArticle]
buildHelper !cs@(CrawlState _ queue) = {-# SCC "buildHelper" #-}
    if null queue
        then return []
        else do
            (!article, !cs') <- runStateT getNextNode cs
            rest <- buildHelper cs'
            return (article:rest)

-- State manipulation
getNextNode :: StateT CrawlState IO WArticle
getNextNode = {-# SCC "getNextNode" #-} do
    CrawlState !parsed !queue@((url, depth):queueTail) <- get
    article <- liftIO $ getArticle url
    put $ CrawlState (url:parsed) (queueTail ++ (if depth > 1
        then let !newUrls  = wa_links article \\ parsed
                 !newUrls' = newUrls \\ map fst queue
             in zip newUrls' (repeat (depth-1))
        else []))
    return article

startUrl = pack "https://en.wikipedia.org/wiki/Haskell_(programming_language)"
recursionDepth = 3

main :: IO ()
main = {-# SCC "DbMain" #-}
    buildDB startUrl recursionDepth
        >>= return . wiki2File
        >>= writeFile "savedArticles.txt"
Full code at https://gitlab.com/mattias.br/sillyWikipediaSpider. The current version is limited to downloading only the first eight links from each page to save time. As is, it downloads 55 pages at ~600 MB heap usage.
Thanks for any help!
2) Is [IO WArticle] what I want in this case?
Not quite. The problem is that some of the IO WArticle actions depend on the result of a previous action: the links to future pages reside in previously obtained pages. [IO WArticle] can't provide that: it is pure in the sense that you can always find an action in the list without executing the previous actions.
What we need is a kind of "effectful list" that lets us extract articles one by one, progressively performing the necessary effects, but not forcing us to completely generate the list in one go.
There are several libraries that provide these kinds of "effectful lists": streaming, pipes, conduit. They define monad transformers that extend a base monad with the ability to yield intermediate values before returning a final result. Usually the final result is of a type different from the values that are yielded; it might be simply unit ().
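To make the idea of an "effectful list" concrete, here is a minimal hand-rolled sketch that uses only the Prelude. It is not how streaming, pipes or conduit are implemented; the names EList, unfoldE, mapME_, fakeLinks and crawl are all made up for illustration, and fakeLinks is a stand-in for a real download in which page n links to pages 2n and 2n+1.

```haskell
-- An "effectful list": asking for the next element runs an action in m,
-- which either ends the list or yields one element plus the remainder.
newtype EList m a = EList { step :: m (Maybe (a, EList m a)) }

-- Unfold an effectful list from a seed. The next seed is only computed
-- when the consumer asks for the next element.
unfoldE :: Monad m => (s -> m (Maybe (a, s))) -> s -> EList m a
unfoldE f s = EList $ do
  r <- f s
  return $ case r of
    Nothing      -> Nothing
    Just (a, s') -> Just (a, unfoldE f s')

-- Consume the elements one at a time, without building a list of them.
mapME_ :: Monad m => (a -> m ()) -> EList m a -> m ()
mapME_ act l = do
  r <- step l
  case r of
    Nothing        -> return ()
    Just (a, rest) -> act a >> mapME_ act rest

-- Hypothetical stand-in for getArticle: page n links to 2n and 2n+1.
fakeLinks :: Int -> IO [Int]
fakeLinks n = return [2 * n, 2 * n + 1]

-- Breadth-first crawl. The state is (completed pages, queue with depths),
-- mirroring CrawlState; each page is yielded as soon as it is "downloaded".
crawl :: Int -> Int -> EList IO Int
crawl start depth = unfoldE next ([], [(start, depth)])
  where
    next (_, []) = return Nothing
    next (seen, (page, d) : queue) = do
      links <- fakeLinks page
      let new = if d > 1
                  then [ (l, d - 1)
                       | l <- links
                       , l `notElem` (page : seen)
                       , l `notElem` map fst queue ]
                  else []
      return (Just (page, (page : seen, queue ++ new)))
```

With these definitions, mapME_ print (crawl 1 3) handles each page as soon as it has been fetched, and the links found on a page decide which pages are fetched later, which a plain [IO a] cannot express.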
Note: The Functor, Applicative and Monad instances for these libraries differ from the corresponding instances for pure lists. The Functor instances map over the resulting final value, not over the intermediate values which are yielded. To map over the yielded values, they provide separate functions. The Monad instances sequence effectful lists, instead of trying all combinations. To try all combinations, they provide separate functions.
Using the streaming library, we could modify buildHelper to something like this:
import Streaming
import qualified Streaming.Prelude as S

buildHelper :: CrawlState -> Stream (Of WArticle) IO ()
buildHelper !cs@(CrawlState _ queue) =
    if null queue
        then return ()  -- the stream's final result is unit, not a list
        else do (article, cs') <- liftIO (runStateT getNextNode cs)
                S.yield article
                buildHelper cs'
And then we could use functions like mapM_ (from Streaming.Prelude, not the one from Control.Monad!) to process the articles one by one, as they are generated.
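As a rough sketch of that last step, the following writes each yielded value to a file as it arrives. It assumes the streaming package is installed. fakeArticles and writeAll are hypothetical names, and plain String values stand in for WArticle, whose rendering function is not shown here.

```haskell
import Streaming (Of, Stream)
import qualified Streaming.Prelude as S
import System.IO (IOMode (WriteMode), hPutStrLn, withFile)

-- Hypothetical stand-in for buildHelper: a stream of three fake "articles".
fakeArticles :: Stream (Of String) IO ()
fakeArticles = S.each ["article one", "article two", "article three"]

-- Write every yielded article to the file as soon as it is produced.
-- S.mapM_ runs the action once per element and returns the stream's result.
writeAll :: FilePath -> Stream (Of String) IO r -> IO r
writeAll path articles =
  withFile path WriteMode $ \h ->
    S.mapM_ (hPutStrLn h) articles
```

Calling writeAll "savedArticles.txt" fakeArticles writes one line per article. With the real crawler, the stream argument would be buildHelper applied to the initial CrawlState, and the per-element action would render a WArticle instead of writing a String.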