Some of the threads are terminated before they finish if the code is compiled as a console program or run as fsi --use:Program.fs --exec --quiet. Is there any way to wait for all the threads to end?
The issue can be described as "the program exit problem when multiple MailboxProcessors exist".
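To show the effect in isolation, here is a minimal sketch of my understanding (the echo agent below is made up for illustration; it is not part of the crawler further down): a MailboxProcessor runs its body on thread-pool threads, which are background threads, so the process can exit before the agent has drained its queue.

open System.Threading

let echo =
    MailboxProcessor<string>.Start(fun inbox -> async {
        while true do
            let! msg = inbox.Receive()
            Thread.Sleep 100                 // simulate some work per message
            printfn "echo: %s" msg })

echo.Post "hello"
// When the main thread falls off the end of the program here, the process
// terminates even though the agent still has work queued, so "echo: hello"
// may never appear.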
Output example
(Note that the last line is truncated and the final output call, printfn "[Main] after crawl", is never executed.)
[Main] before crawl
[Crawl] before return result
http://news.google.com crawled by agent 1.
[supervisor] reached limit
Agent 5 is done.
http://www.gstatic.com/news/img/favicon.ico crawled by agent 1.
[supervisor] reached limit
Agent 1 is done.
http://www.google.com/imghp?hl=en&tab=ni crawled by agent 4.
[supervisor] reached limit
Agent 4 is done.
http://www.google.com/webhp?hl=en&tab=nw crawled by agent 2.
[supervisor] reached limit
Agent 2 is done.
http://news.google.
Code
Edit: added several System.Threading.Thread.CurrentThread.IsBackground <- false statements.
open System
open System.Collections.Concurrent
open System.Collections.Generic
open System.IO
open System.Net
open System.Text.RegularExpressions

module Helpers =

    type Message =
        | Done
        | Mailbox of MailboxProcessor<Message>
        | Stop
        | Url of string option
        | Start of AsyncReplyChannel<unit>

    // Gates the number of crawling agents.
    [<Literal>]
    let Gate = 5

    // Extracts links from HTML.
    let extractLinks html =
        let pattern1 = "(?i)href\\s*=\\s*(\"|\')/?((?!#.*|/\B|" +
                       "mailto:|location\.|javascript:)[^\"\']+)(\"|\')"
        let pattern2 = "(?i)^https?"

        let links =
            [
                for x in Regex(pattern1).Matches(html) do
                    yield x.Groups.[2].Value
            ]
            |> List.filter (fun x -> Regex(pattern2).IsMatch(x))
        links

    // Fetches a Web page.
    let fetch (url : string) =
        try
            let req = WebRequest.Create(url) :?> HttpWebRequest
            req.UserAgent <- "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)"
            req.Timeout <- 5000
            use resp = req.GetResponse()
            let content = resp.ContentType
            let isHtml = Regex("html").IsMatch(content)
            match isHtml with
            | true ->
                use stream = resp.GetResponseStream()
                use reader = new StreamReader(stream)
                let html = reader.ReadToEnd()
                Some html
            | false -> None
        with
        | _ -> None

    let collectLinks url =
        let html = fetch url
        match html with
        | Some x -> extractLinks x
        | None -> []

open Helpers

// Creates a mailbox that synchronizes printing to the console (so
// that two calls to 'printfn' do not interleave when printing)
let printer =
    MailboxProcessor.Start(fun x -> async {
        while true do
            let! str = x.Receive()
            System.Threading.Thread.CurrentThread.IsBackground <- false
            printfn "%s" str })

// Hides standard 'printfn' function (formats the string using
// 'kprintf' and then posts the result to the printer agent).
let printfn fmt =
    Printf.kprintf printer.Post fmt

let crawl url limit =
    // Concurrent queue for saving collected urls.
    let q = ConcurrentQueue<string>()

    // Holds crawled URLs.
    let set = HashSet<string>()

    let supervisor =
        MailboxProcessor.Start(fun x -> async {
            System.Threading.Thread.CurrentThread.IsBackground <- false
            // The agent expects to receive 'Start' message first - the message
            // carries a reply channel that is used to notify the caller
            // when the agent completes crawling.
            let! start = x.Receive()
            let repl =
                match start with
                | Start repl -> repl
                | _ -> failwith "Expected Start message!"

            let rec loop run =
                async {
                    let! msg = x.Receive()
                    match msg with
                    | Mailbox(mailbox) ->
                        let count = set.Count
                        if count < limit - 1 && run then
                            let url = q.TryDequeue()
                            match url with
                            | true, str ->
                                if not (set.Contains str) then
                                    let set' = set.Add str
                                    mailbox.Post <| Url(Some str)
                                    return! loop run
                                else
                                    mailbox.Post <| Url None
                                    return! loop run
                            | _ ->
                                mailbox.Post <| Url None
                                return! loop run
                        else
                            printfn "[supervisor] reached limit"
                            // Wait for finishing
                            mailbox.Post Stop
                            return! loop run
                    | Stop -> printfn "[Supervisor] stop"; return! loop false
                    | Start _ -> failwith "Unexpected start message!"
                    | Url _ -> failwith "Unexpected URL message!"
                    | Done ->
                        printfn "[Supervisor] Supervisor is done."
                        (x :> IDisposable).Dispose()
                        // Notify the caller that the agent has completed
                        repl.Reply(())
                }
            do! loop true })

    let urlCollector =
        MailboxProcessor.Start(fun y ->
            let rec loop count =
                async {
                    System.Threading.Thread.CurrentThread.IsBackground <- false
                    let! msg = y.TryReceive(6000)
                    match msg with
                    | Some message ->
                        match message with
                        | Url u ->
                            match u with
                            | Some url ->
                                q.Enqueue url
                                return! loop count
                            | None -> return! loop count
                        | _ ->
                            match count with
                            | Gate ->
                                (y :> IDisposable).Dispose()
                                printfn "[urlCollector] URL collector is done."
                                supervisor.Post Done
                            | _ -> return! loop (count + 1)
                    | None ->
                        supervisor.Post Stop
                        return! loop count
                }
            loop 1)

    /// Initializes a crawling agent.
    let crawler id =
        MailboxProcessor.Start(fun inbox ->
            let rec loop() =
                async {
                    System.Threading.Thread.CurrentThread.IsBackground <- false
                    let! msg = inbox.Receive()
                    match msg with
                    | Url x ->
                        match x with
                        | Some url ->
                            let links = collectLinks url
                            printfn "%s crawled by agent %d." url id
                            for link in links do
                                urlCollector.Post <| Url (Some link)
                            supervisor.Post(Mailbox(inbox))
                            return! loop()
                        | None ->
                            supervisor.Post(Mailbox(inbox))
                            return! loop()
                    | _ ->
                        printfn "Agent %d is done." id
                        urlCollector.Post Done
                        (inbox :> IDisposable).Dispose()
                }
            loop())

    // Send 'Start' message to the main agent. The result
    // is an asynchronous workflow that will complete when the
    // agent crawling completes.
    let result = supervisor.PostAndAsyncReply(Start)

    // Spawn the crawlers.
    let crawlers =
        [
            for i in 1 .. Gate do
                yield crawler i
        ]

    // Post the first messages.
    crawlers.Head.Post <| Url (Some url)
    crawlers.Tail |> List.iter (fun ag -> ag.Post <| Url None)
    printfn "[Crawl] before return result"
    result

// Example:
printfn "[Main] before crawl"

crawl "http://news.google.com" 5
|> Async.RunSynchronously

printfn "[Main] after crawl"
I think I've sort of solved the problem by adding System.Threading.Thread.CurrentThread.IsBackground <- false after the let! in the printer agent.
However, I tried to modify the original code (the first version, before Tomas' AsyncReplyChannel fix) by adding System.Threading.Thread.CurrentThread.IsBackground <- false after every let!, and it still doesn't work. I have no idea why.
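My guess (only an assumption on my part, not something I have verified) is that the flag doesn't stick because each let!/do! continuation can resume on a different thread-pool thread, so IsBackground <- false only marks whichever thread happens to be running that particular continuation. A small sketch that shows the thread hopping:

open System.Threading

let hopDemo =
    async {
        printfn "before sleep: thread %d" Thread.CurrentThread.ManagedThreadId
        do! Async.Sleep 100
        // This line usually resumes on a different thread-pool thread than the
        // one above, so a flag set before the do! no longer applies here.
        printfn "after sleep:  thread %d" Thread.CurrentThread.ManagedThreadId
    }

Async.RunSynchronously hopDemo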
Thanks everyone for your help. I can finally start my first F# application, a batch process. I think MailboxProcessor should really set IsBackground to false by default. Is there any way to ask Microsoft to change it?
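An alternative I am considering instead of relying on IsBackground (just a sketch, not tested against the full crawler; the PrinterMessage type, Flush case, and printer2 name are my own invention) is to let the caller wait for the printer agent to drain its queue via a reply channel before the program exits:

type PrinterMessage =
    | Print of string
    | Flush of AsyncReplyChannel<unit>

let printer2 =
    MailboxProcessor.Start(fun inbox -> async {
        while true do
            let! msg = inbox.Receive()
            match msg with
            | Print s -> System.Console.WriteLine s
            | Flush reply -> reply.Reply() })

// Post Print messages as usual during the crawl, then before exiting call
//     printer2.PostAndReply Flush
// which blocks until every message posted earlier has been written, because
// the mailbox processes its messages in order.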
[Update] I just found out that the compiled assembly works well, but fsi --use:Program.fs --exec --quiet still behaves the same. Could this be a bug in fsi?