I'm writing a scraper using FSharp.Collections.ParallelSeq and a retry computation expression. I would like to retrieve HTML from multiple pages in parallel, and I would like to retry requests when they fail.
For example:
open System
open FSharp.Collections.ParallelSeq

type RetryBuilder(max) =
    member x.Return(a) = a // Enable 'return'
    member x.Delay(f) = f // Gets wrapped body and returns it (as it is)
                          // so that the body is passed to 'Run'
    member x.Zero() = failwith "Zero" // Support if .. then
    member x.Run(f) = // Gets function created by 'Delay'
        let rec loop(n) =
            if n = 0 then failwith "Failed" // Number of retries exceeded
            else try f() with _ -> loop(n-1)
        loop max

let retry = RetryBuilder(4)

let getHtml (url : string) = retry {
    Console.WriteLine("Get Url")
    return 0;
}

//A property/field?
let GetHtmlForAllPages =
    let pages = {1 .. 10}
    let allHtml = pages |> PSeq.map(fun x -> getHtml("http://somesite.com/" + x.ToString())) |> Seq.toArray
    allHtml

[<EntryPoint>]
let main argv =
    let htmlForAllPages = GetHtmlForAllPages
    0 // return an integer exit code
When I try to access GetHtmlForAllPages from main, the code seems to hang. Stepping through the code shows me that PSeq.map begins work on the first four values of pages.
What's going on that causes the retry computation expression to never start/complete? Is there some weird interplay between PSeq and retry?
The code works as expected if I make GetHtmlForAllPages a function and invoke it. I'm curious what's going on when GetHtmlForAllPages is a field?
Looks like you're deadlocking within a static constructor. The scenario is described here:
The CLR uses an internal lock to ensure that static constructor:
- is only called once
- gets executed before creation of any instance of the class or before accessing any static members.
With this behaviour of CLR, there is a potential opportunity of a deadlock if we perform any asynchronous blocking operation in a static constructor. (...)
The main thread will wait for the helper thread to complete within the static constructor. Since the helper thread is accessing the instance method, it will first try to acquire the internal lock. As internal lock is already acquired by the main thread, we will end-up in a deadlock situation.
Using Parallel LINQ (or any other similar library like FSharp.Collections.ParallelSeq) in a static constructor will make you run into that problem.
Unfortunately, a static constructor of a compiler-generated class is what you get for your GetHtmlForAllPages value. From ILSpy (with C# formatting):
namespace <StartupCode$ConsoleApplication1>
{
    internal static class $Program
    {
        [DebuggerBrowsable(DebuggerBrowsableState.Never)]
        internal static readonly Program.RetryBuilder retry@17;

        [DebuggerBrowsable(DebuggerBrowsableState.Never)]
        internal static readonly int[] GetHtmlForAllPages@24;

        [DebuggerBrowsable(DebuggerBrowsableState.Never), DebuggerNonUserCode, CompilerGenerated]
        internal static int init@;

        static $Program()
        {
            $Program.retry@17 = new Program.RetryBuilder(4);
            IEnumerable<int> pages = Operators.OperatorIntrinsics.RangeInt32(1, 1, 10);
            ParallelQuery<int> parallelQuery = PSeqModule.map<int, int>(new Program.allHtml@26(), pages);
            ParallelQuery<int> parallelQuery2 = parallelQuery;
            int[] allHtml = SeqModule.ToArray<int>((IEnumerable<int>)parallelQuery2);
            $Program.GetHtmlForAllPages@24 = allHtml;
        }
    }
}
and in your actual Program class:
[CompilationMapping(SourceConstructFlags.Value)]
public static int[] GetHtmlForAllPages
{
    get
    {
        return $Program.GetHtmlForAllPages@24;
    }
}
That's where the deadlock is coming from.
As soon as you change GetHtmlForAllPages to be a function (by adding ()), it is no longer part of that static constructor, which makes the program work as expected.
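For reference, here is a minimal sketch of that change, reusing the builder from the question. The name getHtmlForAllPages and the printfn line are only illustrative, and the sketch assumes the FSharp.Collections.ParallelSeq package is referenced by the project. The only meaningful difference from the question's code is the () parameter, which moves the parallel work out of the static initializer and into the call made from main.

```fsharp
open System
open FSharp.Collections.ParallelSeq

// Same retry builder as in the question.
type RetryBuilder(max) =
    member x.Return(a) = a
    member x.Delay(f) = f
    member x.Zero() = failwith "Zero"
    member x.Run(f) =
        let rec loop n =
            if n = 0 then failwith "Failed" // Number of retries exceeded
            else try f() with _ -> loop (n - 1)
        loop max

let retry = RetryBuilder(4)

let getHtml (url : string) = retry {
    Console.WriteLine("Get Url")
    return 0
}

// A function rather than a value: the body runs each time it is called,
// so PSeq.map is no longer executed as part of the static constructor.
let getHtmlForAllPages () =
    {1 .. 10}
    |> PSeq.map (fun x -> getHtml ("http://somesite.com/" + x.ToString()))
    |> Seq.toArray

[<EntryPoint>]
let main argv =
    // The parallel work starts here, after static initialization has finished.
    let htmlForAllPages = getHtmlForAllPages ()
    printfn "Fetched %d pages" htmlForAllPages.Length
    0 // return an integer exit code
```

Note that a function re-runs the requests on every call, so if the result is needed more than once, bind it to a local value inside main rather than calling the function repeatedly.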