
Expert F# web crawler example


I'm trying to work through an example in Expert F#, which is based on v1.9.2, but the CTP releases since then have changed enough that some of the examples don't even compile anymore.

I'm running into some trouble with listing 13-13. Here's the snippet of the urlCollector object definition:

let urlCollector =
    MailboxProcessor.Start(fun self ->
        let rec waitForUrl (visited : Set<string>) =
            async { if visited.Count < limit then
                        let! url = self.Receive()
                        if not (visited.Contains(url)) then
                            do! Async.Start
                                (async { let! links = collectLinks url
                                         for link in links do
                                         do self <-- link })

                        return! waitForUrl(visited.Add(url)) }

            waitForUrl(Set.Empty))

I'm compiling with Version 1.9.6.16, and the compiler complains thusly:

  1. incomplete structured construct at or before this point in expression [after the last paren]
  2. error in the return expression for this 'let'. Possible incorrect indentation [refers to the let defining waitForUrl]

Can anyone spot what's going wrong here?


Solution

  • It looks like the last line needs to be unindented 4 spaces.
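
    For instance, in this minimal sketch (hypothetical names, current F# syntax), the final loop 0 has to sit at the same column as the let rec; indenting it any further is what produces the "incomplete structured construct" error:

    let startLoop () =
        let rec loop (n : int) =
            async { if n < 3 then
                        do printfn "step %d" n
                        return! loop (n + 1) }
        // Dedented to the column of 'let rec loop', so it is the return
        // value of startLoop rather than part of loop's own definition
        loop 0

    Async.RunSynchronously (startLoop ())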

    EDIT: actually, it looks like there's more going on here. Assuming this is the same sample as here, here's a version I just modified to be in sync with the 1.9.6.16 release:

    open System.Collections.Generic
    open System.Net
    open System.IO
    open System.Threading
    open System.Text.RegularExpressions
    
    let limit = 10    
    
    let linkPat = "href=\s*\"[^\"h]*(http://[^&\"]*)\""
    let getLinks (txt:string) =
        [ for m in Regex.Matches(txt,linkPat)  -> m.Groups.Item(1).Value ]
    
    let (<--) (mp: MailboxProcessor<_>) x = mp.Post(x)
    
    // A type that helps limit the number of active web requests
    type RequestGate(n:int) =
        let semaphore = new Semaphore(initialCount=n,maximumCount=n)
        member x.AcquireAsync(?timeout) =
            async { let! ok = semaphore.AsyncWaitOne(?millisecondsTimeout=timeout)
                    if ok then
                       return
                         { new System.IDisposable with
                             member x.Dispose() =
                                 semaphore.Release() |> ignore }
                    else
                       return! failwith "couldn't acquire a semaphore" }
    
    // Gate the number of active web requests
    let webRequestGate = RequestGate(5)
    
    // Fetch the URL, and post the results to the urlCollector.
    let collectLinks (url:string) =
        async { // An Async web request with a global gate
                let! html =
                    async { // Acquire an entry in the webRequestGate. Release
                            // it when 'holder' goes out of scope
                            use! holder = webRequestGate.AcquireAsync()
    
                            // Wait for the WebResponse
                            let req = WebRequest.Create(url,Timeout=5000)  // Timeout is in milliseconds
    
                            use! response = req.AsyncGetResponse()
    
                            // Get the response stream
                            use reader = new StreamReader(
                                response.GetResponseStream())
    
                            // Read the response stream
                            return! reader.AsyncReadToEnd()  }
    
                // Compute the links, synchronously
                let links = getLinks html
    
                // Report, synchronously
                do printfn "finished reading %s, got %d links" 
                        url (List.length links)
    
                // We're done
                return links }
    
    let urlCollector =
        MailboxProcessor.Start(fun self ->
            let rec waitForUrl (visited : Set<string>) =
                async { if visited.Count < limit then
                            let! url = self.Receive()
                            if not (visited.Contains(url)) then
                                Async.Start 
                                    (async { let! links = collectLinks url
                                             for link in links do
                                                 do self <-- link })
                            return! waitForUrl(visited.Add(url)) }
    
            waitForUrl(Set.Empty))
    
    urlCollector <-- "http://news.google.com"
    // wait for keypress to end program
    System.Console.ReadKey() |> ignore
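
    For completeness, the other substantive change from the snippet in the question is how the background task gets spawned. Async.Start returns plain unit, while do! expects an Async<unit> on its right-hand side, so do! Async.Start (...) no longer type-checks; the call becomes an ordinary statement (and the body of the inner for loop has to be indented under the for, not level with it). Here's a minimal self-contained sketch of that pattern, with a hypothetical echo agent standing in for urlCollector:

    let echoAgent =
        MailboxProcessor.Start(fun self ->
            let rec loop () =
                async { let! msg = self.Receive()
                        // Spawn the child task as a plain statement;
                        // 'do! Async.Start (...)' would be a type error,
                        // since do! needs an Async<unit> and Async.Start
                        // returns unit
                        Async.Start (async { do printfn "got %s" msg })
                        return! loop () }
            loop ())

    echoAgent.Post "hello"
    System.Console.ReadKey() |> ignore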