I have a list of URLs that I am scraping. What I want to do is store all of the successfully scraped page data in a channel and, when I am done, dump it into a slice. I don't know how many successful fetches I will get, so I cannot specify a fixed length. I expected the code to reach wg.Wait() and then wait until all of the wg.Done() calls were made, but the close(queue) statement was never reached. Looking for a similar answer, I came across this SO answer (https://stackoverflow.com/a/31573574/5721702), where the author does something similar:
ports := make(chan string)
toScan := make(chan int)
var wg sync.WaitGroup

// make 100 workers for dialing
for i := 0; i < 100; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for p := range toScan {
            ports <- worker(*host, p)
        }
    }()
}

// close our receiving ports channel once all workers are done
go func() {
    wg.Wait()
    close(ports)
}()
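The linked answer's snippet is not self-contained (worker and host are defined elsewhere in it). A runnable sketch of the same pattern, with a stand-in scanPorts function of my own and the result formatting purely illustrative, looks like this:

```go
package main

import (
	"fmt"
	"sync"
)

// scanPorts runs a pool of workers over a list of ports and collects
// one result string per port. The "scan" itself is a stand-in.
func scanPorts(ports []int) []string {
	results := make(chan string)
	toScan := make(chan int)
	var wg sync.WaitGroup

	// A small pool of workers reading port numbers from toScan.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range toScan {
				results <- fmt.Sprintf("port %d: ok", p)
			}
		}()
	}

	// Close the results channel once every worker has returned.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Feed work in another goroutine so this one is free to receive.
	go func() {
		for _, p := range ports {
			toScan <- p
		}
		close(toScan)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Println(len(scanPorts([]int{80, 443, 8080}))) // one result per port
}
```

The key move is the same as in the answer: the goroutine that calls wg.Wait() and close(results) runs concurrently with the range loop that drains the channel, so the workers' sends can complete.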
As soon as I wrapped my wg.Wait() inside a goroutine, close(queue) was reached:
urls := getListOfURLS()
activities := make([]Activity, 0, limit)
queue := make(chan Activity)
for i, activityURL := range urls {
    wg.Add(1)
    go func(i int, url string) {
        defer wg.Done()
        activity, err := extractDetail(url)
        if err != nil {
            log.Println(err)
            return
        }
        queue <- activity
    }(i, activityURL)
}

// calling it like this without the goroutine causes the execution to hang
// wg.Wait()
// close(queue)

// calling it like this successfully waits
go func() {
    wg.Wait()
    close(queue)
}()

for a := range queue {
    // block until an activity is added to the queue;
    // once all are added, the channel is closed and the loop exits
    activities = append(activities, a)
}
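For reference, here is the whole working pattern as a self-contained program. Activity, extractDetail, and the URL values are stand-ins of my own, not the question's real types; only the channel/WaitGroup wiring is the point:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Activity and extractDetail are illustrative stand-ins for the
// question's scraping types; an empty URL simulates a failed fetch.
type Activity struct{ URL string }

func extractDetail(url string) (Activity, error) {
	if url == "" {
		return Activity{}, errors.New("empty url")
	}
	return Activity{URL: url}, nil
}

func scrape(urls []string) []Activity {
	queue := make(chan Activity)
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(url string) {
			defer wg.Done()
			activity, err := extractDetail(url)
			if err != nil {
				return // skip failed fetches
			}
			queue <- activity
		}(u)
	}

	// Close queue only after every sender has finished.
	go func() {
		wg.Wait()
		close(queue)
	}()

	activities := make([]Activity, 0, len(urls))
	for a := range queue {
		activities = append(activities, a)
	}
	return activities
}

func main() {
	fmt.Println(len(scrape([]string{"a", "b", "", "c"}))) // 3 successes
}
```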
Why does the code not reach close(queue) if I don't use a goroutine for wg.Wait()? I would think that all of the defer wg.Done() statements are called, so eventually the wait would clear, because the code gets to wg.Wait(). Does it have to do with receiving values from my channel?
You need to wait for the goroutines to finish in a separate goroutine, because queue needs to be read from while they run. When you do the following:
queue := make(chan Activity)
for i, activityURL := range urls {
    wg.Add(1)
    go func(i int, url string) {
        defer wg.Done()
        activity, err := extractDetail(url)
        if err != nil {
            log.Println(err)
            return
        }
        queue <- activity // nothing is reading data from queue.
    }(i, activityURL)
}
wg.Wait()
close(queue)
for a := range queue {
    activities = append(activities, a)
}
Each goroutine blocks at queue <- activity, since queue is unbuffered and nothing is reading from it: the range loop over queue sits in the main goroutine, after wg.Wait().
wg.Wait() will only unblock once all of the goroutines return. But, as mentioned, all of the goroutines are blocked at the channel send, so the program deadlocks.
When you use a separate goroutine to wait, execution actually reaches the range loop over queue:
// wg.Wait does not block the main goroutine.
go func() {
    wg.Wait()
    close(queue)
}()
This unblocks the goroutines at the queue <- activity statement (the main goroutine starts reading off queue), and they run to completion, each calling its deferred wg.Done(). Once the waiting goroutine gets past wg.Wait(), queue is closed, and the main goroutine exits the range loop.
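An alternative that avoids the extra goroutine entirely, when an upper bound on the number of results is known (here, len(urls)), is to give the channel that much buffer capacity. Every send then completes without waiting for a receiver, so wg.Wait() and close() can safely run on the main goroutine. A minimal sketch with illustrative names:

```go
package main

import (
	"fmt"
	"sync"
)

func collect(urls []string) []string {
	// Buffered: every goroutine can send without a receiver being ready.
	queue := make(chan string, len(urls))
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(url string) {
			defer wg.Done()
			queue <- "fetched " + url
		}(u)
	}
	wg.Wait()    // safe: no sender blocks, so every Done call happens
	close(queue) // now the range loop below can terminate
	var out []string
	for s := range queue {
		out = append(out, s)
	}
	return out
}

func main() {
	fmt.Println(len(collect([]string{"a", "b", "c"}))) // 3
}
```

The trade-off is that results only start draining after all sends finish, whereas the unbuffered version with the waiting goroutine streams results as they arrive.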