I'm new to Go and trying to build a system that fetches content from a set of URLs and extracts specific lines with a regex. The problems start when I wrap the code in goroutines: I get a different number of regex results on each run, and many of the fetched lines are duplicates.
max_routines := 3
sem := make(chan int, max_routines) // to control the number of working routines
var wg sync.WaitGroup
ch_content := make(chan string)
client := http.Client{}

for i := 2; ; i++ {
    // for testing
    if i > 5 {
        break
    }
    // loop should be broken if feebbacks_checstr is found in content
    if loop_break {
        break
    }
    wg.Add(1)
    go func(i int) {
        defer wg.Done()
        sem <- 1 // will block if > max_routines
        final_url = url + a.tm_id + "/page=" + strconv.Itoa(i)
        resp, _ := client.Get(final_url)
        var bodyString string
        if resp.StatusCode == http.StatusOK {
            bodyBytes, _ := ioutil.ReadAll(resp.Body)
            bodyString = string(bodyBytes)
        }
        // checking for stop word in content
        if false == strings.Contains(bodyString, feebbacks_checstr) {
            res2 = regex.FindAllStringSubmatch(bodyString, -1)
            for _, v := range res2 {
                ch_content <- v[1]
            }
        } else {
            loop_break = true
        }
        resp.Body.Close()
        <-sem
    }(i)
}

for {
    select {
    case r := <-ch_content:
        a.feedbacks = append(a.feedbacks, r) // collecting the data
    case <-time.After(500 * time.Millisecond):
        show(len(a.feedbacks)) // < always different result, many entries in a.feedbacks are duplicates
        fmt.Printf(".")
    }
}
As a result, len(a.feedbacks) is sometimes 130 and sometimes 139, and a.feedbacks contains duplicates. If I remove the duplicates, the number of results is about half of what I'm expecting (109 without duplicates).
You're creating a closure by using an anonymous goroutine function. I notice your final_url isn't declared with := but assigned with =, which means it's defined outside the closure. All goroutines share the same final_url variable, so there's a race condition: some goroutines overwrite final_url before other goroutines have made their requests, and this results in duplicates. If you define final_url inside the goroutine, they won't step on each other's toes and it should work as you expect.
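For example, a minimal sketch of that fix, keeping the surrounding variables (url, a, sem, wg, client) from your snippet:

go func(i int) {
    defer wg.Done()
    sem <- 1
    defer func() { <-sem }() // release the semaphore on every exit path
    // := declares a new final_url local to this goroutine,
    // so concurrent goroutines no longer share one variable
    final_url := url + a.tm_id + "/page=" + strconv.Itoa(i)
    resp, err := client.Get(final_url)
    if err != nil {
        return
    }
    defer resp.Body.Close()
    // ... rest of the body unchanged
}(i)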
That's the simple fix for what you have. A more idiomatic Go approach would be to create an input channel (containing the URLs to request) and an output channel (eventually containing whatever you're pulling out of the responses), and instead of trying to manage the life and death of dozens of goroutines, keep a constant number of goroutines alive that drain the input channel.
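Here's a self-contained sketch of that worker-pool pattern. The three-worker count and the page range mirror your snippet; the example.com URL and the per-page result are placeholders (the regex step is left out, each worker just reports the body size):

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "sync"
)

// worker drains the urls channel until it is closed, sending one result
// per page on out. A fixed pool of these replaces per-page goroutines.
func worker(client *http.Client, urls <-chan string, out chan<- string, wg *sync.WaitGroup) {
    defer wg.Done()
    for u := range urls {
        resp, err := client.Get(u)
        if err != nil {
            continue
        }
        body, err := ioutil.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
            continue
        }
        // real code would run regex.FindAllStringSubmatch here instead
        out <- fmt.Sprintf("%s: %d bytes", u, len(body))
    }
}

func main() {
    const numWorkers = 3
    urls := make(chan string)
    out := make(chan string)
    client := &http.Client{}

    var wg sync.WaitGroup
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go worker(client, urls, out, &wg)
    }

    // close out once every worker has finished, so the range below ends
    go func() {
        wg.Wait()
        close(out)
    }()

    // feed the pages in; closing urls tells the workers to stop
    go func() {
        for i := 2; i <= 5; i++ {
            urls <- fmt.Sprintf("https://example.com/page=%d", i)
        }
        close(urls)
    }()

    var feedbacks []string
    for r := range out {
        feedbacks = append(feedbacks, r)
    }
    fmt.Println(len(feedbacks))
}

With this shape, each result is sent exactly once and shutdown is deterministic: closing urls stops the workers and closing out ends the collection loop, so you don't need the time.After polling at all.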