I am using gocolly to harvest data from my website. The challenge is that gocolly is too aggressive when crawling the URLs. I have added a RandomDelay:
Update
Based on the answer, I changed
c.Limit(&colly.LimitRule{
    RandomDelay: 10 * time.Second,
})
To
c.Limit(&colly.LimitRule{
    RandomDelay: 10 * time.Second,
    Parallelism: 2,
    DomainGlob:  "*mysite*",
})
Before the change, it was crawling all of the pages within a couple of seconds:
Original output:
2021/02/04 08:17:33 Visiting https://www....
2021/02/04 08:17:33 Visiting https://www....
2021/02/04 08:17:34 Visiting https://www....
2021/02/04 08:17:34 Visiting https://www....
2021/02/04 08:17:34 Visiting https://www....
2021/02/04 08:17:34 Visiting https://www....
Output after the update:
2021/02/04 09:37:00 Visiting https://www...
2021/02/04 09:37:07 Visiting https://www...
2021/02/04 09:37:16 Visiting https://www...
What I am looking for is a way to ensure that gocolly doesn't crawl these pages any faster than, e.g., 5-10 seconds per page. The reason is that I don't want to see a performance spike on my site each time gocolly runs.
Adding a time.Sleep could be an option, but I'd rather use gocolly's Limit() if possible.
You forgot to set the DomainGlob parameter:
c.Limit(&colly.LimitRule{
    DomainGlob: "*",
    //Parallelism: 2,
    //Delay: 5 * time.Second,
})
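For reference, the docs describe RandomDelay as an extra randomized duration, between zero and the value you set, that is added on top of Delay before each request. That is why the updated output shows gaps of roughly 7-9 seconds rather than a fixed 10. To keep every page inside the 5-10 second window you asked for, you can combine a fixed Delay of 5 seconds with a RandomDelay of 5 seconds. Here is a minimal runnable sketch, assuming gocolly v2; the domain glob and start URL are placeholders:

package main

import (
    "log"
    "time"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Async collector: the LimitRule's Parallelism only matters in async mode.
    c := colly.NewCollector(colly.Async(true))

    // Limit returns an error (e.g. for an invalid glob), so check it.
    err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*mysite*",      // placeholder; "*" would match every domain
        Parallelism: 1,               // one request in flight at a time
        Delay:       5 * time.Second, // fixed minimum wait between requests
        RandomDelay: 5 * time.Second, // plus 0-5s of jitter => 5-10s per page
    })
    if err != nil {
        log.Fatal(err)
    }

    c.OnRequest(func(r *colly.Request) {
        log.Println("Visiting", r.URL)
    })

    c.Visit("https://www.example.com") // placeholder start URL
    c.Wait()                          // block until all queued requests finish
}

Also note that with Parallelism above 1 several requests can be in flight at once, so the gaps between log lines can be shorter than the per-request delay; set it to 1 if you want the delay enforced strictly between consecutive requests.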