I am trying to crawl the reviews for a particular movie from the IMDB website. For this I am using the Crawl Web operator, which I have embedded inside a Loop operator because there are 74 pages.
Attached are images of the configuration. Please help, I am badly stuck on this.
URL for Crawl Web is: http://www.imdb.com/title/tt0454876/reviews?start=%{pagePos}
When I tried it, I got 403 Forbidden errors because the IMDB service thinks I am a robot. Using Loop with Crawl Web is bad practice because the Loop operator does not implement any waiting.
This process can be reduced to just the Crawl Web operator. The key parameters are:
This works because the Crawl Web operator works out all possible URLs that match the crawling rules and stores those pages that also match the store rules. Successive visits are delayed by 1000 ms (the delay parameter) to avoid triggering a robot exclusion at the server.
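For anyone who wants to see the same idea outside RapidMiner, here is a minimal Python sketch of "store only matching URLs, and wait between visits". It is not how Crawl Web is implemented. The `crawl` function, the `STORE_RULE` pattern and the `fetch`/`sleep` parameters are all names I made up for illustration, and the pattern is only my guess at a rule that would match the review pages from the URL above.

```python
import re
import time
from typing import Callable, Dict, Iterable

# Hypothetical store rule (an assumption): keep only the review pages of this title.
STORE_RULE = re.compile(
    r"^http://www\.imdb\.com/title/tt0454876/reviews\?start=\d+$"
)


def crawl(urls: Iterable[str],
          fetch: Callable[[str], str],
          delay_ms: int = 1000,
          sleep: Callable[[float], None] = time.sleep) -> Dict[str, str]:
    """Fetch each URL that matches STORE_RULE, pausing delay_ms between requests."""
    pages: Dict[str, str] = {}
    first = True
    for url in urls:
        if not STORE_RULE.match(url):
            continue  # URL does not match the store rule, so skip it
        if not first:
            sleep(delay_ms / 1000.0)  # wait before every request after the first
        first = False
        pages[url] = fetch(url)  # fetch is supplied by the caller
    return pages
```

The `fetch` argument is whatever function you use to download a page, and `sleep` is only a parameter so the delay can be swapped out when trying the function without real waiting. A delay like this does not guarantee the server will accept the requests; it only avoids hammering it the way an undelayed loop does.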
Hope this gets you going as a start.