web, web-crawler, rapidminer, mining

RapidMiner Not Saving Crawl Web Results


I am trying to crawl the reviews for a particular movie from the IMDB website. For this I am using the Crawl Web operator, which I have embedded inside a Loop operator because there are 74 pages of reviews.

Attached are images of the configuration. Please help; I am badly stuck on this.

URL for Crawl Web is: http://www.imdb.com/title/tt0454876/reviews?start=%{pagePos}



Solution

  • When I tried it, I got 403 Forbidden errors because the IMDB service thought I was a robot. Using Loop with Crawl Web is bad practice because the Loop operator does not wait between iterations, so the requests arrive back to back.

    This process can be reduced to just the Crawl Web operator. The key parameters are:

    • URL - set this to http://www.imdb.com/title/tt0454876
    • max pages - set this to 79 or whatever number you need
    • max page size - set this to 1000
    • crawling rules - set these to the ones you specified
    • output dir - choose a folder to store things in

    This works because the Crawl Web operator starts from the given URL, works out every linked URL that matches the follow rules, and stores the pages that also match the store rules. Consecutive visits are delayed by 1000 ms (the delay parameter) to avoid triggering a robot exclusion at the server.

    Hope this gets you going as a start.
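
For readers who want to see the idea outside RapidMiner, the behaviour described above (follow links that match a rule, store the pages that match, pause between visits) can be sketched in Python. This is only an illustration under stated assumptions, not how the Crawl Web operator is implemented: the names `crawl`, `LinkParser`, `fetch`, `follow_rule` and `store_rule` are hypothetical, `max_pages` is assumed to cap the number of pages fetched, and the `example.test` site is an in-memory stand-in so that nothing is requested from a real server.

```python
import re
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, follow_rule, store_rule,
          max_pages=79, delay_s=1.0, sleep=time.sleep):
    """Breadth-first crawl; returns {url: html} for pages matching store_rule.

    fetch is any callable that maps a URL to its HTML text (an assumption of
    this sketch; it could wrap urllib.request in a real run).
    """
    follow = re.compile(follow_rule)
    store = re.compile(store_rule)
    queue = deque([start_url])
    seen = {start_url}
    stored = {}
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        if visited > 0:
            sleep(delay_s)  # pause between consecutive requests
        html = fetch(url)
        visited += 1
        if store.search(url):
            stored[url] = html  # keep only pages that match the store rule
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen and follow.search(link):
                seen.add(link)
                queue.append(link)
    return stored


if __name__ == "__main__":
    # Tiny in-memory "site" so the sketch runs without any network access.
    fake_site = {
        "http://example.test/title":
            '<a href="/title/reviews?start=0">reviews</a>',
        "http://example.test/title/reviews?start=0":
            '<a href="reviews?start=10">next</a>',
        "http://example.test/title/reviews?start=10": "last page",
    }
    pages = crawl("http://example.test/title", fake_site.__getitem__,
                  r"/reviews", r"/reviews\?start=\d+", delay_s=0.1)
    print(sorted(pages))
```

The pause sits inside the crawl loop itself, which is the piece the Loop operator lacks: every request after the first waits `delay_s` seconds before it is sent.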