java web-crawler google-crawlers crawler4j

Crawler4j: some URLs are crawled without issue while others are not crawled at all


I have been playing around with Crawler4j and have successfully had it crawl some pages, but have had no success crawling others. For example, I have gotten it to successfully crawl Reddit with this code:

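import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {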
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "//home/user/Documents/Misc/Crawler/test";
        int numberOfCrawlers = 1;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed urls. These are the first
         * URLs that are fetched and then the crawler starts following links
         * which are found in these pages
         */
        controller.addSeed("https://www.reddit.com/r/movies");
        controller.addSeed("https://www.reddit.com/r/politics");


        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}

And with:

@Override
 public boolean shouldVisit(Page referringPage, WebURL url) {
     String href = url.getURL().toLowerCase();
     return !FILTERS.matcher(href).matches()
            && href.startsWith("https://www.reddit.com/");
 }

in MyCrawler.java. However, when I try to crawl http://www.ratemyprofessors.com/, the program just hangs without output and does not crawl anything. I use the following seeds, in the same way as above, in myController.java:

controller.addSeed("http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222");
controller.addSeed("http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044");

And in MyCrawler.java:

 @Override
 public boolean shouldVisit(Page referringPage, WebURL url) {
     String href = url.getURL().toLowerCase();
     return !FILTERS.matcher(href).matches()
            && href.startsWith("http://www.ratemyprofessors.com/");
 }

So I am wondering:

  • Are some servers able to recognize crawlers right away and not allow them to collect data?
  • I noticed that the RateMyProfessor pages are in .jsp format; could this have anything to do with it?
  • Are there any ways in which I could debug this better? The console does not output anything.

Solution

  • crawler4j respects crawler politeness rules such as robots.txt. In your case, this is the robots.txt file served by www.ratemyprofessors.com.

    Inspecting this file reveals that crawling your seed URLs is disallowed:

     Disallow: /ShowRatings.jsp
     Disallow: /campusRatings.jsp

    This theory is supported by the crawler4j log output:

    2015-12-15 19:47:18,791 WARN  [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222
    2015-12-15 19:47:18,793 WARN  [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044
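
To see why both seeds are rejected, the prefix check behind a Disallow rule can be sketched in plain Java. This is a simplified illustration, not crawler4j's actual implementation: the class name RobotsRuleCheck and the method isAllowed are hypothetical, and the sketch ignores User-agent groups, Allow rules, and wildcards. It only tests the two Disallow lines quoted above, so a URL it reports as allowed may still be blocked by other rules in the real file.

```java
import java.net.URI;
import java.util.List;

// Hypothetical helper for illustration only; not part of crawler4j.
public class RobotsRuleCheck {

    // Returns true if the URL's path does not start with any Disallow prefix.
    // Simplification: no User-agent groups, no Allow rules, no wildcards.
    static boolean isAllowed(String url, List<String> disallowedPrefixes) {
        // getPath() drops the query string, e.g. "?sid=1222".
        String path = URI.create(url).getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        for (String prefix : disallowedPrefixes) {
            // An empty Disallow value means "nothing is disallowed".
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The two rules quoted from the site's robots.txt above.
        List<String> disallowed = List.of("/ShowRatings.jsp", "/campusRatings.jsp");

        String[] seeds = {
            "http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222",
            "http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044"
        };
        for (String seed : seeds) {
            String verdict = isAllowed(seed, disallowed) ? "allowed:    " : "disallowed: ";
            System.out.println(verdict + seed);
        }
    }
}
```

Running a check like this on the seeds before starting the crawl makes the cause visible even when no logging backend is configured and the crawler4j warnings never reach the console.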