Crawl URLs with a certain prefix

Using crawler4j, I would like to crawl only URLs that start with a certain prefix.

For example, a URL that starts with http://url1.com/timer/image is valid, e.g. http://url1.com/timer/image/text.php.

This URL is not valid: http://test1.com/timer/image

I tried to implement it like this:

public boolean shouldVisit(Page page, WebURL url) {
    String href = url.getURL().toLowerCase();
    String adrs1 = "http://url1.com/timer/image";
    String adrs2 = "http://url2.com/house/image";

    if (!(href.startsWith(adrs1)) || !(href.startsWith(adrs2))) {
        return false;
    }

    if (filters.matcher(href).matches()) {
        return false;
    }

    for (String crawlDomain : myCrawlDomains) {
        if (href.startsWith(crawlDomain)) {
            return true;
        }
    }

    return false;
}

However, this does not seem to work, because the crawler also visits other URLs.

Any recommendation as to what I could do?

I appreciate your answer!

Solution

Basically, you can keep an array of the allowed prefixes that you want to crawl. Inside your method, just traverse the array and return true only if the URL matches one of the allowed prefixes. That means you don't have to list any domains that you don't want to crawl.

public boolean shouldVisit(Page page, WebURL url) {
    String href = url.getURL().toLowerCase();
    // Prefixes that you want to crawl (the ones from the question).
    String[] allowedPrefixes = {"http://url1.com/timer/image", "http://url2.com/house/image"};

    // Visit the URL only if it starts with one of the allowed prefixes.
    for (String allowedPrefix : allowedPrefixes) {
        if (href.startsWith(allowedPrefix)) {
            return true;
        }
    }

    return false;
}
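If you want to try the prefix check outside the crawler, here is a minimal standalone sketch. It does not use crawler4j at all; the class name PrefixFilter and the method isAllowed are made up for illustration, and the prefixes are the two from the question.

```java
import java.util.Locale;

// Hypothetical helper (not part of crawler4j): the same prefix check as a plain static method.
public class PrefixFilter {
    // Prefixes taken from the question; replace them with your own.
    private static final String[] ALLOWED_PREFIXES = {
        "http://url1.com/timer/image",
        "http://url2.com/house/image"
    };

    // Returns true only if the lower-cased URL starts with one of the allowed prefixes.
    public static boolean isAllowed(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        for (String prefix : ALLOWED_PREFIXES) {
            if (href.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("http://url1.com/timer/image/text.php")); // true
        System.out.println(isAllowed("http://test1.com/timer/image"));         // false
    }
}
```

Inside shouldVisit you could then simply return PrefixFilter.isAllowed(url.getURL()), assuming you keep such a helper in your project.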

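Your code does not work because this condition is incorrect:

!(href.startsWith(adrs1)) || !(href.startsWith(adrs2))

A URL cannot start with both adrs1 and adrs2, so at least one of the two negations is always true and the method returns false for every URL. To reject only URLs that match neither prefix, combine the checks with && instead of ||.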
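To see the difference between the || and && forms of this check, the two conditions can be compared in isolation. This is only a sketch: the class ConditionDemo and its method names are invented for the demonstration, while adrs1 and adrs2 are the values from the question.

```java
// Hypothetical demo class: compares the original rejection test with an && variant.
public class ConditionDemo {
    static final String ADRS1 = "http://url1.com/timer/image";
    static final String ADRS2 = "http://url2.com/house/image";

    // Original test: true whenever the URL fails to match at least one prefix,
    // which is the case for every URL because the two prefixes differ.
    static boolean rejectedByOriginal(String href) {
        return !(href.startsWith(ADRS1)) || !(href.startsWith(ADRS2));
    }

    // && variant: true only when the URL matches neither prefix.
    static boolean rejectedByFixed(String href) {
        return !href.startsWith(ADRS1) && !href.startsWith(ADRS2);
    }

    public static void main(String[] args) {
        String wanted = "http://url1.com/timer/image/text.php";
        String unwanted = "http://test1.com/timer/image";
        System.out.println(rejectedByOriginal(wanted)); // true: even the wanted URL is rejected
        System.out.println(rejectedByFixed(wanted));    // false
        System.out.println(rejectedByFixed(unwanted));  // true
    }
}
```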

Another possible reason is that you have not configured the crawl domains (myCrawlDomains in your code). They are configured during startup of your application by calling CrawlController#setCustomData(crawler1Domains);

Have a look at the crawler4j sample source code; the crawl domains are set here: MultipleCrawlerController.java#79