I would like to use crawler4j to crawl only URLs that have a certain prefix.
For example, a URL that starts with http://url1.com/timer/image is valid, e.g. http://url1.com/timer/image/text.php.
This URL is not valid: http://test1.com/timer/image
I tried to implement it like this:
public boolean shouldVisit(Page page, WebURL url) {
    String href = url.getURL().toLowerCase();
    String adrs1 = "http://url1.com/timer/image";
    String adrs2 = "http://url2.com/house/image";
    if (!(href.startsWith(adrs1)) || !(href.startsWith(adrs2))) {
        return false;
    }
    if (filters.matcher(href).matches()) {
        return false;
    }
    for (String crawlDomain : myCrawlDomains) {
        if (href.startsWith(crawlDomain)) {
            return true;
        }
    }
    return false;
}
However, this does not seem to work, because the crawler also visits other URLs.
Any recommendation on what I could do?
I appreciate your answer!
Basically, you can keep an array of the prefixes that you want to crawl. Inside your method, just traverse the array and return true only if the URL matches one of the allowed prefixes. That means you don't have to list any domains that you don't want to crawl.
public boolean shouldVisit(Page page, WebURL url) {
    String href = url.getURL().toLowerCase();
    // prefixes that you want to crawl
    String[] allowedPrefixes = {"http://url1.com", "http://url2.com"};
    for (String allowedPrefix : allowedPrefixes) {
        if (href.startsWith(allowedPrefix)) {
            return true;
        }
    }
    return false;
}
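Your code is not working because your condition is incorrect:
if (!(href.startsWith(adrs1)) || !(href.startsWith(adrs2)))
A URL cannot start with both prefixes at once, so at least one of the two negations is always true and the method returns false for every URL. To reject a URL only when it matches neither prefix, combine the two checks with && instead of ||.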
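To see the effect of the operator in isolation, here is a minimal sketch in plain Java, outside crawler4j. The class name PrefixCheck and its two method names are hypothetical and exist only for illustration; the prefixes are the ones from the question.

```java
import java.util.Locale;

// Hypothetical helper (not part of crawler4j) that isolates the prefix check.
class PrefixCheck {
    static final String ADRS1 = "http://url1.com/timer/image";
    static final String ADRS2 = "http://url2.com/house/image";

    // Condition as written in the question: true whenever the URL does not
    // start with BOTH prefixes, which is the case for every URL.
    static boolean rejectedByOriginal(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        return !href.startsWith(ADRS1) || !href.startsWith(ADRS2);
    }

    // Corrected condition: true only when the URL starts with NEITHER prefix.
    static boolean rejectedByFixed(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        return !href.startsWith(ADRS1) && !href.startsWith(ADRS2);
    }
}
```

With the || version, even http://url1.com/timer/image/text.php is rejected. With the && version, only URLs that match neither prefix, such as http://test1.com/timer/image, are rejected.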
Another possible reason is that you have not configured the crawl domains (myCrawlDomains in your code). They are set when your application starts, by calling CrawlController#setCustomData(crawler1Domains);
See the crawler4j sample source code, where the crawl domains are set: MultipleCrawlerController.java#79