Tags: java, optimization, web-crawler, feedback

Webcrawler, feedback?


Hey folks, every once in a while I need to automate data-collection tasks from websites. Sometimes I need a bunch of URLs from a directory, sometimes I need an XML sitemap (yes, I know there are plenty of tools and online services for that).

Anyway, as a follow-up to my previous question, I've written a little web crawler that can visit websites.

  • A basic crawler class for quickly and easily interacting with a single website.

  • Override "doAction(String URL, String content)" to process the content further (e.g. store it or parse it).

  • The design allows multiple crawlers to run in parallel: all class instances share the processed and queued link collections (see the sketch after this list).

  • Instead of keeping track of processed and queued links inside the object, a JDBC connection could be established to store the links in a database.

  • It is currently limited to one website at a time; however, it could be extended with an externalLinks stack that is filled as external links are encountered.

  • JCrawler is intended for quickly generating XML sitemaps or parsing websites for the information you want. It is lightweight.
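To make the shared-state design concrete, here is a minimal, hypothetical sketch of the skeleton described above. It is not the pastebin code: the class name SimpleCrawler, the java.net.http client (Java 11+), and the concurrent collections are my assumptions; only the doAction(String, String) hook mirrors the description.

    import java.io.IOException;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Queue;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Hypothetical skeleton; names do not come from the linked JCrawler code.
    public class SimpleCrawler implements Runnable {

        // Shared across all crawler instances so threads never fetch the same URL twice.
        private static final Set<String> processed = ConcurrentHashMap.newKeySet();
        private static final Queue<String> queued = new ConcurrentLinkedQueue<>();

        private final HttpClient client = HttpClient.newHttpClient();

        public SimpleCrawler(String seedUrl) {
            queued.add(seedUrl);
        }

        @Override
        public void run() {
            String url;
            // Note: a production version needs a smarter termination condition,
            // since the queue can be momentarily empty while other threads still work.
            while ((url = queued.poll()) != null) {
                if (!processed.add(url)) {
                    continue; // another thread already handled this URL
                }
                try {
                    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
                    HttpResponse<String> response =
                            client.send(request, HttpResponse.BodyHandlers.ofString());
                    doAction(url, response.body());
                    // A real implementation would extract same-site links from
                    // response.body() here and offer them to 'queued'.
                } catch (IOException | InterruptedException e) {
                    System.err.println("Failed to fetch " + url + ": " + e.getMessage());
                }
            }
        }

        // Override this to store, parse, or otherwise process each page.
        protected void doAction(String url, String content) {
            System.out.println("Fetched " + url + " (" + content.length() + " chars)");
        }
    }

Starting several threads over this class, e.g. new Thread(new SimpleCrawler("https://example.com/")).start(), lets them share the work; the concurrent collections make deduplication safe without explicit locking. The two static collections are also the natural seam for the JDBC idea above: swap them for a links table and the crawl loop stays the same.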

Is this a good/decent way to write a crawler, given the limitations above? Any input would help immensely :)

http://pastebin.com/VtgC4qVE - Main.java
http://pastebin.com/gF4sLHEW - JCrawler.java
http://pastebin.com/VJ1grArt - HTMLUtils.java


Solution

  • I have written a custom web crawler at my company, and I follow steps similar to the ones you describe; they have worked well for me. The only addition I would suggest is a polling frequency, so the crawler re-visits pages after a certain period of time.

    It should therefore follow the Observer design pattern: if a new update is found on a given URL after that period, the crawler notifies its observers or writes the updated content to a file.
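To illustrate the polling/Observer suggestion, here is a minimal sketch, assuming Java 11's java.net.http client. PollingCrawler, PageObserver, and the hash-based change check are hypothetical names and choices of mine, not part of the linked code.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch of periodic re-crawling with observer notification.
    public class PollingCrawler {

        public interface PageObserver {
            void onPageChanged(String url, String newContent);
        }

        private final HttpClient client = HttpClient.newHttpClient();
        private final Map<String, Integer> lastHashes = new ConcurrentHashMap<>();
        private final List<PageObserver> observers = new CopyOnWriteArrayList<>();
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public void addObserver(PageObserver observer) {
            observers.add(observer);
        }

        // Re-fetch the URL at a fixed interval; notify observers only on change.
        public void watch(String url, long periodSeconds) {
            scheduler.scheduleAtFixedRate(() -> {
                try {
                    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
                    String body = client.send(request,
                            HttpResponse.BodyHandlers.ofString()).body();
                    Integer previous = lastHashes.put(url, body.hashCode());
                    if (previous == null || previous != body.hashCode()) {
                        observers.forEach(o -> o.onPageChanged(url, body));
                    }
                } catch (Exception e) {
                    System.err.println("Poll failed for " + url + ": " + e.getMessage());
                }
            }, 0, periodSeconds, TimeUnit.SECONDS);
        }
    }

An observer registered via addObserver could then write the changed page to a file or regenerate a sitemap entry. Comparing hashCode() values is only a cheap change check; a real implementation might prefer HTTP conditional requests (ETag / Last-Modified) to avoid re-downloading unchanged pages.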