Tags: java, optimization, web-crawler, feedback

Webcrawler, feedback?


Hey folks, every once in a while I need to automate data-collection tasks from websites. Sometimes I need a bunch of URLs from a directory, sometimes I need an XML sitemap (yes, I know there are plenty of tools and online services for that).

Anyway, as a follow-up to my previous question, I've written a little web crawler that can visit websites.

  • A basic crawler class for quickly and easily interacting with a single website.

  • Override "doAction(String URL, String content)" to process the content further (e.g. store it or parse it).

  • The design allows multiple crawlers to run in parallel: all class instances share the processed and queued link collections (see the sketch after this list).

  • Instead of keeping track of processed and queued links inside the object, a JDBC connection could be established to store the links in a database.

  • It is currently limited to one website at a time; however, it could be extended with an externalLinks stack that is filled as external links are encountered.

  • JCrawler is intended for quickly generating XML sitemaps or parsing websites for the information you want. It is lightweight.
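To make the shared-state design concrete, here is a minimal, hypothetical sketch of the skeleton described above. It is not the pastebin code: the class name SimpleCrawler, the java.net.http client (Java 11+), and the concurrent collections are my assumptions; only the doAction(String, String) hook mirrors the description.

    import java.io.IOException;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Queue;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Hypothetical skeleton; names do not come from the linked JCrawler code.
    public class SimpleCrawler implements Runnable {

        // Shared across all crawler instances so threads never fetch the same URL twice.
        private static final Set<String> processed = ConcurrentHashMap.newKeySet();
        private static final Queue<String> queued = new ConcurrentLinkedQueue<>();

        private final HttpClient client = HttpClient.newHttpClient();

        public SimpleCrawler(String seedUrl) {
            queued.add(seedUrl);
        }

        @Override
        public void run() {
            String url;
            // Note: a production version needs a smarter termination condition,
            // since the queue can be momentarily empty while other threads still work.
            while ((url = queued.poll()) != null) {
                if (!processed.add(url)) {
                    continue; // another thread already handled this URL
                }
                try {
                    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
                    HttpResponse<String> response =
                            client.send(request, HttpResponse.BodyHandlers.ofString());
                    doAction(url, response.body());
                    // A real implementation would extract same-site links from
                    // response.body() here and offer them to 'queued'.
                } catch (IOException | InterruptedException e) {
                    System.err.println("Failed to fetch " + url + ": " + e.getMessage());
                }
            }
        }

        // Override this to store, parse, or otherwise process each page.
        protected void doAction(String url, String content) {
            System.out.println("Fetched " + url + " (" + content.length() + " chars)");
        }
    }

Starting several threads over this class, e.g. new Thread(new SimpleCrawler("https://example.com/")).start(), lets them share the work; the concurrent collections make deduplication safe without explicit locking. The two static collections are also the natural seam for the JDBC idea above: swap them for a links table and the crawl loop stays the same.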

Is this a good/decent way to write a crawler, given the limitations above? Any input would help immensely :)

http://pastebin.com/VtgC4qVE - Main.java
http://pastebin.com/gF4sLHEW - JCrawler.java
http://pastebin.com/VJ1grArt - HTMLUtils.java


Solution

  • I have written a custom web crawler at my company, and I follow steps similar to the ones you describe; they have worked well for me. The only addition I would suggest is a polling frequency, so the crawler re-visits pages after a certain period of time.

    It should therefore follow the Observer design pattern: if a new update is found on a given URL after that period, the crawler notifies its observers or writes the updated content to a file.
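To illustrate the polling/Observer suggestion, here is a minimal sketch, assuming Java 11's java.net.http client. PollingCrawler, PageObserver, and the hash-based change check are hypothetical names and choices of mine, not part of the linked code.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch of periodic re-crawling with observer notification.
    public class PollingCrawler {

        public interface PageObserver {
            void onPageChanged(String url, String newContent);
        }

        private final HttpClient client = HttpClient.newHttpClient();
        private final Map<String, Integer> lastHashes = new ConcurrentHashMap<>();
        private final List<PageObserver> observers = new CopyOnWriteArrayList<>();
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public void addObserver(PageObserver observer) {
            observers.add(observer);
        }

        // Re-fetch the URL at a fixed interval; notify observers only on change.
        public void watch(String url, long periodSeconds) {
            scheduler.scheduleAtFixedRate(() -> {
                try {
                    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
                    String body = client.send(request,
                            HttpResponse.BodyHandlers.ofString()).body();
                    Integer previous = lastHashes.put(url, body.hashCode());
                    if (previous == null || previous != body.hashCode()) {
                        observers.forEach(o -> o.onPageChanged(url, body));
                    }
                } catch (Exception e) {
                    System.err.println("Poll failed for " + url + ": " + e.getMessage());
                }
            }, 0, periodSeconds, TimeUnit.SECONDS);
        }
    }

An observer registered via addObserver could then write the changed page to a file or regenerate a sitemap entry. Comparing hashCode() values is only a cheap change check; a real implementation might prefer HTTP conditional requests (ETag / Last-Modified) to avoid re-downloading unchanged pages.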