Tags: java, web-crawler, broken-links

How can I search for broken links on a website using Java?


I would like to scan some websites looking for broken links, preferably using Java. Any hints on how I can start doing this?

(I know there are some websites that do this, but I want to make my own personalized log file)


Solution

  • Writing a web crawler isn't as simple as just reading the static HTML: if the page uses JavaScript to modify the DOM, it gets complex. You will also need to look out for pages you've already visited (so-called spider traps). If the site is pure static HTML, then go for it... but if the site uses jQuery and is large, expect it to be complex.

    If your site is all static, small, and has little or no JS, then use the answers already listed.

    Or

    You could use Heritrix and then parse its crawl.log for 404s afterwards (see the Heritrix documentation on crawl.log).
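
    A minimal sketch of such a scan, assuming the default crawl.log layout in which the second whitespace-separated field is the fetch status code and the fourth is the URI (the file path here is a placeholder; check the docs for your Heritrix version):

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Paths;

        public class CrawlLogScan {
            public static void main(String[] args) throws IOException {
                // Report every crawl.log entry whose status field is 404.
                Files.lines(Paths.get("crawl.log")).forEach(line -> {
                    String[] f = line.trim().split("\\s+");
                    if (f.length > 3 && "404".equals(f[1])) {
                        System.out.println("Broken: " + f[3]);
                    }
                });
            }
        }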

    Or, if you must write your own:

    You could use something like HtmlUnit (it has a JavaScript engine) to load the page, then query the DOM for links. Place each link in an "unvisited" queue, then pull links from the unvisited queue to get your next URL to load; if the page fails to load, report it.
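
    A minimal sketch of that loop, assuming HtmlUnit 2.x (the com.gargoylesoftware packages) and a hypothetical start URL; a plain HashSet stands in for the visited-page check described below:

        import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
        import com.gargoylesoftware.htmlunit.WebClient;
        import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
        import com.gargoylesoftware.htmlunit.html.HtmlPage;
        import java.util.ArrayDeque;
        import java.util.HashSet;
        import java.util.Queue;
        import java.util.Set;

        public class LinkChecker {
            public static void main(String[] args) {
                Queue<String> unvisited = new ArrayDeque<>();
                Set<String> visited = new HashSet<>();
                unvisited.add("http://example.com/");  // hypothetical start URL

                try (WebClient client = new WebClient()) {
                    client.getOptions().setThrowExceptionOnScriptError(false);
                    while (!unvisited.isEmpty()) {
                        String url = unvisited.poll();
                        if (!visited.add(url)) continue;  // skip pages already seen
                        try {
                            HtmlPage page = client.getPage(url);  // assumes the URL serves HTML
                            for (HtmlAnchor a : page.getAnchors()) {
                                // Resolve relative hrefs against the current page
                                String next = page.getFullyQualifiedUrl(a.getHrefAttribute()).toString();
                                if (!visited.contains(next)) unvisited.add(next);
                            }
                        } catch (FailingHttpStatusCodeException e) {
                            System.out.println("Broken: " + url + " (HTTP " + e.getStatusCode() + ")");
                        } catch (Exception e) {
                            System.out.println("Failed: " + url + " (" + e + ")");
                        }
                    }
                }
            }
        }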

    To avoid duplicate pages (spider traps) you could hash each link and keep a hash table of visited pages (see CityHash). Before placing a link into the unvisited queue, check it against the visited table.
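
    A sketch of that visited-set check. CityHash has no JDK implementation, so as a stand-in (my substitution, not the original suggestion) this truncates SHA-256 to 64 bits; storing longs rather than full URL strings keeps the table compact on large crawls:

        import java.nio.ByteBuffer;
        import java.nio.charset.StandardCharsets;
        import java.security.MessageDigest;
        import java.security.NoSuchAlgorithmException;
        import java.util.HashSet;
        import java.util.Set;

        public class VisitedSet {
            private final Set<Long> seen = new HashSet<>();

            // Returns true the first time a URL is offered, false afterwards.
            public boolean markVisited(String url) {
                return seen.add(hash64(url));
            }

            private static long hash64(String s) {
                try {
                    byte[] digest = MessageDigest.getInstance("SHA-256")
                            .digest(s.getBytes(StandardCharsets.UTF_8));
                    return ByteBuffer.wrap(digest).getLong();  // first 8 bytes of the digest
                } catch (NoSuchAlgorithmException e) {
                    throw new AssertionError(e);  // SHA-256 is always available
                }
            }
        }

    In the crawl loop you would then queue a link only when markVisited(link) returns true.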

    To avoid leaving your site, check that the URL is in a safe-domain list before adding it to the unvisited queue. If you want to confirm that the off-domain links are good, keep them in an offDomain queue, then later fetch each link from this queue, e.g. with new URL(link).getContent(), to see if it works (faster than using HtmlUnit, and you don't need to parse the page anyway).
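
    A sketch of both checks, assuming a hypothetical safe-host list; instead of getContent() it issues a HEAD request via HttpURLConnection, which skips the response body entirely (some servers mishandle HEAD, in which case fall back to GET):

        import java.io.IOException;
        import java.net.HttpURLConnection;
        import java.net.MalformedURLException;
        import java.net.URL;
        import java.util.Set;

        public class DomainCheck {
            // Hypothetical safe-domain list; fill in your own hosts.
            private static final Set<String> SAFE_HOSTS =
                    Set.of("example.com", "www.example.com");

            static boolean inSafeDomain(String url) throws MalformedURLException {
                return SAFE_HOSTS.contains(new URL(url).getHost().toLowerCase());
            }

            // Returns the HTTP status; anything >= 400 counts as broken.
            static int probe(String url) throws IOException {
                HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                conn.setRequestMethod("HEAD");  // status line only, no body
                conn.setConnectTimeout(5_000);
                conn.setReadTimeout(5_000);
                int status = conn.getResponseCode();
                conn.disconnect();
                return status;
            }
        }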