I want to crawl and save some webpages as HTML. Say, crawl hundreds of popular websites and simply save their front pages and their "About" pages.
I've looked through many questions, but haven't found an answer to this in either the web-crawling or the web-scraping questions.
What library or tool should I use to build the solution? Or are there existing tools that can handle this?
There really is no ready-made solution here. You are right to suspect that Python is probably the best way to start, because of its incredibly strong support for regular expressions.
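As a starting point, here is a minimal sketch of fetching a front page and saving it as HTML, using only the standard library; the URL, filename, and user-agent string are placeholders for illustration, not anything specific to your sites:

    import urllib.request

    def save_page(url, filename):
        # Identify the crawler politely; some sites block the default user agent.
        req = urllib.request.Request(url, headers={"User-Agent": "MyCrawler/0.1"})
        with urllib.request.urlopen(req, timeout=10) as resp:
            html = resp.read()
        # Write raw bytes so we don't have to guess the page's encoding.
        with open(filename, "wb") as f:
            f.write(html)

    save_page("http://example.com/", "example_frontpage.html")

You would loop this over your list of a few hundred sites; for that volume the standard library is plenty, though a dedicated crawling framework becomes worthwhile at larger scale.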
In order to implement something like this, strong knowledge of SEO (Search Engine Optimization) would help since effectively optimizing a webpage for search engines tells you how search engines behave. I would start with a site like SEOMoz.
As far as identifying the "About Us" page goes, you only have two options:
a) For each site, manually find the link to its About Us page and feed it to your crawler.
b) Parse all the links on the page for certain keywords like "about us", "about", "learn more", or whatever.
If you use option b), be careful: you could get stuck in an infinite loop, since a website will link to the same page many times. This is especially likely if the link is in the header or footer; a page may even link back to itself. To avoid this, you'll need to keep a list of visited links and make sure not to revisit them, as in the sketch below.
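A rough sketch of option b), assuming the HTML is already downloaded (e.g. by `save_page` above); the keyword list and the link-matching regex are illustrative only, and a real crawler would also resolve relative URLs:

    import re

    ABOUT_KEYWORDS = ("about us", "about", "learn more")

    def find_about_links(html, visited):
        # Capture href and anchor text with a deliberately simple regex;
        # messy real-world HTML may need a more tolerant parser.
        links = re.findall(r'<a[^>]+href="([^"]+)"[^>]*>(.*?)</a>', html, re.I | re.S)
        matches = []
        for href, text in links:
            if href in visited:
                continue  # already seen: skip to avoid infinite loops
            if any(kw in text.lower() or kw in href.lower() for kw in ABOUT_KEYWORDS):
                visited.add(href)
                matches.append(href)
        return matches

    visited = set()  # shared across the whole crawl so no page is revisited
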
Finally, I would recommend having your crawler respect the instructions in the robots.txt file, and it's probably a good idea not to follow links marked rel="nofollow", as these are mostly used on external links. Again, you can learn this and more by reading up on SEO.
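Both points are easy to handle with the standard library; in this sketch the site URL and the crawler name ("MyCrawler") are placeholder assumptions:

    import re
    from urllib.robotparser import RobotFileParser

    # Respect robots.txt: check each URL before fetching it.
    rp = RobotFileParser("http://example.com/robots.txt")
    rp.read()
    allowed = rp.can_fetch("MyCrawler", "http://example.com/about")

    def is_nofollow(anchor_tag):
        # Skip anchors marked rel="nofollow"; a simple check, since
        # attribute order and quoting vary in real HTML.
        return re.search(r'rel\s*=\s*["\'][^"\']*nofollow', anchor_tag, re.I) is not None

    print(allowed, is_nofollow('<a href="/x" rel="nofollow">x</a>'))
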