I've been reading about how to implement a crawler. I understand that we start with a list of URLs to visit (the seed list), visit all of those URLs, and add all the links found in the visited pages to a list called the frontier. So how many URLs should I add to this seed list? Do I just add as many URLs as I can and hope they lead me to as many URLs on the web as possible, and does that actually guarantee I'll reach all the other URLs out there? Or is there some convention for doing this? I mean... what does a search engine like Google do?
It's basically that: they build a big list of web sites by following the connections (links) between them. The more web sites your search engine knows about, the better. The only issue is making that list useful. That is, a big list of candidate websites does not by itself give a good result set for a search, so you also have to be able to tell what's important on each web page.
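To make the seed-list/frontier idea concrete, here is a minimal sketch in Python. It assumes the third-party `requests` and `beautifulsoup4` packages; the seed URL, the `max_pages` cap, and the absence of politeness delays or robots.txt handling are all simplifications for illustration, not how any real search engine does it.

```python
# Minimal seed/frontier crawler sketch (assumes `requests` and `beautifulsoup4` are installed).
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)   # URLs waiting to be visited
    visited = set()           # URLs already fetched

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # skip pages that fail to load

        # Extract every link on the page and push unseen ones onto the frontier.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])  # resolve relative links
            if urlparse(link).scheme in ("http", "https") and link not in visited:
                frontier.append(link)

    return visited

# Example: start from a single (hypothetical) seed and see what is reachable.
# pages = crawl(["https://example.com"], max_pages=50)
```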
Depending on how much processing power and storage you have, there's no inherent point at which you need to stop crawling.
That doesn't guarantee you'll reach every single URL out there (a page that nothing links to will never be discovered), but it's basically the only practical way to crawl the web.