Tags: c#, .net, dns, search-engine

How can I get a database of valid URLs for my search engine?


I'm trying to build an Internet search engine for school, using nothing more than C# and the .NET Framework. I need to download the HTML of the pages I'm indexing.

All I need now is a list of valid URLs.

Since I don't have a database of valid URLs, I wrote a trial-and-error algorithm that grows a string:

a, b, c.....
aa, ab, ac......
aaa, aab, aac......
aaaa, aaab, aaac......
aaaaa, aaaab, aaaac......

and then appends .com, .net, or some other top-level domain to each candidate to see whether it exists. This is far too inefficient.

I need a database with valid URLs. Do you know where I can get one?

I can't work out how to get them straight out of DNS - is this something that's possible?


Solution

  • You can build your own. Most search engines crawl pages and follow links to other pages.

    You start with a known list of URLs (it doesn't have to be very big), then:

    1. Fetch a page from your list
    2. Find the links on that page
    3. Add any links you haven't seen before to your list
    4. Go to 1
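
The loop above can be sketched in C# roughly as follows. This is a minimal sketch, not a production crawler: the names `CrawlerSketch`, `ExtractLinks`, `Crawl`, `FetchOverHttp` and the `maxPages` cap are made up for illustration, the regular expression is a crude stand-in for a real HTML parser, and `WebClient` is assumed to be available (it ships with the .NET Framework). A real crawler should also respect robots.txt and throttle its requests.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class CrawlerSketch
{
    // Naive href matcher (assumption: good enough for a school project;
    // a proper HTML parser would be more robust).
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*[\"'](?<url>[^\"']+)[\"']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Step 2: find the links on a page and resolve them against the page's own URL.
    public static List<Uri> ExtractLinks(Uri pageUrl, string html)
    {
        var links = new List<Uri>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            Uri absolute;
            bool ok = Uri.TryCreate(pageUrl, m.Groups["url"].Value, out absolute);
            // Keep only http/https links (drops mailto:, javascript:, and so on).
            if (ok && (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                links.Add(absolute);
            }
        }
        return links;
    }

    // Hypothetical fetcher: returns the page's HTML, or null if the download fails.
    public static string FetchOverHttp(Uri url)
    {
        try
        {
            using (var client = new WebClient())
            {
                return client.DownloadString(url);
            }
        }
        catch (WebException)
        {
            return null;
        }
    }

    // Steps 1-4: breadth-first crawl starting from the seed list.
    // Returns the URLs that were fetched successfully, in visit order.
    public static List<Uri> Crawl(IEnumerable<Uri> seeds, Func<Uri, string> fetch, int maxPages)
    {
        var queue = new Queue<Uri>(seeds);
        var seen = new HashSet<Uri>(queue);   // everything ever queued
        var visited = new List<Uri>();

        while (queue.Count > 0 && visited.Count < maxPages)
        {
            Uri current = queue.Dequeue();                     // 1. take a page from the list
            string html = fetch(current);
            if (html == null) continue;                        // unreachable page: skip it
            visited.Add(current);

            foreach (Uri link in ExtractLinks(current, html))  // 2. find its links
            {
                if (seen.Add(link)) queue.Enqueue(link);       // 3. add only new links
            }
        }                                                      // 4. go to 1
        return visited;
    }
}
```

Calling `CrawlerSketch.Crawl(new[] { new Uri("http://example.com/") }, CrawlerSketch.FetchOverHttp, 100)` would return up to 100 fetched URLs; the seed URL is only a placeholder. Passing the fetcher in as a delegate keeps the loop testable without network access.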

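    As for using DNS: it resolves hostnames, not URLs, so it knows nothing about the individual pages on a site. And, as far as I know, you can't get a list of every hostname from a DNS server unless you manage the server yourself (listing a whole zone requires a zone transfer, which operators normally restrict).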
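
To illustrate the hostname/URL distinction, here is a small sketch. The class and method names (`DnsSketch`, `HostOf`, `Resolves`) are hypothetical; the only library calls used are `System.Uri` and `System.Net.Dns.GetHostAddresses`. It shows that DNS can only answer "does this hostname resolve?", which is a yes/no check on a name you already have, not a way to enumerate names or pages.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

public static class DnsSketch
{
    // The host is the only part of a URL that DNS ever sees;
    // the scheme, path and query string are handled by the web server.
    public static string HostOf(string url)
    {
        return new Uri(url).Host;
    }

    // Forward lookup: true if the hostname currently resolves to at least one address.
    public static bool Resolves(string hostname)
    {
        try
        {
            return Dns.GetHostAddresses(hostname).Length > 0;
        }
        catch (SocketException)
        {
            // The resolver could not find the name (or the lookup failed).
            return false;
        }
    }
}
```

For example, `DnsSketch.HostOf("http://www.example.com/docs/page.html?x=1")` yields `www.example.com`, and that hostname is all you could pass to `Resolves`.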