Make a web crawler/spider


I'm looking into making a web crawler/spider but I need someone to point me in the right direction to get started.

Basically, my spider is going to search for audio files and index them.

I'm just wondering if anyone has any ideas for how I should do it. I've heard that doing it in PHP would be extremely slow. I know VB.NET, so could that come in handy?

I was thinking about using Google's filetype search to get links to crawl. Would that be OK?


Solution

  • In VB.NET you will need to get the HTML first, so use the WebClient class or the HttpWebRequest and HttpWebResponse classes. There is plenty of information online on how to use these.

    Then you will need to parse the HTML to pull out the links. I recommend using regular expressions for this, matching href values that end in the audio file extensions you care about.

    Your idea of using Google for a filetype search is a good one. I did a similar thing a few years ago to gather PDFs to test PDF indexing in SharePoint, which worked really well.
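
To make the two steps above concrete (download the HTML, then pull links out with a regular expression), here is a minimal sketch. It is written in Python purely for illustration; the same shape should carry over to VB.NET with WebClient.DownloadString and the Regex class. The extension list, the User-Agent string, and the function names fetch_html and extract_audio_links are all assumptions made for this example, and the regex is a rough heuristic that only handles quoted href values, not a full HTML parser.

```python
import re
import urllib.request
from urllib.parse import urljoin

# Assumed set of audio extensions; adjust to the formats you want to index.
AUDIO_EXTENSIONS = ("mp3", "wav", "ogg", "flac", "m4a")

# Matches href="..." or href='...' whose target ends in one of the audio
# extensions, optionally followed by a query string or fragment.
AUDIO_LINK_RE = re.compile(
    r"""href\s*=\s*["']([^"']+\.(?:%s)(?:[?#][^"']*)?)["']"""
    % "|".join(AUDIO_EXTENSIONS),
    re.IGNORECASE,
)


def extract_audio_links(html, base_url):
    """Return absolute URLs of linked audio files, in page order, without duplicates."""
    seen = set()
    links = []
    for match in AUDIO_LINK_RE.finditer(html):
        # Resolve relative links such as "/music/a.mp3" against the page URL.
        absolute = urljoin(base_url, match.group(1))
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


def fetch_html(url, timeout=10):
    """Download a page and decode it to text (hypothetical helper for this sketch)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "audio-indexer-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

For a seed URL obtained from a filetype search, usage would look like `links = extract_audio_links(fetch_html(seed_url), seed_url)`, where seed_url is whatever page you want to scan. Resolving every link against the page URL keeps the index free of relative paths, and de-duplicating per page avoids indexing the same file twice.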