Advice on the best way to spider/crawl/collect audio content from the internet

What I'm actually trying to do is figure out how BEEMP3.COM works.

Because of the site's speed, I doubt they scrape other sites/sources on the spot. They probably use some sort of database (PostgreSQL or MySQL) to store the "results" and then just query it with the search terms.

My question is: how do you guys think they crawl/spider, or otherwise get, the mp3 files/content? They must have some algorithm to spider the internet, OR use the Google "index of" mp3 trick to find hosts with the raw mp3 files.

Any comments and tips or ideas are appreciated :)

Solution

QueryPath (a PHP library for parsing and querying HTML) is a great tool for building a web spider.

I'm guessing they find MP3s using a combination approach: they have a list of "seed sites" (gathered from Google, Usenet or manually inserted) that they use as starting points for the search, and then set spiders running against them.

You need to write a script that will:

Take a webpage as a starting point
Fetch the webpage data (use cURL)
Use a regular expression to extract (a) any links (b) any links to mp3 files
Place any MP3 links into a database
Add the list of links to other webpages to a queue for processing through the above method
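
The steps above can be sketched roughly as follows. This is only an illustration in Python using the standard library (urllib instead of cURL, SQLite instead of MySQL/PostgreSQL); the names fetch_page, extract_links, crawl and the mp3_links table are made up for this example, and the regex is deliberately naive.

```python
import re
import sqlite3
from collections import deque
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

# Naive href matcher, as suggested above; a real HTML parser is more robust.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def fetch_page(url):
    """Download a page and return its body as text (assumes UTF-8)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html, base_url):
    """Split the links found on a page into (other pages, MP3 files)."""
    pages, mp3s = [], []
    for href in HREF_RE.findall(html):
        # Resolve relative links and drop any #fragment.
        url = urldefrag(urljoin(base_url, href)).url
        if not url.startswith(("http://", "https://")):
            continue  # skip mailto:, javascript:, etc.
        if url.lower().split("?")[0].endswith(".mp3"):
            mp3s.append(url)
        else:
            pages.append(url)
    return pages, mp3s


def crawl(seed_url, db, fetch=fetch_page, max_pages=50):
    """Breadth-first crawl from seed_url, storing MP3 links in the database."""
    db.execute("CREATE TABLE IF NOT EXISTS mp3_links (url TEXT PRIMARY KEY)")
    queue, seen = deque([seed_url]), {seed_url}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except (OSError, ValueError):
            continue  # unreachable or malformed URL: move on
        fetched += 1
        pages, mp3s = extract_links(html, url)
        db.executemany(
            "INSERT OR IGNORE INTO mp3_links (url) VALUES (?)",
            [(mp3,) for mp3 in mp3s],
        )
        # Queue newly discovered pages for the same treatment.
        for page in pages:
            if page not in seen:
                seen.add(page)
                queue.append(page)
    db.commit()
```

You would start it with something like crawl("http://example.com/", sqlite3.connect("mp3.db")). A real spider should also respect robots.txt, rate-limit its requests and probably stay within a whitelist of hosts; none of that is handled here.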

You'll also need to re-check your MP3 links regularly to remove any dead links.
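
A periodic clean-up pass could look something like this. It's a minimal sketch in Python, assuming the links live in a SQLite table called mp3_links with a single url column (both names are hypothetical), and treating a link as alive if a HEAD request comes back with status 200. Some servers don't answer HEAD properly, so you may need to fall back to a GET in practice.

```python
import sqlite3
from urllib.error import URLError
from urllib.request import Request, urlopen


def is_alive(url, timeout=10):
    """Return True if a HEAD request for the URL succeeds with status 200."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError, ValueError):
        # 4xx/5xx responses, DNS failures, timeouts and bad URLs all count as dead.
        return False


def prune_dead_links(db, check=is_alive):
    """Delete every stored link that fails the check; return how many were removed."""
    urls = [row[0] for row in db.execute("SELECT url FROM mp3_links")]
    dead = [(url,) for url in urls if not check(url)]
    db.executemany("DELETE FROM mp3_links WHERE url = ?", dead)
    db.commit()
    return len(dead)
```

Run prune_dead_links(sqlite3.connect("mp3.db")) from a cron job or similar. Rather than deleting on the first failure, you might prefer to count consecutive failures and only drop a link after a few misses, since hosts go down temporarily.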