What I'm actually trying to do is figure out how BEEMP3.COM works.
Because of the site's speed, I doubt they scrape other sites/sources on the spot. They probably use some sort of database (PostgreSQL or MySQL) to store the "results" and then just query it for the search terms.
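To make the "store the results, then query them" idea concrete, here is a minimal sketch. It is only a guess at the general shape, not how BEEMP3 actually does it. SQLite stands in for PostgreSQL/MySQL so the example runs on its own, and the table name `tracks`, its columns, and the example.com URLs are all made up for illustration.

```python
import sqlite3


def create_index(conn):
    # Hypothetical schema: one row per MP3 URL found earlier by a crawler.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tracks ("
        "url TEXT PRIMARY KEY, title TEXT NOT NULL)"
    )


def add_track(conn, url, title):
    # INSERT OR IGNORE keeps the first record if the same URL shows up twice.
    conn.execute(
        "INSERT OR IGNORE INTO tracks (url, title) VALUES (?, ?)",
        (url, title),
    )


def search(conn, term):
    # Parameterized substring match; SQLite's LIKE ignores ASCII case by default.
    rows = conn.execute(
        "SELECT url, title FROM tracks WHERE title LIKE ? ORDER BY title",
        (f"%{term}%",),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
create_index(conn)
add_track(conn, "http://example.com/a.mp3", "Artist A - First Song")
add_track(conn, "http://example.com/b.mp3", "Artist B - Second Song")
print(search(conn, "first"))
```

The point is that the search never touches the remote hosts at request time; it only reads rows that were collected beforehand. A site with real traffic would more likely use a full-text index than a plain `LIKE` scan.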
My question is: how do you guys think they crawl/spider or otherwise get the mp3 files/content? They must have some algorithm to spider the internet, OR they use the Google "index of" mp3 trick to find hosts with the raw mp3 files.
Any comments and tips or ideas are appreciated :)
QueryPath is a great tool for building a web spider.
I'm guessing they find MP3s using a combination approach - they have a list of "seed sites" (gathered from Google, Usenet or manually inserted) that they use as starting points for the search, and then they set spiders running against them.
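Here is a rough sketch of the seed-site idea in Python, using only the standard library. It is an illustration of the guess above, not the site's real crawler. The names `LinkParser`, `fetch_html`, `crawl` and the `max_pages` limit are my own, and the sketch ignores robots.txt and rate limiting, which a real spider would need to handle.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_html(url):
    # Default fetcher; crawl() accepts a replacement so it can run offline.
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(seeds, fetch=fetch_html, max_pages=50):
    # Breadth-first walk starting from the seed pages.
    queue = deque(seeds)
    seen = set(seeds)
    mp3s = set()
    pages = 0
    while queue and pages < max_pages:
        page = queue.popleft()
        pages += 1
        try:
            html = fetch(page)
        except Exception:
            continue  # unreachable or broken page: skip it
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            link = urljoin(page, href)  # resolve relative links
            parts = urlparse(link)
            if parts.scheme not in ("http", "https"):
                continue
            if parts.path.lower().endswith(".mp3"):
                mp3s.add(link)  # record the file, do not download it
            elif link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(mp3s)
```

You would call it as `crawl(["http://some-seed-site/"])` with your own seed list and store whatever comes back in the database. Passing a different `fetch` function lets you try the logic on canned HTML without hitting the network.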
You need to write a script that will:
You'll also need to re-check your MP3 links regularly to erase any bad links.
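For the re-checking part, something along these lines could be run on a schedule. Again this is just a sketch: `is_alive` and `prune_dead_links` are hypothetical names, and it assumes a 2xx response to a HEAD request means the link is still good. Some hosts reject HEAD, so you may need to fall back to a GET for those.

```python
from urllib.error import URLError
from urllib.request import Request, urlopen


def is_alive(url, timeout=10):
    # HEAD asks for headers only, so the MP3 itself is not downloaded.
    try:
        req = Request(url, method="HEAD")
        with urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (URLError, ValueError, OSError):
        # HTTP errors, malformed URLs, DNS failures and timeouts all count as dead.
        return False


def prune_dead_links(urls, check=is_alive):
    # Split the stored URLs into those that still respond and those that do not.
    alive, dead = [], []
    for url in urls:
        (alive if check(url) else dead).append(url)
    return alive, dead
```

The `dead` list is what you would delete from the database. In practice you might only delete a link after it fails several checks in a row, since hosts go down temporarily.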