Tags: search, web, search-engine, business-intelligence

How do websites like torrentz.eu collect their content?


I would like to know how some search websites get their content. I used 'torrentz.eu' in the title as an example because it has content from several sources. What is behind this kind of system? Do they 'simply' parse all the websites they support and then show the content? Do they use some web service? Or both?


Solution

  • You are looking for the Crawling aspect of Information Retrieval.

    Basically, crawling is: given an initial set S of websites, expand it by exploring their links (i.e., find the transitive closure (1)).

    Some websites use focused crawlers if they intend to index only a subset of the web in the first place.

    P.S. Some websites do neither, and instead use a service such as the Google Custom Search API, Yahoo BOSS, or the Bing Developer APIs (for a fee, of course), relying on those engines' indexes rather than building one of their own.

    P.P.S. This describes a theoretical approach to how one could do it; I have no idea how the mentioned website actually works.


    (1) Due to time constraints, the full transitive closure is usually not computed, only something close enough to it.
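
    The "expand S by exploring links" idea above can be sketched as a breadth-first crawl. This is a minimal illustration, not a production crawler: the link graph here is a hypothetical in-memory stand-in (a real crawler would fetch each URL over HTTP and extract its `<a href>` links), and the `max_pages` budget reflects footnote (1), stopping before the true transitive closure is reached.

    ```python
    from collections import deque

    # Hypothetical link graph standing in for fetched pages; each key is a
    # "URL" and its value is the list of links found on that page.
    PAGES = {
        "a.example": ["b.example", "c.example"],
        "b.example": ["c.example", "d.example"],
        "c.example": [],
        "d.example": ["a.example"],  # cycle back to a seed
    }

    def crawl(seeds, fetch_links, max_pages=100):
        """Breadth-first crawl: expand the seed set by following links,
        approximating the transitive closure under a page budget."""
        seen = set(seeds)          # URLs already discovered (avoids revisits)
        frontier = deque(seeds)    # URLs discovered but not yet processed
        visited = []               # URLs processed, in crawl order
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            visited.append(url)
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return visited

    result = crawl(["a.example"], lambda url: PAGES.get(url, []))
    print(result)  # ['a.example', 'b.example', 'c.example', 'd.example']
    ```

    A focused crawler would differ only in the inner loop: instead of enqueuing every link, it would score each one for topical relevance and enqueue only those above a threshold.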