Tags: search, web, search-engine, business-intelligence

How do websites like torrentz.eu collect their content?


I would like to know how some search websites get their content. I used 'torrentz.eu' in the title as an example because it has content from several sources. What is behind this kind of system? Do they 'simply' parse all the websites they support and then show the content? Do they use some web service? Or both?


Solution

  • You are looking for the Crawling aspect of Information Retrieval.

    Basically, crawling is: given an initial set S of websites, expand it by exploring their links (i.e., find the transitive closure (1)).

    Some websites use focused crawlers if they intend to index only a subset of the web in the first place.

    P.S. Some websites do neither, and instead use a service such as the Google Custom Search API, Yahoo BOSS, or the Bing Developer APIs (for a fee, of course), relying on those engines' indexes rather than building one of their own.

    P.P.S. This describes a theoretical approach to how one could do it; I have no idea how the mentioned website actually works.


    (1) Due to time constraints, the full transitive closure is usually not computed, only something close enough to it.
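
    The "expand S by exploring links" idea above can be sketched as a breadth-first crawl. This is a minimal illustration, not a production crawler: the link graph here is a hypothetical in-memory stand-in (a real crawler would fetch each URL over HTTP and extract its `<a href>` links), and the `max_pages` budget reflects footnote (1), stopping before the true transitive closure is reached.

    ```python
    from collections import deque

    # Hypothetical link graph standing in for fetched pages; each key is a
    # "URL" and its value is the list of links found on that page.
    PAGES = {
        "a.example": ["b.example", "c.example"],
        "b.example": ["c.example", "d.example"],
        "c.example": [],
        "d.example": ["a.example"],  # cycle back to a seed
    }

    def crawl(seeds, fetch_links, max_pages=100):
        """Breadth-first crawl: expand the seed set by following links,
        approximating the transitive closure under a page budget."""
        seen = set(seeds)          # URLs already discovered (avoids revisits)
        frontier = deque(seeds)    # URLs discovered but not yet processed
        visited = []               # URLs processed, in crawl order
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            visited.append(url)
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return visited

    result = crawl(["a.example"], lambda url: PAGES.get(url, []))
    print(result)  # ['a.example', 'b.example', 'c.example', 'd.example']
    ```

    A focused crawler would differ only in the inner loop: instead of enqueuing every link, it would score each one for topical relevance and enqueue only those above a threshold.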