search, search-engine, web-crawler

Are there any building blocks for a search engine that will scrape other sites?


I want to build a search service for one particular thing. The data is freely available out there, via free classified services and a host of other sites.

Are there any building blocks I can use, e.g. open-source crawlers that I could customize rather than build from scratch?

Any advice on building such a product? Not just technical, but any privacy/legal things that I might need to take into consideration.

E.g. if I get results from many places, do I need to 'give credit' to where each one came from and put a link to the original?

Edit: By the way, I am using GWT with JS for the front-end, and I haven't decided on the language for the back-end: either PHP or Python. Thoughts?


Solution

  • There are a few building blocks in Python you can use.

    1. beautifulsoup [http://www.crummy.com/software/BeautifulSoup/] for parsing HTML. It can handle bad markup too, and its API is very easy to use; for me it is way better than any DOM-like tool. A friend of mine used it to scrape his old phpBB forum successfully. It has pretty good docs.
    2. mechanize [http://wwwsearch.sourceforge.net/mechanize/] is an HTTP client library that simulates a web browser. It handles cookies, filling in forms and so on. It is also easy to use, but it helps if you understand how HTTP works.
    3. http://dev.scrapy.org/ -- this is a relatively new thing: a whole scraping framework based on Twisted. I haven't played with it much.

    I use the first two for my needs; e.g. it takes 20 lines of code to get an automated testing tool for a 3-stage poll, including simulating the wait while a user enters data, and so on.
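
To give a feel for the HTML-parsing step that the first item in the list covers, here is a minimal sketch of pulling links out of a listing page. Note that it deliberately uses Python's standard-library `html.parser` rather than BeautifulSoup, so it runs without installing anything; BeautifulSoup would do the same job with less code and more tolerance for broken markup. The names `LinkExtractor`, `extract_links`, `SAMPLE` and the `classifieds.example` URLs are made up for illustration and do not come from any of the libraries above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute_url, link_text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    """Hypothetical helper: return every link found in `html`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up listing page standing in for a real classifieds site.
SAMPLE = """
<html><body>
  <a href="/ads/1">Red bicycle</a>
  <a href="http://other.example/ads/2">Blue <b>kayak</b></a>
  <a>no href</a>
</body></html>
"""

if __name__ == "__main__":
    for url, text in extract_links(SAMPLE, "http://classifieds.example/list"):
        print(url, "-", text)
```

Keeping the absolute URL next to each result is a cheap design choice: it gives a crawler its next pages to visit, and it also means every stored result still carries a link back to the original page.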