regex, algorithm, data-extraction

Smart data extraction algorithm from websites


I'm building a deal aggregator, so I need a crawler that extracts data from several sites: the price, discount, image, coordinates and, of course, the name of each deal.

Do you know of any tutorials, ebooks or other resources that would help me? For the image, coordinates and discount I already have a solution based on these patterns:

  • image: the biggest image is always the main image of the deal
  • discount: the discount is always a number between 50 and 99, followed by a "%" symbol
  • coordinates: these are always decimal numbers, so I extract them with a regex
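
The discount and coordinate patterns above could be sketched as regular expressions along these lines. This is only a minimal Python sketch: the names `find_discounts` and `find_coordinates` and the sample text are hypothetical, and the coordinate pattern assumes that latitude and longitude appear as a comma-separated pair of decimal numbers, which may not hold on every site.

```python
import re

# Discount: a whole number from 50 to 99 followed by "%" (optional space before it).
# The lookbehind rejects the tail of a longer number such as "150%".
DISCOUNT_RE = re.compile(r"(?<![\d.])([5-9]\d)\s?%")

# Coordinates: ASSUMED to be a "lat, lon" pair of signed decimal numbers.
COORD_RE = re.compile(r"(?<![\d.])(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")


def find_discounts(text):
    """Return all discount percentages (50-99) found in text as ints."""
    return [int(value) for value in DISCOUNT_RE.findall(text)]


def find_coordinates(text):
    """Return (lat, lon) float pairs whose values fall in valid ranges."""
    pairs = []
    for lat, lon in COORD_RE.findall(text):
        lat, lon = float(lat), float(lon)
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            pairs.append((lat, lon))
    return pairs


# Hypothetical page text, for illustration only.
sample = "Spa day 70% off! Was 150% busier. Location: 45.8150, 15.9819"
print(find_discounts(sample))
print(find_coordinates(sample))
```

The range check on the coordinates is there because any pair of decimals (prices, for instance) would otherwise match the same pattern.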

How do I get the following items?

  • Name of deal?
  • Price?

Do you know of any data extraction algorithms that could help here?


Solution

  • I'd suggest using an XPath-based scraper, for example Web-Harvest.

    Or, if you want to analyze raw text, I'd suggest using a state-machine parser to recognize the templated parts of the text.

    Look at this topic: Are there APIs for text analysis/mining in Java?
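
To illustrate the XPath approach for the two open items (deal name and price), here is a minimal Python sketch that uses only the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath. It is not Web-Harvest itself. The markup, the class names `deal-title` and `deal-price`, and the function `extract_deal` are all hypothetical; each real site would need its own expressions, and real-world HTML is often not well-formed XML, so a more tolerant HTML parser would probably be needed in practice.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed deal page fragment (real pages will differ).
PAGE = """
<html>
  <body>
    <div class="deal">
      <h1 class="deal-title">Weekend spa package</h1>
      <span class="deal-price">EUR 49.90</span>
    </div>
  </body>
</html>
"""

# Per-site XPath expressions; these class names are assumptions.
XPATHS = {
    "name": ".//h1[@class='deal-title']",
    "price": ".//span[@class='deal-price']",
}


def extract_deal(page, xpaths):
    """Return the deal name and numeric price located by the given XPaths."""
    root = ET.fromstring(page)
    name_node = root.find(xpaths["name"])
    price_node = root.find(xpaths["price"])

    name = None
    if name_node is not None:
        name = (name_node.text or "").strip()

    price = None
    if price_node is not None:
        # Pull the first decimal number out of the price text.
        match = re.search(r"\d+(?:[.,]\d+)?", price_node.text or "")
        if match:
            price = float(match.group().replace(",", "."))

    return {"name": name, "price": price}


print(extract_deal(PAGE, XPATHS))
```

Keeping the expressions in a per-site dictionary means that supporting a new site is mostly a matter of adding another set of XPaths rather than writing new parsing code.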