I'm building a deal aggregator, so I need a crawler that extracts data from several sites: price, discount, image, coordinates and, of course, the name of the deal.
Do you know of any tutorials, ebooks or other resources that would help me? For the image, coordinates and discount I already have a solution and pattern:
How do I get the following items?
Do you know of any data extraction algorithms that can be helpful?
I'd suggest using an XPath-based scraper, for example Web-Harvest.
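To give an idea of what XPath-based extraction looks like, here is a minimal sketch that uses only the XPath support built into the JDK (javax.xml.xpath), not Web-Harvest itself. The markup, the class names in it ("deal", "title", "price") and the Java class DealXPathSketch are made up for illustration. It also assumes the page is well-formed XHTML; real pages usually are not, so you would typically run them through an HTML cleaner first.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DealXPathSketch {
    // Hypothetical, well-formed markup standing in for a downloaded deal page.
    static final String SAMPLE =
        "<div class=\"deal\">"
        + "<h2 class=\"title\">Spa day for two</h2>"
        + "<span class=\"price\">49.90</span>"
        + "</div>";

    // Returns {name, price} as text; the XPath expressions are assumptions
    // tied to the sample markup above and must be adapted per site.
    static String[] extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            String name = xpath.evaluate("//div[@class='deal']/h2[@class='title']", doc).trim();
            String price = xpath.evaluate("//div[@class='deal']/span[@class='price']", doc).trim();
            return new String[] { name, price };
        } catch (Exception e) {
            // Parsing or XPath evaluation failed, e.g. the input was not well-formed.
            throw new IllegalStateException("Could not extract deal fields", e);
        }
    }

    public static void main(String[] args) {
        String[] deal = extract(SAMPLE);
        System.out.println(deal[0] + " | " + deal[1]);
    }
}
```

In practice you would keep one set of XPath expressions per source site, since every site structures its deal pages differently.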
Or, if you want to analyze raw text, I'd suggest using a state-machine parser to recognize the templated parts of the text.
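As a rough illustration of the state-machine idea, the sketch below scans raw text one character at a time and collects decimal amounts such as 49.90 or 99,00. The class name, the states and the rule that only numbers with a decimal part count as prices are my own assumptions for the example; a real deal text would need more states (currency symbols, thousands separators, percent signs for discounts).

```java
import java.util.ArrayList;
import java.util.List;

final class PriceStateMachineSketch {
    // States of the hypothetical recognizer for amounts like "49.90" or "99,00".
    private enum State { OUTSIDE, INTEGER_PART, AFTER_SEPARATOR, FRACTION_PART }

    // Returns every decimal amount found in the text, normalized to use '.'.
    static List<String> findAmounts(String text) {
        List<String> amounts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        State state = State.OUTSIDE;
        // A trailing space acts as a sentinel so an amount at the very end is flushed.
        for (char c : (text + " ").toCharArray()) {
            boolean digit = c >= '0' && c <= '9';
            switch (state) {
                case OUTSIDE:
                    if (digit) { current.append(c); state = State.INTEGER_PART; }
                    break;
                case INTEGER_PART:
                    if (digit) {
                        current.append(c);
                    } else if (c == '.' || c == ',') {
                        state = State.AFTER_SEPARATOR;
                    } else {
                        // Plain integer (e.g. "2 people"): not treated as a price here.
                        current.setLength(0);
                        state = State.OUTSIDE;
                    }
                    break;
                case AFTER_SEPARATOR:
                    if (digit) {
                        current.append('.').append(c);
                        state = State.FRACTION_PART;
                    } else {
                        // The separator was just punctuation, so drop the candidate.
                        current.setLength(0);
                        state = State.OUTSIDE;
                    }
                    break;
                case FRACTION_PART:
                    if (digit) {
                        current.append(c);
                    } else {
                        amounts.add(current.toString());
                        current.setLength(0);
                        state = State.OUTSIDE;
                    }
                    break;
            }
        }
        return amounts;
    }

    public static void main(String[] args) {
        // Prints [49.90, 99.00]
        System.out.println(findAmounts("Spa day for 2 people, now 49.90 EUR (was 99,00 EUR)"));
    }
}
```

The advantage over one big regular expression is that each state can carry its own logic, so it is easy to extend the machine when a site uses a slightly different text template.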
Look at this topic: Are there APIs for text analysis/mining in Java?