Smart data extraction algorithm from websites

I'm building a deal aggregator, so I need a crawler that extracts data from several sites: price, discount, image, coordinates and, of course, the name of the deal.

Do you know of any tutorials, ebooks or other resources that would help me? For the image, coordinates and discount I already have a working pattern:

image: the biggest image is always the main image of the deal
discount: the discount is always a number between 50 and 99, followed by a "%" symbol
coordinates: always given as decimal numbers, so I extract them with a regex
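
The discount and coordinate rules above can be sketched with two regular expressions. This is only an illustration of the heuristics as stated: the function names, the sample text, and the assumption that coordinates carry at least four fractional digits (to tell them apart from prices) are not from the original post.

```python
import re

# Discount: a two-digit number from 50 to 99, optionally followed by a space, then "%".
DISCOUNT_RE = re.compile(r"(?<!\d)([5-9]\d)\s?%")

# Coordinates: signed decimals with at least four fractional digits.
# The four-digit threshold is an assumption used to skip prices such as 19.99.
COORD_RE = re.compile(r"(?<![\d.])(-?\d{1,3}\.\d{4,})")


def extract_discount(text):
    """Return the first discount percentage found, or None."""
    match = DISCOUNT_RE.search(text)
    return int(match.group(1)) if match else None


def extract_coordinates(text):
    """Return the first two coordinate-like numbers as (lat, lon), or None."""
    numbers = [float(n) for n in COORD_RE.findall(text)]
    return (numbers[0], numbers[1]) if len(numbers) >= 2 else None


sample = "Save 70% today! Location: 44.8125, 20.4612. Price 19.99"
discount = extract_discount(sample)
coordinates = extract_coordinates(sample)
```

Heuristics like these are brittle across sites, so it is probably worth validating the results (for example, checking that latitude falls within -90 to 90) before storing them.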

How do I get the following items?

Name of deal?
Price?

Do you know of any data extraction algorithms that can be helpful?

Solution

I'd suggest using an XPath-based scraper, for example Web-Harvest.
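
To illustrate the XPath idea (not Web-Harvest itself, which is a Java tool with its own configuration format), here is a minimal Python sketch. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup; real-world HTML would need a tolerant HTML parser. The page snippet, class names and per-site XPath table are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for a deal page.
PAGE = """
<html><body>
  <div class="deal">
    <h1 class="deal-title">Spa weekend for two</h1>
    <span class="deal-price">49.90</span>
  </div>
</body></html>
"""

# One XPath expression per field; in practice each site gets its own table.
XPATHS = {
    "name": ".//h1[@class='deal-title']",
    "price": ".//span[@class='deal-price']",
}


def extract_deal(markup, xpaths):
    """Apply each XPath to the page and collect the text of the first match."""
    root = ET.fromstring(markup.strip())
    result = {}
    for field, path in xpaths.items():
        node = root.find(path)
        has_text = node is not None and node.text
        result[field] = node.text.strip() if has_text else None
    return result


deal = extract_deal(PAGE, XPATHS)
```

The trade-off is that the name and price are located by page structure rather than guessed from text, at the cost of maintaining a small set of expressions per site.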

Or, if you want to analyze raw text, I'd suggest using a state-machine parser to recognize the templated parts of the text.
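
As a rough sketch of what such a state-machine parser might look like, the example below walks over the tokens of a text and switches state whenever it meets a known label. The labels `Deal:` and `Price:` and the sample sentence are assumptions made for illustration; a real template would define its own.

```python
# Hypothetical template labels mapped to the field each one introduces.
LABELS = {"Deal:": "name", "Price:": "price"}


def parse_templated(text):
    """Collect the tokens that follow each known label into a field."""
    state = "SEEK"  # "SEEK" = no label seen yet; otherwise the field being filled
    fields = {}
    for token in text.split():
        if token in LABELS:
            # Transition: a label starts a new field.
            state = LABELS[token]
            fields[state] = []
        elif state != "SEEK":
            # Inside a field: accumulate tokens until the next label.
            fields[state].append(token)
        # Tokens seen while still in "SEEK" are ignored as noise.
    return {name: " ".join(tokens) for name, tokens in fields.items()}


parsed = parse_templated("Today only! Deal: Spa weekend for two Price: 49.90 EUR")
```

Here `parsed` maps `name` to the words after `Deal:` and `price` to the words after `Price:`, while the leading "Today only!" is skipped.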

Look at this topic: Are there APIs for text analysis/mining in Java?
