java html regex web-scraping data-processing

Web scraping and data processing in Java

I am writing a web scraper to extract stock quotes from Yahoo Finance, Google Finance, or NASDAQ. I can get the HTML element containing the stock prices, but I only need the dollar value from the result. For example, the sample output looks like the image below: [image: sample output]

I am using an image here because when I posted the actual HTML, only the dollar amounts (the desired results) showed up; the HTML entities and tags vanished. Here is my code: [image: code] I am not very familiar with regex. I tried it, but had no luck. How can I extract only the dollar amount from the output?

Solution

Try using java.util.regex.Matcher and java.util.regex.Pattern:

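import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Match "<td>$", then '&', any characters, and ';' (presumably an HTML entity
// such as &nbsp;), then capture the price, e.g. 130.27, up to the next '&'.
String pattern = "<td>\\$&.+;(\\d{1,4}\\.\\d{2})&";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(inputLine);

if (m.find()) {
    System.out.println("Price: $" + m.group(1));
}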

Result:

Price: $130.27 ...

Example:

http://ideone.com/fWgvL5#stdout
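
For reference, here is a self-contained sketch of the same idea. Since the original HTML was only posted as an image, the sample line below is an assumption, and the class name PriceExtractor and method extractPrice are hypothetical names used for illustration. The pattern is also a slightly looser variant of the one above, not the same expression: it looks for a literal "$", skips any whitespace or HTML entities, and captures a number with two decimal places, so it does not depend on the surrounding <td> tag.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PriceExtractor {
    // Assumed shape of the input: a '$', optionally followed by whitespace or
    // HTML entities (e.g. &nbsp;), then digits with optional thousands
    // separators and exactly two decimal places.
    private static final Pattern PRICE =
            Pattern.compile("\\$(?:\\s|&[a-zA-Z0-9#]+;)*(\\d+(?:,\\d{3})*\\.\\d{2})");

    // Returns the first price found in the line, without the dollar sign.
    static Optional<String> extractPrice(String html) {
        Matcher m = PRICE.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical sample line; the real markup may differ.
        String inputLine = "<td>$&nbsp;130.27&nbsp;</td>";
        extractPrice(inputLine)
                .ifPresent(price -> System.out.println("Price: $" + price));
    }
}
```

Returning an Optional makes the "no match" case explicit, which is handy when looping over every line of a page where most lines contain no price.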