
Regular expression to extract image url from html code


I want to extract the URL of an image from HTML code such as the following:

<div class="imageContainer">
   <img src="http://ecx.images-amazon.com/images/I/41%2B7N48F7JL._SL135_.jpg"
      alt="" width="135" height="94"
      style="margin-top: 21px; margin-bottom:20px;" /></div>

I found this code online:

String regexImage = "(?<=<img src=\")[^\"]*";
Pattern pImage = Pattern.compile(regexImage);
Matcher mImage = pImage.matcher(elementString);
while (mImage.find()) {
   String imagePath = mImage.group();
}

This works. The regular expression it uses is:

"(?<=<img src=\")[^\"]*"

Now I want to extract the image URL from HTML like this:

<img onerror="img_onerror(this);" data-logit="true" data-pid="MOBDDDBRHVWQZHYY"
   data-imagesize="thumb"
   data-error-url="http://img1a.flixcart.com/mob/thumb/mobile.jpg"
   src="http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg"
   alt="Samsung Galaxy S Duos S7562: Mobile"
   title="Samsung Galaxy S Duos S7562: Mobile"></img></a>
<div class="bp-offer-image image-offer"></div>

Here there are other attributes between img and src=.

I tried the regular expression "(?<=<img (*)src=\")[^\"]*", but it does not work. Please suggest a regular expression that extracts the image URL, i.e. http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg, from the HTML above.
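
The failure can be reproduced in isolation. The following is a minimal sketch, not part of the original code: the class name LookbehindAttempt and the helper compiles() are hypothetical and exist only for illustration. It suggests that the second pattern never gets as far as matching, because the group (*) gives the * quantifier nothing to repeat and Pattern.compile rejects it.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class LookbehindAttempt {

    // Hypothetical helper: returns true if the regex compiles,
    // false if Pattern.compile rejects it with a syntax error.
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            System.out.println("Rejected: " + e.getDescription());
            return false;
        }
    }

    public static void main(String[] args) {
        // Fixed-length lookbehind: accepted.
        System.out.println(compiles("(?<=<img src=\")[^\"]*"));
        // "(*)" has no element in front of the quantifier: rejected.
        System.out.println(compiles("(?<=<img (*)src=\")[^\"]*"));
    }
}
```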

Before that, I use Jsoup to parse the HTML and collect the img tags:

doc = Jsoup.connect(urlFromBrowse).get();
Elements elements = doc.getElementsByTag("img");

for (Element element : elements) {
    String elementString = element.toString();
    // elementString is then passed to pImage.matcher(...) as shown above
}

I pass this elementString to the matcher() method. From each tag (element) I get, I use regular expressions to extract the image URL, name, and so on.


Solution

  • This post is an answer to the question, not a guideline.

    The question was not "RegExp vs DOM", the question was "Regular expression to extract image url from html code".

    Here it is:

    String htmlFragment =
       "<img onerror=\"img_onerror(this);\" data-logit=\"true\" data-pid=\"MOBDDDBRHVWQZHYY\"\n" + 
       "   data-imagesize=\"thumb\"\n" + 
       "   data-error-url=\"http://img1a.flixcart.com/mob/thumb/mobile.jpg\"\n" + 
       "   src=\"http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg\"\n" + 
       "   alt=\"Samsung Galaxy S Duos S7562: Mobile\"\n" + 
       "   title=\"Samsung Galaxy S Duos S7562: Mobile\"></img></a>";
    Pattern pattern =
       Pattern.compile( "(?m)(?s)<img\\s+(.*)src\\s*=\\s*\"([^\"]+)\"(.*)" );
    Matcher matcher = pattern.matcher( htmlFragment );
    if( matcher.matches()) {
       System.err.println(
          "OK:\n" +
          "1: '" + matcher.group(1) + "'\n" +
          "2: '" + matcher.group(2) + "'\n" +
          "3: '" + matcher.group(3) + "'\n" );
    }
    

    and the output:

    OK:
    1: 'onerror="img_onerror(this);" data-logit="true" data-pid="MOBDDDBRHVWQZHYY"
       data-imagesize="thumb"
       data-error-url="http://img1a.flixcart.com/mob/thumb/mobile.jpg"
       '
    2: 'http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg'
    3: '
       alt="Samsung Galaxy S Duos S7562: Mobile"
       title="Samsung Galaxy S Duos S7562: Mobile"></img></a>'
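
The pattern above is used with matches(), so it expects the whole input to be a single img tag, and its greedy (.*) would settle on the last src= in the input. For input that holds several tags, a find()-based variant may be more convenient. The following is a minimal sketch under stated assumptions, not a definitive implementation: the class name ImgSrcExtractor and the method extractSrc are hypothetical, attribute values are assumed to be double-quoted, and no attribute value before src is assumed to contain a '>' character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcExtractor {

    // "<img", whitespace, then optionally any run of non-'>' characters ending
    // in whitespace (so "data-src" is not mistaken for "src"), then src="...".
    // Assumption: double-quoted values, and no '>' inside earlier attributes.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\s+(?:[^>]*?\\s)?src\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Collects the src value of every img tag found in the input, in order.
    public static List<String> extractSrc(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html =
            "<div class=\"imageContainer\">\n" +
            "   <img src=\"http://ecx.images-amazon.com/images/I/41%2B7N48F7JL._SL135_.jpg\"\n" +
            "      alt=\"\" width=\"135\" height=\"94\" /></div>\n" +
            "<img onerror=\"img_onerror(this);\" data-pid=\"MOBDDDBRHVWQZHYY\"\n" +
            "   data-error-url=\"http://img1a.flixcart.com/mob/thumb/mobile.jpg\"\n" +
            "   src=\"http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg\"\n" +
            "   alt=\"Samsung Galaxy S Duos S7562: Mobile\"></img>";
        for (String url : extractSrc(html)) {
            System.out.println(url);
        }
    }
}
```

Because the optional prefix must end in whitespace, an attribute such as data-src is skipped and only a standalone src attribute is captured. The URL in data-error-url is ignored for the same reason.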