Tags: php, regex, dom, simplexml, preg-match-all

How to parse the <a> text and <img> src inside each <li> tag using PHP?


I have an HTML string containing many <li> ... </li> blocks. I want to extract the following data from each <li> ... </li> block:

   1: call.php?category=fruits&amp;fruitid=123456
   2: mango season
   3: http://imagehosting.com/images/fru_123456.png

I used preg_match_all to get the first value, but how do I get the second and third values? I would be grateful if someone could show me how to get them. Thanks in advance.

php:

preg_match_all('/getit(.*?)detailFruit/', $code2, $match);

var_dump($match);

// iterate the new array
for ($i = 0; $i < count($match[0]); $i++) {
    $code3 = str_replace('getit(\'', '', $match[0]);
    $code4 = str_replace('&amp;\',detailFruit', '', $code3);
    echo "<br>" . $code4[$i];
}

Sample <li> ... </li> data:

<li><a id="FR123456" onclick="setFood(false);setSeasonFruitID('123456');getit('call.php?category=fruits&amp;fruitid=123456&amp;',detailFruit,false);">mango season</a><img src="http://imagehosting.com/images/fru_123456.png">
            </li>

Edit: I am now using DOM and can get the second and third values. How do I get the first value using DOM?

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($code2);
$xpath = new DOMXPath($dom);

// Empty array to hold all links to return
$result = array();

//Loop through each <li> tag in the dom
foreach($dom->getElementsByTagName('li') as $li) {
    //Loop through each <a> tag within the li, then extract the node value
    foreach($li->getElementsByTagName('a') as $links){
        $result[] = $links->nodeValue;
        echo $result[0] . "\n";
    }

    $imgs = $xpath->query("//li/img/@src");

    foreach ($imgs as $img) {
        echo $img->nodeValue . "\n";
    }
}

Solution

  • Interesting question :-) The following solution combines DOMDocument and SimpleXML to get values 2 and 3 easily. DOMDocument is used to load the markup first because your HTML snippet is not well-formed XML (the <img> tag is not closed, for instance), so SimpleXML cannot parse it directly. To extract the link (value 1) from the JavaScript in the onclick attribute, a simple regex is used:

    ~getit\('([^']+)'~
    # search for getit( and a single quote literally
    # capture everything up to (but not including) the next single quote
    # this is saved in group 1
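To see the pattern in isolation, here is a minimal sketch run against a single onclick value. It assumes the attribute has already been read through DOM or SimpleXML, so `&amp;` has been decoded to `&`. In the sample data the closing quote is followed by a comma, so the pattern stops at that quote. The variable names and the `rtrim` step (which drops the trailing `&`) are illustrative additions, not part of the original answer.

```php
<?php
// Hypothetical input: the onclick value as DOM/SimpleXML would return it
// (entities such as &amp; are already decoded to & at that point).
$onclick = "setFood(false);setSeasonFruitID('123456');"
         . "getit('call.php?category=fruits&fruitid=123456&',detailFruit,false);";

// Match "getit('" literally, then capture everything up to the next single quote.
$pattern = "~getit\\('([^']+)'~";

$link = null;
if (preg_match($pattern, $onclick, $m) === 1) {
    // Assumption: the trailing "&" is not wanted, so trim it off.
    $link = rtrim($m[1], '&');
}

echo $link, "\n";
// prints: call.php?category=fruits&fruitid=123456
```

Using `preg_match` rather than `preg_match_all` is enough here, since each onclick value contains only one `getit(...)` call.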
    

    A complete walkthrough can be found below (obviously I made up the banana part):

    <?php
    $html = '<ul>
    <li><a id="FR123456" onclick="setFood(false);setSeasonFruitID(\'123456\');getit(\'call.php?category=fruits&amp;fruitid=123456&amp;\',detailFruit,false);">mango season</a><img src="http://imagehosting.com/images/fru_123456.png"></li>
    <li><a id="FR7890" onclick="setFood(false);setSeasonFruitID(\'7890\');getit(\'call.php?category=fruits&amp;fruitid=7890&amp;\',detailFruit,false);">bananas</a><img src="http://imagehosting.com/images/fru_7890.png"></li>
            </ul>';
    
    $dom = new DOMDocument;
    $dom->strictErrorChecking = FALSE;
    $dom->loadHTML($html);
    $xml = simplexml_import_dom($dom);
    
    # xpath to find list items
    $items = $xml->xpath("//ul/li");
    
    # in the onclick value the closing quote is followed by a comma, so the pattern ends at the quote
    $regex = "~getit\('([^']+)'~";
    
    # loop over the items
    foreach ($items as $item) {
        $title = $item->a->__toString();
        $imgLink = $item->img["src"];
    
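        # cast the attribute to a plain string before matching
        $onclick = (string) $item->a["onclick"];
    
        # take capture group 1, or null when no getit('...') call is present
        $jsLink = preg_match($regex, $onclick, $matches) ? $matches[1] : null;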
    
        echo "Title: $title, imgLink: $imgLink, jsLink: $jsLink\n";
        // output: Title: mango season, imgLink: http://imagehosting.com/images/fru_123456.png, jsLink: call.php?category=fruits&fruitid=123456&
        //         Title: bananas, imgLink: http://imagehosting.com/images/fru_7890.png, jsLink: call.php?category=fruits&fruitid=7890&
    }
    
    ?>
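For the edit in the question (getting all three values with DOM alone), a minimal sketch using DOMXPath with queries relative to each <li> might look like the following. It reuses the sample markup, including the made-up banana item. The `$rows` array and its keys are names chosen here for illustration, and the trailing `&` is trimmed on the assumption that it is not wanted.

```php
<?php
// Sample markup from the question; the second item is the made-up banana entry.
$html = <<<'HTML'
<ul>
<li><a id="FR123456" onclick="setFood(false);setSeasonFruitID('123456');getit('call.php?category=fruits&amp;fruitid=123456&amp;',detailFruit,false);">mango season</a><img src="http://imagehosting.com/images/fru_123456.png"></li>
<li><a id="FR7890" onclick="setFood(false);setSeasonFruitID('7890');getit('call.php?category=fruits&amp;fruitid=7890&amp;',detailFruit,false);">bananas</a><img src="http://imagehosting.com/images/fru_7890.png"></li>
</ul>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = [];

foreach ($xpath->query('//li') as $li) {
    // Queries starting with "./" are evaluated relative to the current <li>,
    // so each item only sees its own <a> and <img>.
    $title   = $xpath->evaluate('string(./a)', $li);
    $onclick = $xpath->evaluate('string(./a/@onclick)', $li);
    $img     = $xpath->evaluate('string(./img/@src)', $li);

    // Pull the first argument of getit('...') out of the onclick value.
    $link = preg_match("~getit\\('([^']+)'~", $onclick, $m) === 1
        ? rtrim($m[1], '&')
        : null;

    $rows[] = ['link' => $link, 'title' => $title, 'img' => $img];
}

print_r($rows);
```

Passing the `<li>` node as the context argument is what keeps the values paired per item. An absolute query such as `//li/img/@src` inside the loop would return every image in the document on each iteration.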