
Get all <span> texts containing non-zero dollar amounts from an HTML string


In my test, there is the following data:

>0 Dollar</span
>0.01 Dollar</span
>0.00 Dollar</span 
>50.00 Dollar</span

What I want:

I want to keep only the dollar amounts that are not 0.00 Dollar or 0 Dollar.

Code that I am using:

$str = $table['contents'];
$pattern = "/(Need help here)/";    
$a = preg_match_all($pattern, $str, $matches);
print_r($matches);

The output should be an array containing the values 0.01 Dollar and 50.00 Dollar.


Solution

  • You could use DOMDocument and DOMXPath, and register preg_match as a PHP function so that it can be called inside an XPath query.

    In the example I have used //span, which selects all span elements, but you can make the query more specific to your data.

    $html = <<<HTML
    <span>0 Dollar</span>
    <span>0.01 Dollar</span>
    <span>0.00 Dollar</span>
    <span>50.00 Dollar</span>
    HTML;
    
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    
    $xp = new DOMXPath($dom);
    $xp->registerNamespace("php", "http://php.net/xpath");
    $xp->registerPHPFunctions('preg_match');
    $pattern = '/\A(?=[0.]*[1-9])\d+(?:\.\d+)?+\h+Dollar\z/';
    $spans = $xp->query("//span[php:functionString('preg_match', '$pattern', text())>0]");
    
    foreach ($spans as $span) {
        echo $span->nodeValue . PHP_EOL;
    }
    

    Output

    0.01 Dollar
    50.00 Dollar
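
    As a variation on the code above, here is a minimal sketch that uses a narrower XPath query and applies the same pattern in plain PHP while looping, so registerPHPFunctions is not needed. The table markup and the price class name are assumptions made up for illustration; they are not part of the data in the question.

    ```php
    <?php
    // Hypothetical markup: the "price" class is an assumption for illustration.
    $html = '<table><tr>'
        . '<td class="price"><span>0 Dollar</span></td>'
        . '<td class="price"><span>0.01 Dollar</span></td>'
        . '<td class="price"><span>0.00 Dollar</span></td>'
        . '<td class="price"><span>50.00 Dollar</span></td>'
        . '<td><span>7.00 Dollar</span></td>'
        . '</tr></table>';

    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $xp = new DOMXPath($dom);

    // Same anchored pattern as above: a non-zero amount followed by "Dollar".
    $pattern = '/\A(?=[0.]*[1-9])\d+(?:\.\d+)?+\h+Dollar\z/';

    $result = [];
    // Select only spans that are direct children of cells with class="price".
    foreach ($xp->query("//td[@class='price']/span") as $span) {
        $text = trim($span->nodeValue);
        if (preg_match($pattern, $text) === 1) {
            $result[] = $text;
        }
    }

    var_export($result);
    ```

    With this sample markup, the span holding 7.00 Dollar is skipped because its cell does not have the price class, and the two zero amounts are rejected by the pattern.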
    



    If you want to use a regex only, you could match the leading > and assert the trailing <. In the previous code example, \A and \z are anchors that assert the start and the end of the string.

    >\K(?=[0.]*[1-9])\d+(?:\.\d+)?+\h+Dollar(?=<)
    

    The pattern matches:

    • > Match literally
    • \K Forget what is matched so far
    • (?=[0.]*[1-9]) Positive lookahead, assert at least a digit 1-9 preceded by optional zeroes or dots
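    • \d+(?:\.\d+)?+ Match 1+ digits with an optional decimal part (?+ is a possessive optional quantifier)
    • \h+Dollar Match 1+ horizontal whitespace characters followed by Dollar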
    • (?=<) Positive lookahead, assert < to the right
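
    To see how the lookahead treats different amounts, here is a small sketch that runs the anchored form of the pattern against a few extra values. The sample inputs are made up for illustration and go beyond the data in the question.

    ```php
    <?php
    // Anchored variant of the pattern, tested against single values.
    $pattern = '/\A(?=[0.]*[1-9])\d+(?:\.\d+)?+\h+Dollar\z/';

    // Illustrative inputs: zero amounts in several spellings, plus non-zero
    // amounts that start with or contain zeroes.
    $samples = [
        '0 Dollar',
        '0.00 Dollar',
        '00.00 Dollar',
        '0.10 Dollar',
        '100 Dollar',
        '10.00 Dollar',
    ];

    $kept = [];
    foreach ($samples as $sample) {
        // The lookahead only succeeds if a digit 1-9 follows a run of zeroes/dots.
        if (preg_match($pattern, $sample) === 1) {
            $kept[] = $sample;
        }
    }

    var_export($kept);
    ```

    Only 0.10 Dollar, 100 Dollar and 10.00 Dollar are kept: in each of them the run of zeroes and dots at the start (possibly empty) is directly followed by a digit 1-9.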


    For example, applied to the data from the question:

    $data = <<<DATA
    >0 Dollar</span
    >0.01 Dollar</span
    >0.00 Dollar</span 
    >50.00 Dollar</span
    DATA;
    $regex = '/>\K(?=[0.]*[1-9])\d+(?:\.\d+)?+\h+Dollar(?=<)/';
    preg_match_all($regex, $data, $matches);
    var_export($matches[0]);
    

    Output

    array (
      0 => '0.01 Dollar',
      1 => '50.00 Dollar',
    )
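
    As an alternative sketch (a different technique from the lookahead used above), you could capture the numeric part and compare it as a number in PHP, which keeps the regex simpler. The variable names are illustrative.

    ```php
    <?php
    $data = ">0 Dollar</span\n>0.01 Dollar</span\n>0.00 Dollar</span\n>50.00 Dollar</span";

    // Group 0 is the text after \K (e.g. "0.01 Dollar"); group 1 is the number.
    preg_match_all('/>\K(\d+(?:\.\d+)?)\h+Dollar(?=<)/', $data, $matches, PREG_SET_ORDER);

    $nonZero = [];
    foreach ($matches as [$full, $number]) {
        // Keep the match only when the captured number is greater than zero.
        if ((float) $number > 0) {
            $nonZero[] = $full;
        }
    }

    var_export($nonZero);
    ```

    This moves the non-zero rule out of the pattern and into an ordinary numeric comparison, which may be easier to adjust later, for example to apply a minimum amount.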