Tags: php, arrays, dom, xpath, html-parsing

Get class value and text from qualifying span tags in an HTML document


Please help me with the following pattern for preg_match_all.

How do I change my pattern to get the desired output?

In a string, search for span tags with a class name starting with 'email_' (email_, email_p_12, email_22, or email_xx).

Get the text between the tags: <span class=" xx email_xx xx "> THE EMAIL ADDRESS </span>

Get the class name starting with 'email_'.

This is my pattern: $pattern = '~<span class=\"((.*?)*)*(email_(.*?))?(.*?)\">(.*?)</span>~';

What I need is an Array like this:

Array
(
    [0] => Array
        (
            [mail] => labore@et.de
            [class] => email_p_14
        )

    [1] => Array
        (
            [mail] => esse@cillum.de
            [class] => email_p_22
        )

    [2] => Array
        (
            [mail] => anim@id.de
            [class] => email_
        )

    [3] => Array
        (
            [mail] => laboris@nisi.de
            [class] => email_
        )

)

File:

<?php
    
$string = '
<p>
Lorem ipsum dolor sit amet, 
consectetur adipisicing elit, 
sed do eiusmod tempor incididunt ut

    <span class=" red email_p_14">labore@et.de</span>

dolore magna aliqua. Ut enim ad minim veniam, 
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea consequat. 
Duis aute irure in reprehenderit in voluptate velit

    <span class="email_p_22">esse@cillum.de</span>

dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, 
sunt in culpa qui officia deserunt mollit

    <span class="blue email_ green">anim@id.de</span>

laborum. Donec elementum ligula.
Quis nostrud exercitation ullamco 

    <span class="blue email_ green black">laboris@nisi.de</span>

aliquip ex ea consequat. 
</p>';


/* Looking for these:

<span class=" red email_p_14">labore@et.de</span>
<span class="email_p_22">esse@cillum.de</span>
<span class="blue email_ green">anim@id.de</span>
<span class="blue email_ green black">laboris@nisi.de</span>

*/


$pattern = '~<span class=\"((.*?)*)*(email_(.*?))?(.*?)\">(.*?)</span>~';

preg_match_all($pattern, $string, $m);

$clean_array = array_filter(array_map('array_filter', $m));

ksort($clean_array);
$output = Array();

foreach($clean_array as $row) {
    foreach($row as $key => $val){
        $output[$key][]=$val;
    }
}
print("<pre>".print_r($output,true)."</pre>");

This is what I get:

Array
(
    [0] => Array
        (
            [0] => labore@et.de
            [1] =>  red email_p_14
            [2] => labore@et.de
        )

    [1] => Array
        (
            [0] => esse@cillum.de
            [1] => email_
            [2] => p_22
            [3] => esse@cillum.de
        )

    [2] => Array
        (
            [0] => anim@id.de
            [1] => blue email_ green
            [2] => anim@id.de
        )

    [3] => Array
        (
            [0] => laboris@nisi.de
            [1] => blue email_ green black
            [2] => laboris@nisi.de
        )

)
    


Solution

  • Parse the HTML with DOMDocument and XPath. Once you have targeted the qualifying span nodes, extract the text and the class name starting with 'email_' from each one, then push the new subarray into the result.

    Code:

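    $dom = new DOMDocument;
    libxml_use_internal_errors(true);  // collect parser warnings instead of emitting them
    $dom->loadHTML($string);           // $string is the HTML from the question
    $xpath = new DOMXPath($dom);

    $result = [];
    // Select spans whose class attribute starts with "email_" or contains " email_"
    foreach ($xpath->query("//span[starts-with(@class, 'email_') or contains(@class, ' email_')]") as $span) {
        $result[] = [
            'mail' => $span->nodeValue,
            // Keep only the class token that begins with "email_"
            'class' => preg_replace(
                '~.*\b(email_\S*).*~',
                '$1',
                $span->getAttribute('class')
            )
        ];
    }
    var_export($result);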
    

    Output:

    array (
      0 => 
      array (
        'mail' => 'labore@et.de',
        'class' => 'email_p_14',
      ),
      1 => 
      array (
        'mail' => 'esse@cillum.de',
        'class' => 'email_p_22',
      ),
      2 => 
      array (
        'mail' => 'anim@id.de',
        'class' => 'email_',
      ),
      3 => 
      array (
        'mail' => 'laboris@nisi.de',
        'class' => 'email_',
      ),
    )
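
A possible variation, in case the class attribute may separate names with tabs or newlines, or may contain hyphenated names such as my-email_x: the sketch below normalizes whitespace inside the XPath expression and then picks the class name by splitting the attribute into tokens rather than using \b. The function name extractEmailSpans and the sample markup are made up for illustration; they are not part of the question or of the answer above.

```php
<?php
// Hypothetical helper: the name and signature are illustrative only.
function extractEmailSpans(string $html): array
{
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);   // collect parser warnings instead of emitting them
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // normalize-space() trims the attribute and collapses tabs/newlines into
    // single spaces; after prepending one space, every class token is preceded
    // by a space, so ' email_' only matches at the start of a token.
    $query = "//span[contains(concat(' ', normalize-space(@class)), ' email_')]";

    $result = [];
    foreach ($xpath->query($query) as $span) {
        $tokens = preg_split('/\s+/', trim($span->getAttribute('class')));
        foreach ($tokens as $token) {
            if (strpos($token, 'email_') === 0) {   // first token starting with email_
                $result[] = ['mail' => $span->nodeValue, 'class' => $token];
                break;
            }
        }
    }
    return $result;
}

// Made-up sample input, shorter than the string in the question.
$html = '<p><span class=" red email_p_14">labore@et.de</span>'
      . '<span class="blue email_ green">anim@id.de</span>'
      . '<span class="my-email_x">skip@me.de</span></p>';

var_export(extractEmailSpans($html));
```

With this sample, the third span should be skipped because my-email_x does not start with email_. If a span carries more than one email_ class, only the first one is kept; adjust the inner loop if you need a different rule.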