Tags: php, html, dom, html-parsing, text-extraction

Get the text from all elements with a nominated class as a flat array


I know PHP's DOM extension can parse HTML, but I have a specific requirement. I have HTML content like the following:

<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>

I want to parse the above HTML and save the text into two separate arrays, $heading and $content:

$heading = array('Chapter 1', 'Chapter 2', 'Chapter 3');
$content = array('This is chapter 1', 'This is chapter 2', 'This is chapter 3');

I can achieve this easily with jQuery, but I am not sure that's the right way.


Solution

  • Take a look at PHP Simple HTML DOM Parser.

    Its selector syntax is similar to jQuery's, so you can easily select any element you want by ID or class.

    // Load the Simple HTML DOM Parser first (adjust the path to wherever the file lives)
    require_once 'simple_html_dom.php';
    
    $html_string = '
        <p class="Heading1-P">
            <span class="Heading1-H">Chapter 1</span>
        </p>
        <p class="Normal-P">
            <span class="Normal-H">This is chapter 1</span>
        </p>
        <p class="Heading1-P">
            <span class="Heading1-H">Chapter 2</span>
        </p>
        <p class="Normal-P">
            <span class="Normal-H">This is chapter 2</span>
        </p>
        <p class="Heading1-P">
            <span class="Heading1-H">Chapter 3</span>
        </p>
        <p class="Normal-P">
            <span class="Normal-H">This is chapter 3</span>
        </p>';
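    $html = str_get_html($html_string);

    // Initialise both arrays so they exist even if nothing matches
    $heading = array();
    $content = array();

    // innertext returns the inner HTML of each span; use ->plaintext instead
    // if the spans may contain nested tags
    foreach ($html->find('span') as $element) {
        if ($element->class === 'Heading1-H') {
            $heading[] = $element->innertext;
        } elseif ($element->class === 'Normal-H') {
            $content[] = $element->innertext;
        }
    }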
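
Since the question mentions PHP DOM, here is a rough sketch of the same extraction using only the built-in DOMDocument and DOMXPath classes, with no third-party library. The helper name textsByClass is made up for illustration, and the sketch assumes the class name contains no quotes or other characters that would break the XPath expression.

```php
<?php
// Hypothetical helper (not part of any library): returns the trimmed text
// of every element whose class attribute contains $class as a whole token.
function textsByClass(DOMXPath $xpath, string $class): array
{
    // Pad the class attribute with spaces so "Normal-H" does not
    // accidentally match a longer name such as "Normal-Hx".
    // Assumption: $class is a plain class name without quotes.
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

$html_string = '
    <p class="Heading1-P"><span class="Heading1-H">Chapter 1</span></p>
    <p class="Normal-P"><span class="Normal-H">This is chapter 1</span></p>
    <p class="Heading1-P"><span class="Heading1-H">Chapter 2</span></p>
    <p class="Normal-P"><span class="Normal-H">This is chapter 2</span></p>
    <p class="Heading1-P"><span class="Heading1-H">Chapter 3</span></p>
    <p class="Normal-P"><span class="Normal-H">This is chapter 3</span></p>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html_string);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$heading = textsByClass($xpath, 'Heading1-H');
$content = textsByClass($xpath, 'Normal-H');
```

Querying each class separately keeps the two result arrays independent, and textContent returns plain text even when a span contains nested markup.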