
Parse an HTML document and get node value that contains specified content


I have 3 message blocks.

Example:

<!-- message -->
    <div>
        Just the text.
    </div>
<!-- / message -->

<!-- message -->
    <div>
        <div style="margin-left: 20px; margin-top:5px; ">
            <div class="smallfont">Quote:</div>
        </div>
        <div style="margin-right: 20px; margin-left: 20px; padding: 10px;">
            Message from <strong>Nickname</strong> &nbsp;
                <div style="font-style:italic">Hello. It's a quote</div>
                <else /></if>
        </div>
        <br /><br />
        It's the simple text
    </div>
<!-- / message -->

<!-- message -->
    <div>
        Text<br />
        <div style="margin:20px; margin-top:5px; background-color: #30333D">
            <div class="smallfont" style="margin-bottom:2px">PHP code:</div>
            <div class="alt2" style="margin:0px; padding:6px; border:1px inset; width:640px; height:482px; overflow:auto; background-color:#FFFACA;">
                <code style="white-space:nowrap">
                    <div dir="ltr" style="text-align:left">
                        <!-- php buffer start -->
                            <code>
                                LALALA PHP CODE
                            </code>
                        <!-- php buffer end -->
                    </div>
                </code>
            </div>
        </div><br />
        <br />
        More text
    </div>
<!-- / message -->

I'm trying to write a regular expression for these blocks, but it doesn't work.

preg_match('#<!-- message -->(?P<text>.*?)</div>.*?<!-- / message -->#is', $str, $s);

It only works for the first block.

How can I make the regular expression check whether a message contains a quote or PHP code?

(?P<text>.*?) for text

(?P<phpcode>.*?) for php code

(?P<quotenickname>.*?) for quoted nickname

(?P<quotemessage>.*?) for quote message

and so on.

Thank you so much!

Update for onteria_:

<!-- message -->
    <div>
        Just the text. <b>bold text</b><br/>
        <a href="link">link</a>, <s><i>test</i></s>        
    </div>
<!-- / message -->

Output:

Just the text
,

What do I need to change so that the output also includes the "a", "b", "s", "i" tags and so on? How can I make sure the HTML is not removed?


Solution

  • Notice those responses about not using regex? Why is that? It's because HTML represents structure, and a regular expression cannot reliably follow nested tags. To be honest, that HTML overuses divs instead of using semantic markup, but I'm going to parse it anyway with the DOM functions. Here's the sample HTML I used:

    <html>
    <body>
    <!-- message -->
        <div>
            Just the text.
        </div>
    <!-- / message -->
    
    <!-- message -->
        <div>
            <div style="margin-left: 20px; margin-top:5px; ">
                <div class="smallfont">Quote:</div>
            </div>
            <div style="margin-right: 20px; margin-left: 20px; padding: 10px;">
                Message from <strong>Nickname</strong> &nbsp;
                    <div style="font-style:italic">Hello. It's a quote</div>
            </div>
            <br /><br />
            It's the simple text
        </div>
    <!-- / message -->
    
    <!-- message -->
        <div>
            Text<br />
            <div style="margin:20px; margin-top:5px; background-color: #30333D">
                <div class="smallfont" style="margin-bottom:2px">PHP code:</div>
                <div class="alt2" style="margin:0px; padding:6px; border:1px inset; width:640px; height:482px; overflow:auto; background-color:#FFFACA;">
                    <code style="white-space:nowrap">
                        <div dir="ltr" style="text-align:left">
                            <!-- php buffer start -->
                                <code>
                                    LALALA PHP CODE
                                </code>
                            <!-- php buffer end -->
                        </div>
                    </code>
                </div>
            </div><br />
            <br />
            More text
        </div>
    <!-- / message -->
    </body>
    </html>
    

    Now for the full code:

    $doc = new DOMDocument();
    $doc->loadHTMLFile('test.html');
    
    
    // These just make the code nicer.
    // We could inline them if we wanted to.
    // ----------- Helper Functions ------------
    function HasQuote($part, $xpath) {
      // check the div and see if it contains "Quote:" inside
      return $xpath->query("div[contains(.,'Quote:')]", $part)->length;
    }
    
    function HasPHPCode($part, $xpath) {
      // check the div and see if it contains "PHP code:" inside
      return $xpath->query("div[contains(.,'PHP code:')]", $part)->length;
    }
    // ----------- End Helper Functions ------------
    
    
    // ----------- Parse Functions ------------
    function ParseQuote($quote, $xpath) {
      // The quote content is actually in the next div over.
      // The first nextSibling is the whitespace text node between
      // the two divs, so we step over it. Man this markup is weird.
      $quote = $quote->nextSibling->nextSibling;
    
      $quote_info = array('type' => 'quote');
    
      $nickname = $xpath->query("strong", $quote);
      if($nickname->length) {
        $quote_info['nickname'] = $nickname->item(0)->nodeValue;
      }
    
      $quote_text = $xpath->query("div", $quote);
      if($quote_text->length) {
        $quote_info['quote_text'] = trim($quote_text->item(0)->nodeValue);
      }
    
      return $quote_info;
    }
    
    function ParseCode($code, $xpath) {
      $code_info = array('type' => 'code');
    
      // Relative path from the context node down to the innermost code element.
      // A path starting with "//" would search the whole document instead.
      $code_text = $xpath->query(".//div/code/div/code", $code);
      if($code_text->length) {
        $code_info['code_text'] = trim($code_text->item(0)->nodeValue);
      }
    
      return $code_info;
    }
    
    // ----------- End Parser Functions ------------
    
    function GetMessages($message, $xpath) {
    
      $message_contents = array();
    
      foreach($message->childNodes as $child) {
    
        // So inside of a message if we hit a div
        // We either have a Quote or PHP code, check which
        if(strtolower($child->nodeName) == 'div') {
          if(HasQuote($child, $xpath)) {
            $quote = ParseQuote($child, $xpath);
            // quote_text is only set when the quote div was found
            if(!empty($quote['quote_text'])) {
              $message_contents[] = $quote;
            }
          }
          else if(HasPHPCode($child, $xpath)) {
            $phpcode = ParseCode($child, $xpath);
            // code_text is only set when the inner code element was found
            if(!empty($phpcode['code_text'])) {
              $message_contents[] = $phpcode;
            }
          }
        }
        // Otherwise check if we've found some pretty text
        else if ($child->nodeType == XML_TEXT_NODE) {
          // This might be just whitespace, so check that it's not empty
          $text = trim($child->nodeValue);
          if($text !== '') {
            $message_contents[] = array('type' => 'text', 'text' => $text);
          }
        }
    
      }
    
      return $message_contents;
    }
    
    $xpath = new DOMXpath($doc);
    // We need to get the toplevel divs, which
    // are the messages
    $toplevel_divs = $xpath->query("//body/div");
    
    $messages = array();
    foreach($toplevel_divs as $toplevel_div) {
      $messages[] = GetMessages($toplevel_div, $xpath);
    }
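
A side note on the context node passed to `$xpath->query()`: an expression that starts with `//` is evaluated from the document root even when a context node is supplied, while `.//` stays inside that node. The sketch below illustrates the difference on a made-up two-message document; the markup and the variable names `$absolute` and `$relative` are assumptions for illustration only and are not part of the code above.

```php
<?php
// Hypothetical two-message document, made up for this illustration.
$html = '<html><body>'
      . '<div id="m1"><p>first</p></div>'
      . '<div id="m2"><p>second</p></div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Use the second top-level div as the context node.
$second = $xpath->query('//body/div')->item(1);

// "//p" starts at the document root, so the context node is ignored.
$absolute = $xpath->query('//p', $second);

// ".//p" starts at the context node and only sees its descendants.
$relative = $xpath->query('.//p', $second);

echo $absolute->length, "\n";               // 2
echo $relative->length, "\n";               // 1
echo $relative->item(0)->nodeValue, "\n";   // second
```

This matters as soon as a page holds more than one code block or quote: an absolute path would keep returning the first match in the whole document, whichever message is being parsed.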
    

    Now let's see what $messages looks like:

    Array
    (
        [0] => Array
            (
                [0] => Array
                    (
                        [type] => text
                        [text] => Just the text.
                    )
    
            )
    
        [1] => Array
            (
                [0] => Array
                    (
                        [type] => quote
                        [nickname] => Nickname
                        [quote_text] => Hello. It's a quote
                    )
    
                [1] => Array
                    (
                        [type] => text
                        [text] => It's the simple text
                    )
    
            )
    
        [2] => Array
            (
                [0] => Array
                    (
                        [type] => text
                        [text] => Text
                    )
    
                [1] => Array
                    (
                        [type] => code
                        [code_text] => LALALA PHP CODE
                    )
    
                [2] => Array
                    (
                        [type] => text
                        [text] => More text
                    )
    
            )
    
    )
    

    It's separated by message and then further separated into the different content in the message! Now we can even use a basic print function like this:

    foreach($messages as $message) {
      echo "\n\n>>>>>> Message >>>>>>>\n";
      foreach($message as $content) {
        if($content['type'] == 'text') {
          echo "{$content['text']} ";
        }
        else if($content['type'] == 'quote') {
          echo "\n\n======== Quote =========\n";
          echo "From: {$content['nickname']}\n\n";
          echo "{$content['quote_text']}\n";
          echo "=====================\n\n";
        }
        else if($content['type'] == 'code') {
          echo "\n\n======== Code =========\n";
          echo "{$content['code_text']}\n";
          echo "=====================\n\n";
        }
      }
    }
    
    echo "\n";
    

    And we get this!

    >>>>>> Message >>>>>>>
    Just the text. 
    
    >>>>>> Message >>>>>>>
    
    
    ======== Quote =========
    From: Nickname
    
    Hello. It's a quote
    =====================
    
    It's the simple text 
    
    >>>>>> Message >>>>>>>
    Text 
    
    ======== Code =========
    LALALA PHP CODE
    =====================
    
    More text 
    

    This all works, once again, because the DOM parsing functions are able to understand structure.
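
As for the follow-up in the question about keeping inline tags such as "a", "b", "s" and "i": the loop above only collects text nodes that are direct children of the message, so inline elements and their contents are skipped. One possible approach, shown here only as a sketch, is to serialize every non-div child with `DOMDocument::saveHTML($node)` instead of reading `nodeValue`. The helper name `InlineHtml` is hypothetical, and the sample markup is taken from the updated message in the question.

```php
<?php
// The updated message block from the question, wrapped in html/body.
$html = '<html><body><div>Just the text. <b>bold text</b><br/>'
      . '<a href="link">link</a>, <s><i>test</i></s></div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$message = $xpath->query('//body/div')->item(0);

// Hypothetical helper: serialize the children of a message that are not
// divs (divs are the quote / PHP code blocks handled separately) and not
// comments, keeping the inline markup intact.
function InlineHtml(DOMNode $message) {
  $out = '';
  foreach ($message->childNodes as $child) {
    if (strtolower($child->nodeName) === 'div') {
      continue;
    }
    if ($child->nodeType === XML_COMMENT_NODE) {
      continue;
    }
    // saveHTML() with a node argument returns the markup of that node only.
    $out .= $message->ownerDocument->saveHTML($child);
  }
  return trim($out);
}

$inline = InlineHtml($message);
echo $inline, "\n";
```

Note that the serializer may normalize the markup (for example how `<br/>` is written), so the result is equivalent HTML rather than a byte-for-byte copy of the input.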