PHP with DOM XPath - Remove childNode and arrange string


I have this HTML structure:

<html>
  <body>
    <section>
      <div>
        <div>
          <section>
            <div>
              <table>
                <tbody>
                  <tr></tr>
                  <tr>
                    <td></td>
                    <td></td>
                    <td>
                      <i></i>
                      <div class="first-div class-one">
                        <div class="second-div"> soft </div>
                        130 cm / 15cm
                      </div>
                    </td>
                  </tr>
                  <tr></tr>
                </tbody>
              </table>
            </div>
          </section>
        </div>
      </div>
    </section>
  </body>
</html>

Now, I have this XPath code:

$doc = new DOMDocument();
@$doc->loadHtmlFile('http://www.whatever.com');
$doc->preserveWhiteSpace = false;

$xpath = new DOMXPath( $doc );

$nodelist = $xpath->query( '/html/body/section/div[2]/section/div/table/tbody/tr[2]/td[3]/div' );
foreach ( $nodelist as $node ) {
    $result = $node->nodeValue."\n";
}

This gives me 'soft 130 cm / 15cm' as the result.

But I only want '15', so I need to know:

1. How to exclude the child node's nodeValue ('soft') from the result.

2. Once I have '130 cm / 15cm', how to extract only '15' into a PHP variable.

Can you help? Thanks in advance.


Solution

  • Text inside an element is itself a child node, specifically a DOMText. By iterating over the children of that div, you can pick out the DOMText nodes and read their nodeValue, which leaves out the text of the nested div. An example:

    $doc = new DOMDocument();
    $doc->loadHTML("<html><body><p>bah</p>Test</body></html>");
    echo $doc->saveHTML();
    
    $xpath = new DOMXPath( $doc );
    $nodelist = $xpath->query( '/html/body' );
    foreach ( $nodelist as $node ) {
        // Walk the direct children of <body> and print only the text nodes.
        foreach ($node->childNodes as $child) {
            if ($child instanceof DOMText) {
                echo $child->nodeValue."\n"; // should output "Test".
            }
        }
    }
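
As a variation on the loop above, XPath can select the text nodes directly, so no instanceof check is needed. This is a minimal sketch using the same sample markup; the `$texts` array is only an illustrative name, and the `[normalize-space()]` predicate is an addition that skips whitespace-only text nodes.

```php
<?php
// Same sample markup as in the example above.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>bah</p>Test</body></html>');

$xpath = new DOMXPath($doc);

// text() matches only text nodes that are direct children of <body>;
// the predicate drops nodes that contain nothing but whitespace.
$texts = [];
foreach ($xpath->query('/html/body/text()[normalize-space()]') as $textNode) {
    $texts[] = trim($textNode->nodeValue);
}

echo implode("\n", $texts), "\n";
```

Appending `/text()` to the query from the question should work the same way for the div in the real page, assuming the path actually matches that document.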
    

    Your second point can be done with a regular expression:

    $string = "130 cm / 15cm";
    
    $matches = array();
    // Capture the digits between "/ " and the trailing "cm".
    if (preg_match('|/ ([0-9]+) ?cm$|', $string, $matches)) {
        echo $matches[1]; // 15
    }
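
If both numbers are ever needed, the pattern above could be extended with named groups. This is only a sketch: the group names `first` and `second` are made up for illustration, and it assumes the string always has the form "<number> cm / <number> cm" with optional spaces.

```php
<?php
$string = "130 cm / 15cm";

// Named groups capture both numbers; \s* tolerates optional
// whitespace around the slash and before each "cm".
$pattern = '~^(?<first>\d+)\s*cm\s*/\s*(?<second>\d+)\s*cm$~';

if (preg_match($pattern, $string, $m)) {
    $first  = (int) $m['first'];
    $second = (int) $m['second'];
    echo $second, "\n";
} else {
    echo "No match\n";
}
```

Checking the return value of preg_match before reading the groups avoids an undefined index notice when the text has an unexpected format.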
    

    Full Solution:

    <?php
    
    $strhtml = '
    <html>
      <body>
        <section>
          <div>
            <div>
              <section>
                <div>
                  <table>
                    <tbody>
                      <tr></tr>
                      <tr>
                        <td></td>
                        <td></td>
                        <td>
                          <i></i>
                          <div class="first-div class-one">
                            <div class="second-div"> soft </div>
                            130 cm / 15cm
                          </div>
                        </td>
                      </tr>
                      <tr></tr>
                    </tbody>
                  </table>
                </div>
              </section>
            </div>
          </div>
        </section>
      </body>
    </html>';
    
    $doc = new DOMDocument();
    @$doc->loadHTML($strhtml);
    echo $doc->saveHTML();
    
    $xpath = new DOMXPath( $doc );
    $nodelist = $xpath->query( '/html/body/section/div/div/section/div/table/tbody/tr[2]/td[3]/div' );
    foreach ( $nodelist as $node ) {
        foreach ($node->childNodes as $child) {
            // Skip element children and whitespace-only text nodes.
            if ($child instanceof DOMText && trim($child->nodeValue) != "") {
                $text = trim($child->nodeValue);
                echo 'Raw: '.$text."\n";
                $matches = array();
                // Only print a value when the pattern actually matches.
                if (preg_match('|/ ([0-9]+) ?cm$|', $text, $matches)) {
                    echo 'Value: '.$matches[1]."\n";
                }
            }
        }
    }
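
The two steps of the full solution could also be wrapped in a small helper. The sketch below is not part of the original answer: the function name `extractLastCm`, the shortened sample markup, and the class-based query are all assumptions made for illustration, and it assumes PHP 7.1 or later for the nullable return type. It uses libxml_use_internal_errors() in place of the @ operator to keep parser warnings quiet.

```php
<?php
// Hypothetical helper: returns the number before the trailing "cm"
// from the first matching text node, or null if nothing matches.
function extractLastCm(string $html, string $query): ?int
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query);
    if ($nodes === false) {
        return null; // malformed XPath expression
    }
    foreach ($nodes as $textNode) {
        // Same pattern as in the answer above.
        if (preg_match('|/ ([0-9]+) ?cm$|', trim($textNode->nodeValue), $matches)) {
            return (int) $matches[1];
        }
    }
    return null;
}

// Shortened sample markup (assumed), keeping only the relevant cell.
$html = '<html><body><table><tr><td>'
      . '<div class="first-div class-one"><div class="second-div"> soft </div> 130 cm / 15cm </div>'
      . '</td></tr></table></body></html>';

// Assumed query: selects the div by class instead of by absolute path,
// then takes only its own non-empty text nodes.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " first-div ")]'
       . '/text()[normalize-space()]';

var_dump(extractLastCm($html, $query));
```

Selecting by class is less brittle than a long absolute path when the surrounding layout changes, but it only helps if the class name is stable on the real page.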