php wordpress dom xpath domdocument

XPath: Get first paragraph after headlines


I want to add a FAQPage schema to my site.

To do so, I need to find every <h2> or <h3> tag with a question mark in it. That would be the question.

After that, I need the first <p> tag after the headline as the answer.

The final result should look like this:

{
    "@type": "Question",
    "name": "How long does it take to process a refund?",
    "acceptedAnswer": {
        "@type": "Answer",
        "text": "CONTENT FROM FIRST P-TAG",
        "url": "https://www.example.com/answer#anchor_link"
    }
}
  • The "name" of the question is the <h2> or <h3> tag.
  • The "url" of the answer is the permalink and the anchor link from <h2> or <h3> tag.
  • These two parameters are already solved.

Unfortunately I couldn't figure out how to get the first paragraph tag after the headline tags.

I need the content of the first paragraph in the following line:

"text": "CONTENT FROM FIRST P-TAG",

Here's my current code so far:

<?php

$content_postid = get_the_ID();
$content_post   = get_post($content_postid);
$content        = $content_post->post_content;
$content        = apply_filters('the_content', $content);
$content        = str_replace(']]>', ']]&gt;', $content);

libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $content);

$xp = new DOMXPath($dom);
$query = "//h2[contains(., '?')] | //h3[contains(., '?')]";

$nodes = $xp->query($query);

$stack = [];

if ($nodes) {

    $faq_count = count($nodes);
    $faq_i = 1;
    
    echo '
    <script type="application/ld+json">
        {
            "@context": "https://schema.org",
            "@type": "FAQPage",
            "mainEntity": [';
    
        foreach($nodes as $node) {
        
            echo '{
                "@type": "Question",
                "name": "'.$node->nodeValue.'",
                "acceptedAnswer": {
                    "@type": "Answer",
                    "text": "CONTENT FROM FIRST P-TAG",
                    "url": "'.get_permalink().'#'.$node->getAttribute('id').'"
                }
            }';
            
            if ($faq_i != $faq_count) :  echo ','; endif; $faq_i++;
        
        }
    
    echo ']}</script>';

}
?>

As you can see I'm using this line to find every <h2> or <h3> tag with a ? in it:

$query = "//h2[contains(., '?')] | //h3[contains(., '?')]";

I guess I need a second $query to find the paragraph after the headline? But how do I check for the first tag after the headline?

I tried this extra query:

$query2 = "//h2[contains(., '?')]/following-sibling::p[1] | //h3[contains(., '?')]/following-sibling::p[1]";

But neither following-sibling:: nor following:: works for me. It always shows the paragraph after the last headline.

Do I need to address the first query, so that I know which level I am on?

Here's an example of $content_post (it's always different):

<h2>Lorem ipsum dolor sit amet?</h2>

<p>consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. Ut wisi enim ad minim</p>

<p>veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat.</p>

<h3>Duis autem vel eum?</h3>

<p>iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi.</p>

<h2>Nam liber tempor cum soluta?</h2>

<h3>nobis eleifend option congue nihil</h3>

<p>imperdiet doming id quod mazim placerat facer possim assum. Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.</p>

<p>Et wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat.</p>

<h3>Duis autem vel?</h3>

<p>eum iriure dolor in hendrerit in vulputate velit esse molestie consequat, vel illum dolore eu feugiat nulla facilisis at vero et accumsan et iusto odio dignissim qui blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi.</p>

<h4>Nam liber tempor cum soluta nobis</h4>

<p>eleifend option congue nihil imperdiet doming id quod mazim placerat facer possim assum.</p>

Solution

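  • Run a second XPath query inside your foreach, passing $node as the context node so that the query is evaluated relative to the current headline:

    foreach ($nodes as $node) {
        // First <p> sibling after this headline; item(0) is null if there is none.
        $p   = $xp->query('./following-sibling::p[1]', $node)->item(0);
        $ans = $p ? trim($p->nodeValue) : '';
        // json_encode() adds the double quotes and escaping that valid JSON requires.
        echo '{
                "@type": "Question",
                "name": ' . json_encode(trim($node->nodeValue)) . ',
                "acceptedAnswer": {
                    "@type": "Answer",
                    "text": ' . json_encode($ans) . '
                }
            }';
    }

    The "url" line and the comma handling from your original loop stay as they are.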
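For reference, here is a fuller sketch of the same per-headline query. It builds the whole FAQPage structure as a PHP array and serializes it once with json_encode(), so no manual comma handling is needed. The sample markup, the id attributes, and the $permalink value are assumptions made for illustration; in WordPress the content would come from apply_filters('the_content', ...) and the URL from get_permalink().

```php
<?php
// Sample markup standing in for the filtered post content (assumption).
$content = <<<HTML
<h2 id="q1">Lorem ipsum dolor sit amet?</h2>
<p>First answer.</p>
<p>More text.</p>
<h3 id="q2">Duis autem vel eum?</h3>
<p>Second answer.</p>
<h4>Not a question</h4>
<p>Ignored.</p>
HTML;

// Hypothetical base URL; in WordPress use get_permalink() instead.
$permalink = 'https://www.example.com/answer';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $content);
$xp = new DOMXPath($dom);

$entities = [];
foreach ($xp->query("//h2[contains(., '?')] | //h3[contains(., '?')]") as $node) {
    // Relative query: "." is the current headline, not the document root.
    $p = $xp->query('./following-sibling::p[1]', $node)->item(0);
    if ($p === null) {
        continue; // no paragraph follows this headline, so skip it
    }
    $entities[] = [
        '@type' => 'Question',
        'name'  => trim($node->nodeValue),
        'acceptedAnswer' => [
            '@type' => 'Answer',
            'text'  => trim($p->nodeValue),
            'url'   => $permalink . '#' . $node->getAttribute('id'),
        ],
    ];
}

$schema = [
    '@context'   => 'https://schema.org',
    '@type'      => 'FAQPage',
    'mainEntity' => $entities,
];

// json_encode() takes care of quoting, escaping, and separators.
echo '<script type="application/ld+json">' . json_encode($schema) . '</script>';
```

Note that following-sibling::p[1] only looks at siblings of the headline and does not stop at the next headline. A question headline that is directly followed by another headline will pick up the paragraph after that second headline, and paragraphs nested inside a wrapper element will not be found.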