PHP create recursive list of header tags from DOM


I want to parse some HTML to create a nested navigation based on the headings in that document.

I'm trying to create an array like this:

[
  'name' => 'section 1',
  'number' => '1',
  'level' => 1,
  'children' => [
    [
      'name' => 'sub section 1',
      'number' => '1.1',
      'level' => 2,
      'children' => []
    ],
    [
      'name' => 'sub section 2',
      'number' => '1.2',
      'level' => 2,
      'children' => []
    ]
  ],
]

So if the document has an H3 after an H2, the code should make that H3 a child of the H2, producing a nested array with child elements for each successive tier of headings.

I guess it needs to do a few main things:

  • Get all of the headings
  • Recursively loop (an H3 after an H2 should be a child in the array)
  • Create the section number (1.1.1 or 1.1.2, for example)

This is my code to extract the headings:

$dom = new \DomDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Extract the heading structure
$xpath = new \DomXPath($dom);
$headings = $xpath->query('//h1|//h2|//h3|//h4|//h5|//h6');

I've tried to create a recursive function, but I'm not sure of the best way to get it working.


Solution

  • This is very difficult to test, as the result depends on how complex the HTML is and on the specific pages you use. The code also does a lot, so I will leave it to you to work through what it does; a full explanation would go on for some time. The XPath is based on the question "XPath select all elements between two specific elements", which shows how to pick out the data between two tags. The test source (test.html) is simply:

    <html>
    <head>
    </head>
    <body>
        <h2>Header 1</h2>
        <h2>Header 2</h2>
        <h3>Header 2.1</h3>
        <h4>Header 2.1.1</h4>
        <h2>Header 3</h2>
        <h3>Header 3.1</h3>
    </body>
    </html>
    

    The actual code is...

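    function extractH ( $level, $xpath, $dom, $position = 0, $number = ''  )  {
        $output = [];
        $prevLevel = $level-1;
        // Select the h{$level} elements that have exactly $position sibling
        // headings of the parent level before them (0 at the top level)
        $headings = $xpath->query("//*/h{$level}[count(preceding-sibling::h{$prevLevel})={$position}]");
        foreach ( $headings as $key => $heading )   {
            // Append this heading's 1-based index to the parent's section number
            $sectionNumber = ltrim($number.".".($key+1), ".");
            $newOutput = ["name" => $heading->nodeValue,
                "number" => $sectionNumber,
                "level" => $level
                ];
            // Position of this heading among ALL sibling headings of its level.
            // $key+1 restarts at 1 under every parent, so passing it on would
            // match the children of the wrong heading (e.g. 2.1.1 under 3.1)
            $childPosition = (int) $xpath->evaluate("count(preceding-sibling::h{$level})", $heading) + 1;
            $children = extractH($level+1, $xpath, $dom, $childPosition, $sectionNumber);
            if ( !empty($children) )    {
                $newOutput["children"] = $children;
            }
            $output[] = $newOutput;
        }

        return $output;
    }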
    
    $html = file_get_contents("test.html");
    $dom = new \DomDocument();
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new \DomXPath($dom);
    $output = extractH(2, $xpath, $dom);
    print_r($output);
    

    The call to extractH() takes a few parameters. As the sample HTML starts with h2 tags (no h1), the first parameter is 2. The next two are the DOMXPath and DOMDocument objects to work with.
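
As an alternative to running one XPath query per level, the nesting can also be built in a single pass over the headings in document order, keeping a stack of the sections that are currently open. The following is only a minimal sketch of that idea, not the method used above: `buildHeadingTree()` is a hypothetical helper name, the sample HTML is inlined instead of read from test.html, and the sketch assumes that a heading which skips a level (an h4 directly after an h2, say) should simply become a child of the nearest shallower heading.

```php
<?php
// Hypothetical helper (not part of the answer above): builds the nested
// heading list in one pass, using a stack of the currently open sections.
function buildHeadingTree(DOMXPath $xpath): array
{
    $tree  = [];
    $stack = []; // references to the open sections, outermost first

    // A single location path, so the headings come back in document order.
    $query = '//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]';
    foreach ($xpath->query($query) as $heading) {
        // "h3" -> 3
        $level = (int) substr($heading->nodeName, 1);

        // Close every open section at the same or a deeper level.
        while ($stack && $stack[array_key_last($stack)]['level'] >= $level) {
            array_pop($stack);
        }

        // The new heading goes under the innermost open section, or at the top.
        if ($stack) {
            $top          = array_key_last($stack);
            $parentNumber = $stack[$top]['number'];
            $siblings     = &$stack[$top]['children'];
        } else {
            $parentNumber = '';
            $siblings     = &$tree;
        }

        $siblings[] = [
            'name'     => trim($heading->textContent),
            'number'   => ltrim($parentNumber . '.' . (count($siblings) + 1), '.'),
            'level'    => $level,
            'children' => [],
        ];

        // The heading just added becomes the innermost open section.
        $stack[] = &$siblings[array_key_last($siblings)];
        unset($siblings); // drop the reference before the next iteration
    }

    return $tree;
}

// Same headings as the test.html sample above, inlined for the sketch.
$html = <<<HTML
<html>
<body>
    <h2>Header 1</h2>
    <h2>Header 2</h2>
    <h3>Header 2.1</h3>
    <h4>Header 2.1.1</h4>
    <h2>Header 3</h2>
    <h3>Header 3.1</h3>
</body>
</html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$tree = buildHeadingTree(new DOMXPath($dom));
print_r($tree);
```

For the sample headings this gives the section numbers 1, 2, 2.1, 2.1.1, 3 and 3.1, and every entry carries a `children` key (empty when there are none), which matches the array shape asked for in the question. Unlike the sibling-counting XPath, it does not require the headings to be siblings of one another, but it does rely on the query returning nodes in document order.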