Tags: php, web-scraping, html-parsing, domdocument, siblings

Scrape sibling tags and associate as a parent-child relationship


I want to extract content from two different tags using PHP. I want to associate each h1 tag with the content of the div tags that follow it (up to the next h1), like a parent-child relationship.

<h1>Title 1</h1>
<div class="items">some data and divs here 1</div>
<h1>Title 2</h1>
<div class="items">some data and divs here 2</div>
<div class="items">some data and divs here 3</div>
<h1>Title 3</h1>
<div class="items">some data and divs here 4</div>
<div class="items">some data and divs here 5</div>
<div class="items">some data and divs here 6</div>

The number of items between two h1 tags varies.

I know how to scrape all tags of one kind with simple_html_dom or Goutte\Client to get:

<h1>Title 1</h1>
<h1>Title 2</h1>
<h1>Title 3</h1>

Or

<div class="items">some data and divs here 1</div>
<div class="items">some data and divs here 2</div>
<div class="items">some data and divs here 3</div>
<div class="items">some data and divs here 4</div>
<div class="items">some data and divs here 5</div>
<div class="items">some data and divs here 6</div>

But I am unable to associate each title with its data. I cannot figure out how to build an array like this:

array (
  0 => 
  array (
    'item' => 'Title 1',
    'data' => 'some data and divs here 1',
  ),
  1 => 
  array (
    'item' => 'Title 2',
    'data' => 'some data and divs here 2',
  ),
  2 => 
  array (
    'item' => 'Title 2',
    'data' => 'some data and divs here 3',
  ),
  3 => 
  array (
    'item' => 'Title 3',
    'data' => 'some data and divs here 4',
  ),
  4 => 
  array (
    'item' => 'Title 3',
    'data' => 'some data and divs here 5',
  ),
  5 => 
  array (
    'item' => 'Title 3',
    'data' => 'some data and divs here 6',
  ),
)

I've tried to implement something based on sibling traversal, but couldn't find a way to make it work.


Solution

  • Based on the answer to "XPath until next tag", only a few modifications are needed to generate the desired result: select every h1, then walk its following siblings until the next h1 is reached.

    Code: (Demo)

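    // Sample markup from the question; replace with the scraped page source.
    $html = '
        <h1>Title 1</h1>
        <div class="items">some data and divs here 1</div>
        <h1>Title 2</h1>
        <div class="items">some data and divs here 2</div>
        <div class="items">some data and divs here 3</div>
        <h1>Title 3</h1>
        <div class="items">some data and divs here 4</div>
        <div class="items">some data and divs here 5</div>
        <div class="items">some data and divs here 6</div>
    ';

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    // loadHTML() wraps the fragment in <html><body>, hence this path
    $domNodeList = $xpath->query('/html/body/h1');
    
    $result = [];
    foreach ($domNodeList as $element) {
        // Save the h1 text
        $item = $element->nodeValue;
    
        // Loop over the siblings until the next h1
        while ($element = $element->nextSibling) {
            if ($element->nodeName === "h1") {
                break;
            }
            // Only keep element nodes; skip whitespace text nodes and comments
            if ($element->nodeType === XML_ELEMENT_NODE) {
                $result[] = ['item' => $item, 'data' => $element->nodeValue];
            }
        }
    }
    var_export($result);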
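
As a possible alternative to walking the siblings by hand, the association can also be expressed from the other direction: select each div and ask XPath for the nearest h1 that precedes it. The following is only a sketch under a few assumptions: the markup is the sample from the question, every data block carries exactly class="items", and the h1 and div elements share the same parent. The variable names are illustrative.

```php
<?php
// Sample markup copied from the question (assumed input).
$html = <<<'HTML'
<h1>Title 1</h1>
<div class="items">some data and divs here 1</div>
<h1>Title 2</h1>
<div class="items">some data and divs here 2</div>
<div class="items">some data and divs here 3</div>
<h1>Title 3</h1>
<div class="items">some data and divs here 4</div>
<div class="items">some data and divs here 5</div>
<div class="items">some data and divs here 6</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$result = [];
// Select every div whose class attribute is exactly "items".
foreach ($xpath->query('//div[@class="items"]') as $div) {
    // preceding-sibling is a reverse axis, so [1] is the closest h1 before this div.
    // string() yields an empty string if no such h1 exists.
    $title = $xpath->evaluate('string(preceding-sibling::h1[1])', $div);
    $result[] = ['item' => $title, 'data' => $div->nodeValue];
}
var_export($result);
```

With the sample markup this should produce the same six item/data pairs as the loop above. One difference in behavior: a div that appears before the first h1 is still included here, with an empty 'item', whereas the sibling-walking version never reaches it.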