Tags: php, html, dom, simple-html-dom

How to use simplehtmldom to extract data from this page


I am trying to extract information from https://benthamopen.com/browse-by-title/B/1/ using simplehtmldom.

Specifically, I want to access the parts of the page that look like this:

<div style="padding:10px;">
<strong>ISSN: </strong>1874-1207<br><div class="sharethis-inline-share-buttons" style="padding-top:10px;" data-url="https://benthamopen.com/TOBEJ/home/" data-title="The Open Biomedical Engineering Journal"></div>
</div>

I have this code:

$html = file_get_html('https://benthamopen.com/browse-by-title/B/1/');

foreach($html->find('div[style=padding:10px;]') as $ele) {
    echo("<pre>".print_r($ele,true)."</pre>");
}

... which returns the following (only one item from the page is shown):

simplehtmldom\HtmlNode Object
(
    [nodetype] => HDOM_TYPE_ELEMENT (1)
    [tag] => div
    [attributes] => Array
        (
            [style] => padding:10px;
        )

    [nodes] => Array
        (
            [0] => simplehtmldom\HtmlNode Object
                (
                    [nodetype] => HDOM_TYPE_ELEMENT (1)
                    [tag] => strong
                    [attributes] => none
                    [nodes] => none
                )

            [1] => simplehtmldom\HtmlNode Object
                (
                    [nodetype] => HDOM_TYPE_TEXT (3)
                    [tag] => text
                    [attributes] => none
                    [nodes] => none
                )

            [2] => simplehtmldom\HtmlNode Object
                (
                    [nodetype] => HDOM_TYPE_ELEMENT (1)
                    [tag] => br
                    [attributes] => none
                    [nodes] => none
                )

            [3] => simplehtmldom\HtmlNode Object
                (
                    [nodetype] => HDOM_TYPE_ELEMENT (1)
                    [tag] => div
                    [attributes] => Array
                        (
                            [class] => sharethis-inline-share-buttons
                            [style] => padding-top:10px;
                            [data-url] => https://benthamopen.com/TOBEJ/home/
                            [data-title] => The Open Biomedical Engineering Journal
                        )

                    [nodes] => none
                )

        )

)

I am unsure how to proceed from here. I want to extract:

  • the ISSN text (which does not show in the print_r output - not sure why) [1874-1207 in the above example]. Judging by the dump, it should be the text node at index 1 of [nodes], right after the <strong> element at index 0
  • the 'data-url' [https://benthamopen.com/TOBEJ/home/, in the above example]
  • the 'data-title' [The Open Biomedical Engineering Journal, in the above example]

Perhaps my understanding of PHP objects and arrays is not as good as it should be, and I do not know why the ISSN does not show in the echo statement.

I have tried many different things, but I am still struggling to extract the data from the element.


Solution

  • I'm not familiar with simplehtmldom, other than to know to avoid it. So I'll present a solution that uses PHP's built-in DOM classes:

    <?php
    // collect libxml parse warnings instead of printing them
    libxml_use_internal_errors(true);
    // get the HTML
    $html = file_get_contents("https://benthamopen.com/browse-by-title/B/1/");
    if ($html === false) {
        exit("Could not fetch the page\n");
    }
    
    // create a DOM object and load it up
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    
    // create an XPath object and query it
    $xpath = new DOMXPath($dom);
    $elements = $xpath->query("//div[@style='padding:10px;']");
    
    // loop through the matches
    foreach ($elements as $el) {
        // skip elements whose text does not start with "ISSN"
        $text = trim($el->textContent);
        if (strpos($text, "ISSN") !== 0) {
            continue;
        }
        // get the first div inside this thing (the share-buttons div)
        $div = $el->getElementsByTagName("div")->item(0);
        if ($div === null) {
            continue;
        }
        // dump it out: ISSN, title, URL
        printf("%s %s %s<br/>\n", str_replace("ISSN: ", "", $text), $div->getAttribute("data-title"), $div->getAttribute("data-url"));
    }
    

    The XPath stuff can be a bit overwhelming, but for simple searches like this it's not much different from CSS selectors. Hopefully the comments explain everything; let me know if not!

    Output:

    1874-1207 The Open Biomedical Engineering Journal https://benthamopen.com/TOBEJ/home/<br/>
    1874-1967 The Open Biology Journal https://benthamopen.com/TOBIOJ/home/<br/>
    1874-091X The Open Biochemistry Journal https://benthamopen.com/TOBIOCJ/home/<br/>
    1875-0362 The Open Bioinformatics Journal https://benthamopen.com/TOBIOIJ/home/<br/>
    1875-3183 The Open Biomarkers Journal https://benthamopen.com/TOBIOMJ/home/<br/>
    2665-9956 The Open Biomaterials Science Journal https://benthamopen.com/TOBMSJ/home/<br/>
    1874-0707 The Open Biotechnology Journal https://benthamopen.com/TOBIOTJ/home/<br/>
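
    To make the comparison between XPath and CSS selectors more concrete, here is a minimal offline sketch that runs against the single fragment quoted in the question rather than the live page. It assumes the real page uses the same markup for every journal; the $rows array and its keys are just illustrative names. It selects the share-buttons div directly (roughly the CSS selector div[style='padding:10px;'] > div[data-url]) and reads the ISSN from the text node that follows the <strong> label, which is the node the question was trying to reach.

```php
<?php
// Sample input: the fragment quoted in the question, wrapped in a minimal page.
$html = <<<'HTML'
<html><body>
<div style="padding:10px;">
<strong>ISSN: </strong>1874-1207<br><div class="sharethis-inline-share-buttons" style="padding-top:10px;" data-url="https://benthamopen.com/TOBEJ/home/" data-title="The Open Biomedical Engineering Journal"></div>
</div>
</body></html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$rows = []; // illustrative name: one entry per journal found

// Match only inner divs that carry a data-url attribute and sit directly
// inside a div with the exact style attribute from the question.
foreach ($xpath->query("//div[@style='padding:10px;']/div[@data-url]") as $share) {
    // The ISSN is the first text node after <strong> in the parent div.
    $issn = trim($xpath->evaluate("string(../strong/following-sibling::text()[1])", $share));

    $rows[] = [
        'issn'  => $issn,
        'title' => $share->getAttribute('data-title'),
        'url'   => $share->getAttribute('data-url'),
    ];
}

print_r($rows);
```

    Collecting the values into an array instead of printing them straight away keeps the extraction separate from the output, so the same loop could feed a CSV file or a database insert. Matching on div[@data-url] also removes the need for the "starts with ISSN" text check, provided every journal block on the page has a share-buttons div.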