php  web-scraping  domdocument

Scrape data from HTML page using DOMDocument


I am new to PHP and am trying to write a script that fetches data from an external site. I am interested in getting the value of Merk, which is Opel. The HTML for it looks like this:

<div class="row">
    <div class="col-6 col-sm-5 label" data-tooltip="<strong>Merk</strong><br/>Het merk van het voertuig. Dit wordt voor alle voertuigsoorten geregistreerd.
<span>bron: RDW</span> ">
        Merk
    <span data-toggle="tooltip" data-html="true" title="<strong>Merk</strong><br/>Het merk van het voertuig. Dit wordt voor alle voertuigsoorten geregistreerd.<br /><span>bron: RDW</span> "></span><span data-toggle="tooltip" data-html="true" title="<strong>Merk</strong><br/>Het merk van het voertuig. Dit wordt voor alle voertuigsoorten geregistreerd.<br /><span>bron: RDW</span> "></span></div>
    <div class="col-6 col-sm-7 value">
Opel
    </div>
</div>

I am trying to get it with the PHP code below:

<?php
// a new dom object
$dom = new domDocument; 

// load the html into the object
$dom->loadHTML('https://centraalbeheerkentekencheck.azurewebsites.net/?kenteken=L-762-LZ'); 

// discard white space
$dom->preserveWhiteSpace = false;

$rowData= $dom->getElementsByTagName('row');

But now I am stuck and do not know how to finish the remaining code so that I get the value of Merk, which is Opel. Please let me know if anyone can help me with this.


Solution

  • I think a SimpleHtmlDom-style library is a better fit for this (for example voku/simple_html_dom):

    composer require voku/simple_html_dom
    

    The SimpleHtmlDom version

    You used the URL https://centraalbeheerkentekencheck.azurewebsites.net/?kenteken=L-762-LZ, but that page only embeds an iframe pointing to https://centraalbeheer.finnik.nl/kenteken/l762lz/gratis, which is where the data actually lives. The script therefore requests the iframe URL directly:

    use voku\helper\HtmlDomParser;
    require_once __DIR__ . "/vendor/autoload.php";
    
    function getBrand(string $license) : string
    {
        // "L-762-LZ" becomes "l762lz", the form used in the URL path.
        $license = strtolower(str_replace("-", "", $license));
        $dom = HtmlDomParser::file_get_html("https://centraalbeheer.finnik.nl/kenteken/".$license."/gratis");
        // The first value cell in the result block holds the brand.
        $brand = $dom->find(".result .row .value")[0]->innerHtml();
        // Remove carriage-return entities, line breaks and surrounding spaces.
        return trim(str_replace(["&#13;", "\n", "\r"], "", $brand));
    }
    
    var_dump(getBrand("L-762-LZ"));
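
    The iframe mentioned above can also be located in code rather than by reading the page source. The following is only a sketch: findIframeSources is a hypothetical helper name, and $outerPage is a simplified stand-in for the outer page, whose real markup may differ.

    ```php
    <?php
    // Hypothetical helper: list the src attribute of every <iframe> in an HTML string.
    function findIframeSources(string $html): array
    {
        // Collect parse warnings internally; real-world HTML is rarely valid.
        $previous = libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML($html); // loadHTML() expects markup, not a URL
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $sources = [];
        foreach ($dom->getElementsByTagName('iframe') as $iframe) {
            if ($iframe->hasAttribute('src')) {
                $sources[] = $iframe->getAttribute('src');
            }
        }
        return $sources;
    }

    // Assumed, simplified version of the outer page (not the real markup).
    $outerPage = '<html><body>'
        . '<iframe src="https://centraalbeheer.finnik.nl/kenteken/l762lz/gratis"></iframe>'
        . '</body></html>';

    print_r(findIframeSources($outerPage));
    ```

    For a live page, the string would come from file_get_contents($url) instead of the hard-coded sample.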
    

    Update: You can also do this with regex

    function getBrandRegex(string $license) : string
    {
        $license = strtolower(str_replace("-", "", $license));
        $content = file_get_contents("https://centraalbeheer.finnik.nl/kenteken/".$license."/gratis");
        preg_match_all('/<div class="col-6 col-sm-7 value">(.*?)<\/div>/s', $content, $matches);
        $brand = $matches[1][0];
        return trim(str_replace(["&#13;", "\n", "\r"], "", $brand));
    }
    
    var_dump(getBrandRegex("L-762-LZ"));
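
    The regex version above assumes the pattern always matches. If the markup changes, $matches[1][0] no longer exists. A minimal sketch of a more defensive variant follows; firstValueByRegex is a hypothetical name, and $sample is a shortened copy of the HTML from the question so the example runs without network access.

    ```php
    <?php
    // Hypothetical helper: return the text of the first "value" cell, or null if none is found.
    function firstValueByRegex(string $html): ?string
    {
        $pattern = '/<div class="col-6 col-sm-7 value">(.*?)<\/div>/s';
        if (preg_match($pattern, $html, $match) !== 1) {
            return null; // no match or a regex error: let the caller decide what to do
        }
        // Drop the carriage-return entity, then strip surrounding whitespace.
        return trim(str_replace('&#13;', '', $match[1]));
    }

    // Shortened sample based on the HTML in the question.
    $sample = '<div class="row"><div class="col-6 col-sm-5 label">Merk</div>'
        . "<div class=\"col-6 col-sm-7 value\">\nOpel&#13;\n    </div></div>";

    var_dump(firstValueByRegex($sample));          // string(4) "Opel"
    var_dump(firstValueByRegex('<p>nothing</p>')); // NULL
    ```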
    

    Update: The DomDocument version

    function getBrandDomDocument(string $license) : string
    {
        // Collect libxml parse warnings internally instead of emitting them.
        // See: https://www.php.net/manual/en/function.libxml-use-internal-errors.php
        libxml_use_internal_errors(true);
        $license = strtolower(str_replace("-", "", $license));
        $dom = new \DOMDocument;
        $dom->preserveWhiteSpace = false; // set before the load call
        $dom->loadHTMLFile("https://centraalbeheer.finnik.nl/kenteken/".$license."/gratis");
    
        // Select the value cells; the first one holds the brand.
        $xpath = new \DOMXPath($dom);
        $data = $xpath->query("//div[contains(@class, 'col-6 col-sm-7 value')]");
    
        return trim(str_replace(["&#13;", "\n", "\r"], "", $data->item(0)->textContent));
    }
    
    var_dump(getBrandDomDocument("L-762-LZ"));
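
    All three versions take the first value cell and rely on Merk being the first row. Looking the value up by its label is less dependent on row order. The sketch below assumes each row holds one label cell and one value cell, as in the question's HTML. The names valueForLabel and hasClass are hypothetical, and the second row in $sample is invented for illustration.

    ```php
    <?php
    // Hypothetical helper: XPath test for a single class token, so "value" does not match "values".
    function hasClass(string $token): string
    {
        return "contains(concat(' ', normalize-space(@class), ' '), ' {$token} ')";
    }

    // Hypothetical helper: return the value cell that shares a row with the given label.
    function valueForLabel(string $html, string $label): ?string
    {
        $previous = libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $xpath = new DOMXPath($dom);
        foreach ($xpath->query('//div[' . hasClass('row') . ']') as $row) {
            $labelNode = $xpath->query('./div[' . hasClass('label') . ']', $row)->item(0);
            $valueNode = $xpath->query('./div[' . hasClass('value') . ']', $row)->item(0);
            if ($labelNode !== null && $valueNode !== null
                && trim($labelNode->textContent) === $label) {
                return trim($valueNode->textContent);
            }
        }
        return null; // label not present
    }

    // Simplified sample; the second row is made up for the example.
    $sample = '<div class="row"><div class="col-6 col-sm-5 label"> Merk <span></span></div>'
        . '<div class="col-6 col-sm-7 value"> Opel </div></div>'
        . '<div class="row"><div class="col-6 col-sm-5 label">Voorbeeld</div>'
        . '<div class="col-6 col-sm-7 value">Test</div></div>';

    var_dump(valueForLabel($sample, 'Merk'));
    ```

    For the live page, the HTML string would come from file_get_contents() on the iframe URL used above.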
    

    Output

    Opel