Tags: php, regex, web-scraping, html-parsing, text-extraction

Parse HTML and create an associative array of products and their prices


I am trying to extract the prices for a product from a webpage using a PHP script. The string in question consists of the following HTML:

<div class="pd_warranty col-xs-12 no-padding">
    <p class="selectWty txtLeft">Available Options</p>
    <div class="vspace clear"></div>
    
<div class="subProd col-xs-4 noPadLR">
    <a href="https://www.example.com/single” class="selected">
        <div class="col-xs-12 cellTable pad5All">
            <div class="col-xs-8 noPadLR cellTableCell">
                <p class="noMar txtLeft">Single</p>
                <p class="noMar txtLeft sml">$99.99</p>
            </div>
        </div>
    </a>
</div>

<div class="subProd col-xs-4 noPadLR">
    <a href="https://www.example.com/2pack” class="">
        <div class="col-xs-12 cellTable pad5All">
            <div class="col-xs-8 noPadLR cellTableCell">
                <p class="noMar txtLeft">2-PACK</p>
                <p class="noMar txtLeft sml">$159.99</p>
            </div>
        </div>
    </a>
</div>

<div class="subProd col-xs-4 noPadLR">
    <a href="https://www.example.com/4pack” class="">
        <div class="col-xs-12 cellTable pad5All">
            <div class="col-xs-8 noPadLR cellTableCell">
                <p class="noMar txtLeft">4-PACK</p>
                <p class="noMar txtLeft sml">$249.99</p>
            </div>
        </div>
    </a>
</div>

</div> 

Most products have three groups of prices: Single, 2-PACK, and 4-PACK.

Some pages may be missing the 2-PACK, the 4-PACK, or both.

My attempts to write a regular expression that extracts the information I need from a variable holding the above string have failed. I am trying to write a PHP regex that extracts the type (Single/2-PACK/4-PACK) and its price into an array[type][price], so that the array shows which types are present in the HTML and at what price.


Solution

  • There are many ways to customize the XPath expression and the handling of each iterated node, but this works on your sample string. You can refine this solution to be more or less strict depending on your needs.

    (Jakub forced me to post this answer, since I don't want you to have to resort to regex.)

    Code:

    $dom = new DOMDocument;
    $dom->loadHTML(str_replace('”', '"', $html));  // normalize the quoting; extend as needed
    $xpath = new DOMXPath($dom);
    $result = [];  // initialize so var_export() still works when nothing matches
    //                        actually targeting this div ---------vvv
    foreach ($xpath->evaluate("//div[contains(@class, 'subProd')]//div[contains(p/@class, 'noMar')]") as $div) {
        $type = $xpath->query("p[contains(@class, 'noMar') and not(contains(@class, 'sml'))]", $div)[0]->nodeValue;
        $price = $xpath->query("p[contains(@class, 'noMar') and contains(@class, 'sml')]", $div)[0]->nodeValue;
        $result[$type] = $price;
    }
    var_export($result);
    

    Output:

    array (
      'Single' => '$99.99',
      '2-PACK' => '$159.99',
      '4-PACK' => '$249.99',
    )
    

    To explain...

    The foreach() iterates over each div, nested inside a subProd div, whose p child has a class attribute containing noMar. For every qualifying div found in the HTML...

    • the type text is extracted from the p element whose class contains noMar but not sml
    • the price text is extracted from the p element whose class contains both noMar and sml

    I am storing the extracted data as a one-dimensional associative array.
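
Since the question mentions that some pages lack the 2-PACK or 4-PACK option, here is a minimal sketch of the same XPath approach wrapped in a function that tolerates missing options. The name extractPrices(), the shortened sample markup, and the 'not offered' label are assumptions made for illustration; they are not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): same XPath approach,
// but it returns an empty array instead of failing when nothing matches.
function extractPrices(string $html): array
{
    $result = [];
    if (trim($html) === '') {
        return $result; // nothing to parse; avoids calling loadHTML() on an empty string
    }

    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $dom = new DOMDocument();
    $dom->loadHTML(str_replace('”', '"', $html)); // normalize curly quotes, as in the answer
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $cells = $xpath->query("//div[contains(@class, 'subProd')]//div[p[contains(@class, 'noMar')]]");
    foreach ($cells as $div) {
        $type  = $xpath->query("p[contains(@class, 'noMar') and not(contains(@class, 'sml'))]", $div)->item(0);
        $price = $xpath->query("p[contains(@class, 'noMar') and contains(@class, 'sml')]", $div)->item(0);
        if ($type !== null && $price !== null) { // skip cells missing either paragraph
            $result[trim($type->nodeValue)] = trim($price->nodeValue);
        }
    }
    return $result;
}

// Assumed sample: a page that only offers the "Single" option.
$html = '<div class="subProd"><a href="https://www.example.com/single">'
      . '<div><div class="cellTableCell">'
      . '<p class="noMar txtLeft">Single</p>'
      . '<p class="noMar txtLeft sml">$99.99</p>'
      . '</div></div></a></div>';

$prices = extractPrices($html);

// A missing key means that option is not on the page.
foreach (['Single', '2-PACK', '4-PACK'] as $type) {
    echo $type, ': ', $prices[$type] ?? 'not offered', PHP_EOL;
}
```

With this sample, only the Single key is set, so the loop reports the other two types as not offered. Checking for a key with isset() or the ?? operator is one way to represent whether each type is present on a given page.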