Tags: regex, preg-match

Regex: selecting text from HTML


I have HTML like the block shown further down, and I want to extract the following text from it:

Company Name ASSOCIATES LLP
                    18-20, FLOOR,, BUILDING,
                    K MARG, NEW - 110001
                    Delhi
                    +(91)124-0000000
                    email@EMAIL.COM

The regex I am using is /Name and address of the Employer(.*)<p>/, but it matches everything up to the last <p> instead of stopping at the next one. Here is the HTML:

<p><b>Certificate under Section 203 of the Income-tax Act, 1961 for tax deducted at source on salary
            </b></p>
        <p><b>Name and address of the Employer
            </b></p>
        <p>Company Name ASSOCIATES LLP
            18-20, FLOOR,, BUILDING,
            K MARG, NEW - 110001
            Delhi
            +(91)124-0000000
            email@EMAIL.COM
        </p>
        <p><b>Name and address of the Employee
            </b></p>
        <p>EMPLOYEE NAME
            EMPLOYEE ADDRESS HERE
        </p>
        <p><b>PAN of the Deductor
            </b></p>
        <p>ACHFS9000A
        </p>
        <p><b>TAN of the Deductor
            </b></p>
        <p>DELS50000E
        </p>

Solution

  • Use DOMDocument and DOMXPath instead of a regex. The query below selects the p node whose b child contains the text Name and address of the Employer; the content you want is in the p element that follows that node:

    $xp->query("//p[contains(./b, 'Name and address of the Employer')]");
    

    See PHP sample code:

    <?php
    $html = <<<HTML
    <p><b>Certificate under Section 203 of the Income-tax Act, 1961 for tax deducted at source on salary
            </b></p>
        <p><b>Name and address of the Employer
            </b></p>
        <p>Company Name ASSOCIATES LLP
            18-20, FLOOR,, BUILDING,
            K MARG, NEW - 110001
            Delhi
            +(91)124-0000000
            email@EMAIL.COM
        </p>
        <p><b>Name and address of the Employee
            </b></p>
        <p>EMPLOYEE NAME
            EMPLOYEE ADDRESS HERE
        </p>
        <p><b>PAN of the Deductor
            </b></p>
        <p>ACHFS9000A
        </p>
        <p><b>TAN of the Deductor
            </b></p>
        <p>DELS50000E
        </p>
    HTML;
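    // Parse the fragment without adding an implied <html>/<body> wrapper or a doctype.
    $dom = new DOMDocument;
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);
    $xp = new DOMXPath($dom);
    // Match the heading <p> whose <b> child contains the label.
    $headings = $xp->query("//p[contains(./b, 'Name and address of the Employer')]");
    foreach ($headings as $heading) {
        // Take the first following <p>; nextSibling may be a whitespace text node.
        $next = $xp->query('following-sibling::p[1]', $heading)->item(0);
        if ($next !== null) {
            echo trim($next->nodeValue);
        }
    }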
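
    The same query pattern should work for the other labelled fields in this document (Employee, PAN, TAN). Below is a minimal sketch, assuming a hypothetical helper named valueAfterLabel() that is not part of the original answer. It parses with the default loadHTML() flags and selects the value paragraph directly with following-sibling::p[1]:

    ```php
    <?php
    // Hypothetical helper (not from the original answer): returns the trimmed text
    // of the first <p> that follows the <p><b>label</b></p> heading, or null.
    function valueAfterLabel(string $html, string $label): ?string
    {
        $dom = new DOMDocument();
        // Collect libxml warnings internally instead of emitting them.
        libxml_use_internal_errors(true);
        $dom->loadHTML($html);
        libxml_clear_errors();

        $xp = new DOMXPath($dom);
        // Assumes $label contains no single quote, since it is embedded in the query.
        $query = "//p[contains(./b, '" . $label . "')]/following-sibling::p[1]";
        $node = $xp->query($query)->item(0);

        return $node === null ? null : trim($node->nodeValue);
    }

    $html = '<p><b>PAN of the Deductor </b></p> <p>ACHFS9000A </p>'
          . '<p><b>TAN of the Deductor </b></p> <p>DELS50000E </p>';

    echo valueAfterLabel($html, 'PAN of the Deductor'), "\n"; // ACHFS9000A
    echo valueAfterLabel($html, 'TAN of the Deductor'), "\n"; // DELS50000E
    ```

    Selecting the sibling inside the XPath expression avoids depending on whether the parser keeps whitespace text nodes between the p elements.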
    

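
    As for why the pattern in the question over-matches: .* is greedy, so when . is allowed to match newlines (the s modifier) it runs to the last <p> in the input. If a regex is still wanted for this fixed layout, a lazy quantifier stops at the first closing tag. Below is a minimal sketch, assuming the label is always followed by </b></p> and then a single <p>...</p>; the shortened sample and the variable names are illustrative only:

    ```php
    <?php
    // Shortened sample of the markup from the question.
    $html = '<p><b>Name and address of the Employer
        </b></p>
    <p>Company Name ASSOCIATES LLP
        Delhi
    </p>
    <p><b>Name and address of the Employee
        </b></p>
    <p>EMPLOYEE NAME
    </p>';

    // \s*</b>\s*</p>\s*<p> skips from the label to the start of the value paragraph.
    // (.*?) is lazy, so it stops at the first </p>; the s modifier lets . match newlines.
    // Using ~ as the delimiter avoids escaping the slashes in the closing tags.
    $pattern = '~Name and address of the Employer\s*</b>\s*</p>\s*<p>(.*?)</p>~s';

    $address = null;
    if (preg_match($pattern, $html, $m)) {
        $address = trim($m[1]);
        echo $address, "\n";
    }
    ```

    This pattern is tied to the exact tag sequence, so the DOM approach above is likely to tolerate markup changes better.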