Tags: php, parsing, web-scraping, guzzle

Randomly missing nodes in html when scraping with GuzzleClient


I'm running into a problem while scraping: the child elements are inconsistent, sometimes present and sometimes missing.

Since I save the data by position in the $values[] array, I found that $values[18] is sometimes the email address and at other times the phone or fax number.

The sample array of three iterations is as follows:

[0] => [
    [1] => Firm: The Firm One Name
    [2] => Firm:
    [3] => The Firm One Name
    [4] => Office: 5th Av. 18980, NY
    [5] => Office:
    [6] => 5th Av. 18980, NY
    [7] => City: New York 
    [8] => City:
    [9] => New York
    [10] => Country: USA
    [11] => Country:
    [12] => USA
    [13] => Tel: +123 4 567 890
    [14] => Tel:
    [15] => +123 4 567 890
    [16] => Email: [email protected]
    [17] => Email:
    [18] => [email protected]
],
[1] => [
    [1] => Firm: The Firm Two Name
    [2] => Firm:
    [3] => The Firm Two Name
    [4] => Office: 5th Av. 342680, NY
    [5] => Office:
    [6] => 5th Av. 342680, NY
    [7] => City: New York
    [8] => City:
    [9] => New York
    [10] => Country: USA
    [11] => Country:
    [12] => USA
    [13] => Tel: +123 4 567 890
    [14] => Tel:
    [15] => +123 4 567 890
    [16] => Fax: +123 4 567 891
    [17] => Fax:
    [18] => +123 4 567 891
    [19] => Email: [email protected]
    [20] => Email:
    [21] => [email protected]
],
[2] => [
    [1] => Firm: The Firm Three Name
    [2] => Firm:
    [3] => The Firm Three Name
    [4] => Office: 5th Av. 89280, NY
    [5] => Office:
    [6] => 5th Av. 89280, NY
    [7] => Country: USA
    [8] => Country:
    [9] => USA
    [10] => Fax: +123 4 567 899
    [11] => Fax:
    [12] => +123 4 567 899
    [13] => Email: [email protected]
    [14] => Email:
    [15] => [email protected]
]

As you can see, if I iterate and save $values[15], which is the email address in the last array, the same index in the first array ([0][15]) holds a Tel. number.

My question is: is there a simpler way than running a 'crazy loop' over the fields to always save the email as an email and not as a phone number?

I'm using GuzzleClient() along with $node->filterXPath() and/or $node->filter() depending on what I have to grab.

The HTML structure I'm working with is short and simple, as in the example below; sometimes nodes are missing:

<div id="profiledetails">
<div class="abc-g">
    <div class="abc-gf">
        <div class="abc-u first">Firm:</div>
        <div class="abc-u">
            <a href="http://example.com/123456/" title="More information here" class="Item" abc-tracker="office" abc-tracking="true">Person One</a>
        </div>
    </div>
    <div class="abc-gf">
        <div class="abc-u first">Office:</div>
        <div class="abc-u">
            <address>
                5th Av.<br>18980,<br>NY
            </address>
        </div>
    </div>
    <div class="abc-gf">
        <div class="abc-u first">City:</div>
        <div class="abc-u">New York</div>
    </div>
    <div class="abc-gf">
        <div class="abc-u first">Country:</div>
        <div class="abc-u">USA</div>
    </div>
    <div class="abc-gf">
        <div class="abc-u first">Tel:</div>
        <div class="abc-u">+123 4 567 890</div>
    </div>
    <div class="abc-gf">
        <div class="abc-u first">Fax:</div>
        <div class="abc-u">+123 4 567 891</div>
    </div>
    <div class="abc-gf">
        <div class="abc-u first">Email:</div>
        <div class="abc-u">
            <a href="mailto:[email protected]">[email protected]</a></div>
    </div>
</div>
</div>


Solution

  • After stepping away and looking at the problem with a fresh head, I found a solution that sanitises the data as needed. In the end it's just a matter of filtering the results and putting each value in the correct place in the array. Here's what I did; it works for any case when adapted to one's needs:

    $crawler->filterXPath('//*[@id="profiledetails"]/div')->each(function($node) use ($data, $start, $i) {
    
        // get the values
        $values = []; // start from an empty list on every call
        foreach($node->filter('div') as $k => $v) {
            $values[] = trim($v->nodeValue);
        }
    
        // sanitise the data
        $sanitised = [];
        foreach($values as $k => $v) {
            $v = trim($v); // trim to make sure there are no stray spaces
            if($v == 'Firm:') {
                $sanitised['firm_name'] = $values[$k + 1]; // Note: the +1 is to get the next node where the value is set
            }
            if($v == 'Office:') {
                $sanitised['address'] = $values[$k + 1];
            }
            if($v == 'City:') {
                $sanitised['city'] = $values[$k + 1];
            }
            if($v == 'Country:') {
                $sanitised['country'] = $values[$k + 1];
            }
            if($v == 'Tel:') {
                $sanitised['phone'] = $values[$k + 1];
            }
            if($v == 'Fax:') {
                $sanitised['fax'] = $values[$k + 1];
            }
            if($v == 'Email:') {
                $sanitised['email'] = $values[$k + 1];
            }
        }
    
        // missing or empty fields are stored as null
        $data['firm_name'] = !empty($sanitised['firm_name']) ? $sanitised['firm_name'] : null;
        $data['address'] = !empty($sanitised['address']) ? nl2br($sanitised['address']) : null;
        $data['city'] = !empty($sanitised['city']) ? $sanitised['city'] : null;
        $data['country'] = !empty($sanitised['country']) ? $sanitised['country'] : null;
        $data['phone'] = !empty($sanitised['phone']) ? $sanitised['phone'] : null;
        $data['fax'] = !empty($sanitised['fax']) ? $sanitised['fax'] : null;
        $data['email'] = !empty($sanitised['email']) ? $sanitised['email'] : null;
    
        // Save the data    
        ProfileModel::where('id', $i)->update($data);
        // just a console log to know where we are in case it fails on timeout
        echo "Done for profile id " . $i . PHP_EOL;    
    });
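
The chain of `if` checks above can also be written as a lookup table from label to column. This is only a minimal sketch of that idea, not part of the original code: the helper name `mapLabelledValues()` is hypothetical, and it assumes the flat list of trimmed div texts shown in the question, where each label is immediately followed by its value. Unlike the `!empty()` checks above, it treats only an empty string as missing.

```php
<?php
// Hypothetical helper (not from the original post): maps the flat list of
// div texts to named columns using a label => column lookup table.
function mapLabelledValues(array $values): array
{
    $labels = [
        'Firm:'    => 'firm_name',
        'Office:'  => 'address',
        'City:'    => 'city',
        'Country:' => 'country',
        'Tel:'     => 'phone',
        'Fax:'     => 'fax',
        'Email:'   => 'email',
    ];

    // Start with every column set to null so missing nodes stay null.
    $row = array_fill_keys(array_values($labels), null);

    foreach ($values as $k => $v) {
        $v = trim($v);
        // A label only counts if a non-empty value follows it.
        if (isset($labels[$v], $values[$k + 1]) && trim($values[$k + 1]) !== '') {
            $row[$labels[$v]] = trim($values[$k + 1]);
        }
    }

    return $row;
}

// Part of the third sample from the question: no City and no Tel.
$values = [
    'Firm: The Firm Three Name', 'Firm:', 'The Firm Three Name',
    'Office: 5th Av. 89280, NY', 'Office:', '5th Av. 89280, NY',
    'Country: USA', 'Country:', 'USA',
    'Fax: +123 4 567 899', 'Fax:', '+123 4 567 899',
];
$row = mapLabelledValues($values);
```

With this shape, supporting a new field only means adding one entry to `$labels`; the returned `$row` always has the same keys, whichever nodes were present.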
    

    For each iteration the result is always a well-formed array, even when nodes are empty or missing. It looks like this:

    [
        'firm_name' => 'Firm Name One',
        'address' => '5th Av.<br>18980,<br>NY',
        'city' => 'New York',
        'country' => 'USA',
        'phone' => '+123 4 567 890',
        'fax' => null,
        'email' => '[email protected]',
    ]
    

    Now every row in the DB gets the data (or NULL) in the right columns.
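
Another option is to skip the flat array and read each `abc-gf` row as a label/value pair, so a value is never identified by position. The following is a rough sketch under a few assumptions: it uses PHP's built-in `DOMDocument` and `DOMXPath` (the `dom` extension) instead of the crawler used above, the markup is a shortened copy of the question's example with the Tel row left out on purpose, and the `$fields` and `$phone` names are only for illustration.

```php
<?php
// Shortened copy of the question's markup; the Tel row is omitted on purpose.
$html = <<<'HTML'
<div id="profiledetails">
<div class="abc-g">
    <div class="abc-gf">
        <div class="abc-u first">Firm:</div>
        <div class="abc-u"><a href="http://example.com/123456/">Person One</a></div>
    </div>
    <div class="abc-gf">
        <div class="abc-u first">City:</div>
        <div class="abc-u">New York</div>
    </div>
    <div class="abc-gf">
        <div class="abc-u first">Fax:</div>
        <div class="abc-u">+123 4 567 891</div>
    </div>
</div>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$fields = [];
// Each abc-gf row has two child divs: the label first, then its value.
foreach ($xpath->query('//*[@id="profiledetails"]//div[@class="abc-gf"]') as $rowNode) {
    $label = trim($xpath->evaluate('string(div[1])', $rowNode));
    $value = trim($xpath->evaluate('string(div[2])', $rowNode));
    // Key by label without the trailing colon; empty values become null.
    $fields[rtrim($label, ':')] = $value !== '' ? $value : null;
}

// A row that is absent from the page simply has no key.
$phone = $fields['Tel'] ?? null;
```

Because every value is looked up by its own label, a missing Tel or Fax row cannot shift the email into the wrong column.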