I'm running into a problem while scraping because the child elements are inconsistent: sometimes they are present and sometimes they are missing.
Since I'm saving state by referencing positions in the $values[] array, I found that $values[18] is sometimes the email address and other times the phone or fax number.
A sample array of three iterations looks like this:
[0] => [
[1] => Firm: The Firm One Name
[2] => Firm:
[3] => The Firm One Name
[4] => Office: 5th Av. 18980, NY
[5] => Office:
[6] => 5th Av. 18980, NY
[7] => City: New York
[8] => City:
[9] => New York
[10] => Country: USA
[11] => Country:
[12] => USA
[13] => Tel: +123 4 567 890
[14] => Tel:
[15] => +123 4 567 890
[16] => Email: [email protected]
[17] => Email:
[18] => [email protected]
],
[1] => [
[1] => Firm: The Firm Two Name
[2] => Firm:
[3] => The Firm Two Name
[4] => Office: 5th Av. 342680, NY
[5] => Office:
[6] => 5th Av. 342680, NY
[7] => City: New York
[8] => City:
[9] => New York
[10] => Country: USA
[11] => Country:
[12] => USA
[13] => Tel: +123 4 567 890
[14] => Tel:
[15] => +123 4 567 890
[16] => Fax: +123 4 567 891
[17] => Fax:
[18] => +123 4 567 891
[19] => Email: [email protected]
[20] => Email:
[21] => [email protected]
],
[2] => [
[1] => Firm: The Firm Three Name
[2] => Firm:
[3] => The Firm Three Name
[4] => Office: 5th Av. 89280, NY
[5] => Office:
[6] => 5th Av. 89280, NY
[7] => Country: USA
[8] => Country:
[9] => USA
[10] => Fax: +123 4 567 899
[11] => Fax:
[12] => +123 4 567 899
[13] => Email: [email protected]
[14] => Email:
[15] => [email protected]
]
As you can see, $values[15] in the last array is the email address, but in the first array [0][15] is a phone number.
My question: is there a simpler way than running a 'crazy loop' over the fields to always save the email as an email and not as a phone number?
I'm using GuzzleClient() along with $node->filterXPath() and/or $node->filter(), depending on what I have to grab.
The HTML structure I'm working with is short and simple, as in the example below, but sometimes nodes are missing:
<div id="profiledetails">
<div class="abc-g">
<div class="abc-gf">
<div class="abc-u first">Firm:</div>
<div class="abc-u">
<a href="http://example.com/123456/" title="More information here" class="Item" abc-tracker="office" abc-tracking="true">Person One</a>
</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Office:</div>
<div class="abc-u">
<address>
5th Av.<br>18980,<br>NY
</address>
</div>
</div>
<div class="abc-gf">
<div class="abc-u first">City:</div>
<div class="abc-u">New York</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Country:</div>
<div class="abc-u">USA</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Tel:</div>
<div class="abc-u">+123 4 567 890</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Fax:</div>
<div class="abc-u">+123 4 567 891</div>
</div>
<div class="abc-gf">
<div class="abc-u first">Email:</div>
<div class="abc-u">
<a href="mailto:[email protected]">[email protected]</a></div>
</div>
</div>
</div>
After taking a break and looking at the problem with fresh eyes, I found a solution that gets the data sanitised as needed. In the end it's just a matter of filtering the results and putting each value in the correct place in the array. Here's what I did; it works for any case (when adapted to one's needs):
$crawler->filterXPath('//*[@id="profiledetails"]/div')->each(function($node) use ($data, $start, $i) {
    // get the values
    $values = [];
    foreach($node->filter('div') as $k => $v) {
        $values[] = trim($v->nodeValue); // trim to make sure there are no surrounding spaces
    }
    // sanitise the data
    $sanitised = [];
    foreach($values as $k => $v) {
        // Note: the +1 is to get the next node, where the value is set
        $next = $values[$k + 1] ?? null;
        if($v == 'Firm:') {
            $sanitised['firm_name'] = $next;
        }
        if($v == 'Office:') {
            $sanitised['address'] = $next;
        }
        if($v == 'City:') {
            $sanitised['city'] = $next;
        }
        if($v == 'Country:') {
            $sanitised['country'] = $next;
        }
        if($v == 'Tel:') {
            $sanitised['phone'] = $next;
        }
        if($v == 'Fax:') {
            $sanitised['fax'] = $next;
        }
        if($v == 'Email:') {
            $sanitised['email'] = $next;
        }
    }
    // missing or empty fields are stored as null
    $data['firm_name'] = !empty($sanitised['firm_name']) ? $sanitised['firm_name'] : null;
    $data['address'] = !empty($sanitised['address']) ? nl2br($sanitised['address']) : null;
    $data['city'] = !empty($sanitised['city']) ? $sanitised['city'] : null;
    $data['country'] = !empty($sanitised['country']) ? $sanitised['country'] : null;
    $data['phone'] = !empty($sanitised['phone']) ? $sanitised['phone'] : null;
    $data['fax'] = !empty($sanitised['fax']) ? $sanitised['fax'] : null;
    $data['email'] = !empty($sanitised['email']) ? $sanitised['email'] : null;
    // Save the data
    ProfileModel::where('id', $i)->update($data);
    // just a console log to know where we are in case it fails on timeout
    echo "Done for profile id " . $i . PHP_EOL;
});
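The same label matching can also be written with a lookup table instead of one if per field. This is only a minimal sketch under some assumptions: the helper name mapLabelledValues() and the $map variable are made up for illustration, and the input is assumed to be a flat list of strings shaped like the sample arrays in the question, where each label is immediately followed by its value.

```php
<?php
// Hypothetical helper (not part of the original code): turns the flat list of
// scraped strings into named fields, keyed by the label that precedes each value.
function mapLabelledValues(array $values): array
{
    // Label text => target field; mirrors the labels used in the if-chain.
    $map = [
        'Firm:'    => 'firm_name',
        'Office:'  => 'address',
        'City:'    => 'city',
        'Country:' => 'country',
        'Tel:'     => 'phone',
        'Fax:'     => 'fax',
        'Email:'   => 'email',
    ];

    // Start with every field set to null so missing nodes stay null.
    $fields = array_fill_keys(array_values($map), null);

    // Re-index so that $values[$k + 1] is always the next element.
    $values = array_values($values);

    foreach ($values as $k => $v) {
        $label = trim($v);
        if (array_key_exists($label, $map) && isset($values[$k + 1])) {
            $next = trim($values[$k + 1]);
            // An empty value node is treated the same as a missing one.
            $fields[$map[$label]] = $next !== '' ? $next : null;
        }
    }

    return $fields;
}
```

Inside the closure this would be used as $data = mapLabelledValues($values); supporting a new label then only means adding one entry to $map.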
For each iteration, the result is always a correctly keyed array, even when nodes are empty or missing. It looks like this:
[
'firm_name' => 'Firm Name One',
'address' => '5th Av.<br>18980,<br>NY',
'city' => 'New York',
'country' => 'USA',
'phone' => '+123 4 567 890',
'fax' => null,
'email' => '[email protected]',
]
And now every single row in the DB gets the data (or NULL) in the right columns.
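Another option is to avoid the flat list altogether and read each label together with its value, row by row. The sketch below uses PHP's built-in DOMDocument and DOMXPath rather than the crawler, so it runs on its own. It is an illustration only: the function name extractProfileFields() is made up, and it assumes the container id is profiledetails (as in the XPath above) and that every row is a div with class abc-gf holding a label div followed by a value div.

```php
<?php
// Hypothetical helper (an assumption, not the original code): pairs every label
// with the value in the same row, so missing rows cannot shift other fields.
function extractProfileFields(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings about imperfect markup instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath  = new DOMXPath($doc);
    $fields = [];

    // Exact class match is enough for the sample markup; adjust if rows carry extra classes.
    $rows = $xpath->query('//div[@id="profiledetails"]//div[@class="abc-gf"]');

    foreach ($rows as $row) {
        // Direct child divs of the row: first the label, then the value.
        $cells = $xpath->query('./div', $row);
        if ($cells->length < 2) {
            continue; // row without a value cell
        }
        // "Tel:" becomes the key "Tel".
        $label = rtrim(trim($cells->item(0)->textContent), ':');
        // Collapse runs of whitespace inside the value.
        $value = trim(preg_replace('/\s+/', ' ', $cells->item(1)->textContent));
        $fields[$label] = $value !== '' ? $value : null;
    }

    return $fields;
}
```

The returned array is keyed by the label text (Firm, Office, Tel and so on), so a lookup such as $fields['Tel'] ?? null gives null whenever that row is absent from the page.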