PHP - Processing a Screen Scraped Page


Following previous topics here, I can successfully scrape a webpage using cURL and PHP. That part works fine. What I need to do now is process some information from the page that has no identifiable classes or markup I can easily hook into. The example markup is:

<h3>Building details:</h3>
<p>Disabled ramp access<br />
  Male, female and disabled toilets available</p>
  <br/>
  <p><strong>Appointment lead times:</strong></p>
  <p><strong>Type 1</strong>:&nbsp; 8 weeks<br />
  <strong>Type 2</strong>:&nbsp;5 weeks<br />
  <strong>Type 3</strong>:&nbsp;3 weeks<br />
  <strong>Type 4</strong>:&nbsp;3 weeks
</p>

What I need is the lead time in weeks for the different appointment types, mainly Type 1. Sometimes the lead times are unavailable, and the page instead states:

<p><strong>Appointment lead times:</strong></p>
<p><strong>Type 1</strong>:&nbsp; No information available<br />

I have looked at several methods (RegEx, Simple DOM Parser, etc.) but haven't found a solution for what I am trying to achieve.

Many thanks.


Solution

  • This kind of thing can get messy. You have to find some point in the markup where you can break it apart reliably. Your sample has one such spot that I can see: Type 1</strong>:&nbsp;. So, I would do this:

    $parts = explode('Type 1</strong>:&nbsp;', $text);

    Now, the first bit of $parts[1] will have either your timeframe, or the no information message. Let's use the <br /> at the end to chop it:

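    // Exactly two pieces means the marker was found once.
    if (count($parts) == 2) {
      // Keep only the text before the first <br />.
      $parts = explode('<br />', $parts[1]);
      // Strip the unit and surrounding whitespace, e.g. ' 8 weeks' becomes '8'.
      $parts = trim(str_replace(' weeks', '', $parts[0]));
    }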

    Now $parts holds either the message or the lead time as a numeric string, and is_numeric() tells you which. (If the marker was not found, $parts is still an array, so check for that first.) This is a dirty method, but scraping page data usually is. Be sure to check the result of each step before assuming you're good for the next.
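
    The steps above can be wrapped in a small function. This is a minimal sketch rather than a definitive implementation: leadTimeWeeks is a hypothetical name, the nullable return type assumes PHP 7.1 or later, and the sample $html is cut down from the question. Unlike the snippet above, it chops at the next '<' instead of at '<br />', so the last entry (which has no <br />) and a '<br/>' variant are handled as well.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the lead
// time in weeks as an int, or null when no numeric value can be found.
function leadTimeWeeks(string $html, string $type): ?int
{
    // Split on the one reliable marker, e.g. 'Type 1</strong>:&nbsp;'.
    $parts = explode($type . '</strong>:&nbsp;', $html);
    if (count($parts) < 2) {
        return null; // marker not found; the page layout may have changed
    }

    // Keep only the text before the next tag, then strip the unit.
    $value = explode('<', $parts[1])[0];
    $value = trim(str_replace(' weeks', '', $value));

    // "No information available" is not numeric, so it ends up as null.
    return is_numeric($value) ? (int) $value : null;
}

// Sample markup, reduced from the question.
$html = '<p><strong>Type 1</strong>:&nbsp; 8 weeks<br />'
      . '<strong>Type 2</strong>:&nbsp;No information available<br /></p>';

var_dump(leadTimeWeeks($html, 'Type 1')); // int(8)
var_dump(leadTimeWeeks($html, 'Type 2')); // NULL
var_dump(leadTimeWeeks($html, 'Type 9')); // NULL
```

    Returning null for both "marker missing" and "no information available" keeps the caller simple. If you need to tell those two cases apart, return the raw string instead and run is_numeric() at the call site.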