Scraping data from data.gov.uk / Regular Expression


I'm trying to work out what the regular expression is that I should be using in order to scrape some data from the gov.uk website.

Basically, I am calling file_get_contents on the following URL:

https://www.compare-school-performance.service.gov.uk/?keywords=[SCHOOL-NAME]&suggestionurn=&searchtype=search-by-name

For example, use The+Castle+School in place of [SCHOOL-NAME].

This returns 4 results. I want to capture the School ID, the School Name, and the School Address for every result returned. There may be multiple pages of results, so it's important to scrape all of them.

I've been trying to use RegExBuddy to do this but I can't get it to work.

The data returned in respect of each result is fairly consistent as follows:-

 <li class="document">
                <div>
                    <h3>
                        <a class="bold-small" href="/school/110182">The Castle School</a>
                    </h3>
                    <div class="comparsion-button-container">
                        <div id="JsAddRemoveError" class="optional-section no-js-hidden">
                            <p class="error-message">An error had occurred whilst trying to add or remove this school or college to comparison. Try again now or later.</p>
                        </div>
<a class="button button-comparison button-comparison-add" id="AddComparison110182" href="/addCompare/110182/searchResults/find-a-school-in-england?keywords=The+Castle+School&amp;suggestionurn=&amp;searchtype=search-by-name"
   data-js-url="/add-to-comparison-js/110182/searchResults">Add <span class="visuallyhidden">The Castle School </span>to comparison list</a>
                    </div>
                </div>

<dl class="metadata">


    <dt>Address<span aria-hidden="true">:</span></dt>
    <dd>Love Lane, Newbury, RG14 2JG</dd>

    <dt class="visuallyhidden">Phase of education<span aria-hidden="true">:</span></dt>
    <dd>Primary, Secondary and 16 to 18</dd>

        <dt>School type<span aria-hidden="true">:</span></dt>
            <dd>Special School</dd>


    <dt>Ofsted rating<span aria-hidden="true">:</span></dt>
    <dd>
        <span class="rating rating-1" aria-hidden="true">
            <span class="rating-text">
                1
            </span>
        </span>
        Outstanding
            <span class="rating-date">
                <span><span aria-hidden="true">(</span>Last inspection<span aria-hidden="true">:</span></span>
                <span>
                    <time datetime="2014-10-08">08 October 2014</time><span aria-hidden="true">)</span>
                </span>
            </span>
    </dd>


</dl>

<div style="clear: both;"></div>

Each result is encapsulated inside a

<li class="document">

and the school name and school ID are found here:-

<a class="bold-small" href="/school/110182">The Castle School</a>

In this instance the school ID is 110182, the school name is The Castle School.

The address is always found between:-

<dd>Love Lane, Newbury, RG14 2JG</dd>

For an example of a result set that returns more than 1 page of results, you can use the word "Grammar"

I realise this is a big ask, but I have been trying to build the right regular expression in RegExBuddy and can't seem to find one that works.

If you have a better idea of a way to scrape the required information please let me know. I know they provide their data for download, however I don't want to do this as it would then involve storing that data and constantly updating it - whereas the data on their website will always be the most up to date.

Thanks.

EDIT: See Jan's answer with my comment. Very impressive answer.


Solution

  • As always, use a combination of parsing and regular expressions:

    <?php
    
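    # urlencode() turns the spaces in the search term into "+", as the site expects
    $url = 'https://www.compare-school-performance.service.gov.uk/?keywords=' . urlencode('The Castle School') . '&suggestionurn=&searchtype=search-by-name';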
    
    $previous_value = libxml_use_internal_errors(TRUE);
    
    $dom = new DOMDocument();
    $dom->loadHTMLFile($url);
    
    $xpath = new DOMXPath($dom);
    
    # regex part
    $regex = '~(?P<id>\d+)$~';
    
    # here comes the main part
    $schools = $xpath->query("//ul[@class = 'school-results-listing']//li");
    foreach($schools as $school) {
        $name = $xpath->query(".//h3/a/text()", $school)->item(0)->nodeValue;
        preg_match($regex, $xpath->query(".//h3/a/@href", $school)->item(0)->nodeValue, $match);
        $id = $match["id"];
    
        $address = $xpath->query(".//dl[@class = 'metadata']//dd/text()", $school)->item(0)->nodeValue;
        echo "Name: {$name}, ID: {$id}, Address: {$address} \n"; 
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous_value);
    
    ?>
    

    This loads the document into a DOM, traverses it with XPath, and extracts the wanted information, using a simple regular expression only for the ID at the end of the href.
    Do NOT run regular expressions against the raw HTML directly.
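
To try the approach without hitting the live site, here is a minimal sketch that runs the same kind of XPath queries against a trimmed copy of the markup from the question. The function name `extract_schools` and the `$sample` string are made up for illustration and are not part of the original answer. It also assumes that each result is an `<li class="document">` and that the address is the `<dd>` following the "Address" `<dt>`, as in the snippet above. The null checks are an addition so that a list item without a link is skipped rather than causing an error.

```php
<?php
// Hypothetical helper: parse one page of search results from an HTML string.
function extract_schools(string $html): array
{
    // Collect libxml warnings quietly; real-world HTML is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $schools = [];

    foreach ($xpath->query("//li[@class = 'document']") as $li) {
        // The link inside <h3> carries both the name and the ID.
        $link = $xpath->query(".//h3/a", $li)->item(0);
        if ($link === null) {
            continue;
        }
        // The ID is the run of digits at the end of the href.
        if (!preg_match('~(?P<id>\d+)$~', $link->getAttribute('href'), $match)) {
            continue;
        }
        // Take the <dd> that follows the "Address" <dt>, not just the first <dd>.
        $address = $xpath->query(
            ".//dt[starts-with(normalize-space(.), 'Address')]/following-sibling::dd[1]",
            $li
        )->item(0);

        $schools[] = [
            'id'      => $match['id'],
            'name'    => trim($link->textContent),
            'address' => $address === null ? null : trim($address->textContent),
        ];
    }

    return $schools;
}

// Trimmed sample based on the markup quoted in the question.
$sample = <<<'HTML'
<ul class="school-results-listing">
<li class="document">
<div><h3><a class="bold-small" href="/school/110182">The Castle School</a></h3></div>
<dl class="metadata">
<dt>Address<span aria-hidden="true">:</span></dt>
<dd>Love Lane, Newbury, RG14 2JG</dd>
<dt>School type<span aria-hidden="true">:</span></dt>
<dd>Special School</dd>
</dl>
</li>
</ul>
HTML;

print_r(extract_schools($sample));
```

For the sample above this yields one entry with the ID 110182, the name "The Castle School" and the address "Love Lane, Newbury, RG14 2JG". Matching the address by its `<dt>` label rather than by position is a design choice that should keep working if the site reorders the metadata fields.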