
Failed to load external entity on simplexml_load_file at Openstreetmap


I recently checked one of our websites and realized that the search for postal code wasn't working anymore.

I get the following error:

'Failed to load external entity'

If instead I use simplexml_load_string() I receive

'Start tag expected, '<' not found'.

This is the code I'm using:

libxml_use_internal_errors(true);
$xml = simplexml_load_file('https://nominatim.openstreetmap.org/search?postalcode=28217&country=DE&format=xml&polygon=1&addressdetails=1&boundary=postalcode');
if (false === $xml) {
    $errors = libxml_get_errors();
    var_dump($errors);
}

I read somewhere that it might have something to do with HTTP headers, but I did not find any useful information on this.


Solution

  • OSM Nominatim's usage policy states that you need to provide a User-Agent or HTTP-Referer request header to identify the application. As such, using a user-agent that masquerades as an end-user browser is really not great etiquette.

    You can find the usage policy here. It also says that the default values used by HTTP libraries (like the one simplexml_load_file() uses) are not acceptable.

    You say you are using simplexml_load_string(), but you don't say how you are getting the XML into that function. The most likely scenario is that whichever method you use to fetch the XML also omits the mandatory headers.

    As such, I'd create the request using php-curl, provide one of these headers to identify your app, and parse the resulting XML string with simplexml_load_string().

    E.g.:

    // setup variables
    $nominatim_url = 'https://nominatim.openstreetmap.org/search?postalcode=28217&country=DE&format=xml&polygon=1&addressdetails=1&boundary=postalcode';
    $user_agent    = 'ID_Identifying_Your_App v100';
    $http_referer  = 'http://www.urltoyourapplication.com';
    $timeout       = 10;
    
    // curl initialization
    $ch         = curl_init();
    curl_setopt($ch, CURLOPT_URL, $nominatim_url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    
    // these are the bits you are missing
    // Setting curl's user-agent
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    // you can also use this one (http-referer), it's up to you. Either one or both.
    curl_setopt($ch, CURLOPT_REFERER, $http_referer);
    
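    // get the XML; curl_exec() returns false if the request fails
    $data = curl_exec($ch);
    if (false === $data) {
        var_dump(curl_error($ch));
    }
    curl_close($ch);

    // collect libxml errors so libxml_get_errors() below has something to report
    libxml_use_internal_errors(true);
    // load it in simplexml (skipped if the request itself failed)
    $xml = (false === $data) ? false : simplexml_load_string($data);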
    // This was your code, left as it was
    if (false === $xml) {
        $errors = libxml_get_errors();
        var_dump($errors);
    }
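
If php-curl is not available, the same identifying header can probably be sent through PHP's HTTP stream wrapper instead. The following is only a minimal sketch, assuming allow_url_fopen is enabled; the constant APP_USER_AGENT, its value, and the helper names build_nominatim_context() and fetch_xml() are hypothetical and not part of the original answer.

```php
<?php
// Hypothetical value: replace it with something that identifies your own application.
const APP_USER_AGENT = 'ExampleApp/1.0 (admin@example.com)';

/**
 * Build a stream context that sends an identifying User-Agent header.
 * (Helper name is an assumption made for this sketch.)
 */
function build_nominatim_context(string $userAgent, int $timeout = 10)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent, // sent as the User-Agent request header
            'timeout'    => $timeout,   // read timeout in seconds
        ],
    ]);
}

/**
 * Fetch a URL with the given context and parse the body as XML.
 * Returns a SimpleXMLElement, or false if fetching or parsing fails.
 */
function fetch_xml(string $url, $context)
{
    // @ suppresses the warning; failure is reported through the return value.
    $body = @file_get_contents($url, false, $context);
    if (false === $body) {
        return false;
    }
    // Collect parse errors for libxml_get_errors() instead of emitting warnings.
    libxml_use_internal_errors(true);
    return simplexml_load_string($body);
}

$context = build_nominatim_context(APP_USER_AGENT);

// Usage (performs a real network request, so it is left commented out):
// $xml = fetch_xml('https://nominatim.openstreetmap.org/search?postalcode=28217&country=DE&format=xml', $context);
// if (false === $xml) {
//     var_dump(libxml_get_errors());
// }
```

The curl version above gives more control over timeouts and error reporting; this variant only avoids the extra extension.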