Search code examples
xmlxpathxqueryxqilla

XQuery max segmentation fault


I'm trying to use XQuery/XPath with the XML version of the mondial database to find the two cities that are furthest apart from each other, only taking into account those with a population higher than 500,000. For calculating the distances, I'm assuming a Mercator projection.

Here is my code:

let $db := doc("mondial.xml")

(: Cities with 500,000+ population :)
let $cities := (for $city in $db/mondial/country//city
(: Compare using latest population data :)
where $city/population[@year = max($city/population/@year)] > 500000
return $city)

(: City pairs and the distances between them (not
not square-rooted as all we care about is order) :)
let $distancesSquared := (for $city1 in $cities
for $city2 in $cities
    where $city1 != $city2
    return <distancesquared city1="{ data($city1/name) }" city2="{ data($city2/name) }">{ (
        let $diffLong := number($city1/longitude) - number($city2/longitude)
        let $diffLat := number($city1/latitude) - number($city2/latitude)
        return $diffLong * $diffLong + $diffLat * $diffLat
    ) }</distancesquared>)

let $max := max($distancesSquared)

return $distancesSquared[data(.) = $max]

But for some reason, I'm getting a segmentation fault at this line:

let $max := max($distancesSquared)

I'm using xqilla to run the query like this:

xqilla -o query.xml query.xq

Any ideas what could be wrong with my code?


Solution

  • This is a pretty awful query because the amount of intermediate data varies quadratically with the amount of input data.

    Because the variable $distancesSquared is referenced twice, it will almost certainly be materialised in memory, and it is likely to be huge.

    First check whether the fault still happens with smaller amounts of data. If it happens even with small datasets, then it's nothing wrong with your code, it's a bug in XQuilla.

    If it happens only with large datasets, then you are probably exceeding system limits of some kind (possibly memory); check if there are any configuration parameters you can adjust.

    Better: find an improved algorithm.

    As a first step, try an algorithm that is still quadratic in time, but no longer quadratic in memory: specifically, for each city, find the city that is furthest away; build a list in the form

    <city name="London" furthest="Christchurch" distance="8000"/>
    

    etc; then find the maximum in this list of city-pairs.

    Smarter still, there are ways of solving this problem without examining all pairs of cities. I'm afraid it's a while since I studied such algorithms so I don't remember the details. But the general idea is to allocate each city to a zone, work out the extreme distances between points in each pair of zones, and then when finding the furthest cities from a given city, only process the zones that are sufficiently distant.