Search code examples
arangodbaql

Return distinct and sorted query in AQL


So I have two collections, one with cities with an array of postal codes as a property and one with postal codes and their latitude & longitude.

I want to return the cities closest to a coordinate. This is easy enough with a geo index but the issue I'm having is the same city being returned multiple times and some times it can be the 1st and 3rd closest because the postal code that I'm searching in bordering another city.

cities example data:

[
  {
    "_key": "30936019",
    "_id": "cities/30936019",
    "_rev": "30936019",
    "countryCode": "US",
    "label": "Colorado Springs, CO",
    "name": "Colorado Springs",
    "postalCodes": [
      "80904",
      "80927"
    ],
    "region": "CO"
  },
  {
    "_key": "30983621",
    "_id": "cities/30983621",
    "_rev": "30983621",
    "countryCode": "US",
    "label": "Manitou Springs, CO",
    "name": "Manitou Springs",
    "postalCodes": [
      "80829"
    ],
    "region": "CO"
  }
]

postalCodes example data:

[
  {
    "_key": "32132856",
    "_id": "postalCodes/32132856",
    "_rev": "32132856",
    "countryCode": "US",
    "location": [
      38.9286,
      -104.6583
    ],
    "postalCode": "80927"
  },
  {
    "_key": "32147422",
    "_id": "postalCodes/32147422",
    "_rev": "32147422",
    "countryCode": "US",
    "location": [
      38.8533,
      -104.8595
    ],
    "postalCode": "80904"
  },
  {
    "_key": "32172144",
    "_id": "postalCodes/32172144",
    "_rev": "32172144",
    "countryCode": "US",
    "location": [
      38.855,
      -104.9058
    ],
    "postalCode": "80829"
  }
]

The following query works but as an ArangoDB newbie I'm wondering if there's a more efficient way to do this:

FOR p IN WITHIN(postalCodes, 38.8609, -104.8734, 30000, 'distance')
    FOR c IN cities
        FILTER p.postalCode IN c.postalCodes AND c.countryCode == p.countryCode
        COLLECT close = c._id AGGREGATE distance = MIN(p.distance)
        FOR c2 IN cities
            FILTER c2._id == close
            SORT distance
            RETURN c2

Solution

  • The first FOR in the query will use the geo index and probably return few documents (just the postal codes around the specified location). The second FOR will look up the city for each found postal code. This may be an issue, depending on whether there is an index present on cities.postalCodes and cities.countryCode. If not, then the second FOR has to do a full scan of the cities collection each time it is involved. This will be inefficient. It may therefore be create an index on the two attributes like this:

    db.cities.ensureIndex({ type: "hash", fields: ["countryCode", "postalCodes[*]"] });

    The third FOR can be removed entirely when not COLLECTing by c._id but by c:

    FOR p IN WITHIN(postalCodes, 38.8609, -104.8734, 30000, 'distance')
      FOR c IN cities
        FILTER p.postalCode IN c.postalCodes AND c.countryCode == p.countryCode
        COLLECT city = c AGGREGATE distance = MIN(p.distance)
        SORT distance
        RETURN city
    

    This will shorten the query string, but it may not help efficiency much I think, as the third FOR will use the primary index to look up the city documents, which is O(1).

    In general, when in doubt about a query using indexes, you can use db._explain(queryString) to show which indexes will be used by a query.