Search code examples
groupingsparqllimitdata-cube

Select top X results per group


I have a bunch of RDF Data Cube observations that have an attached attribute, in my case the date on when that value was recorded.

The pattern is simple, for example (leaving out other dimension/measure/attributes):

<obs1> a qb:Observation ;
  my:lastupdate '2017-12-31'^^xsd:date ;
  qb:dataSet <dataSet1> .

<obs2> a qb:Observation ;
  my:lastupdate '2016-12-31'^^xsd:date ;
  qb:dataSet <dataSet1> .

<obs2_1> a qb:Observation ;
  my:lastupdate '2017-12-31'^^xsd:date ;
  qb:dataSet <dataSet2> .

<obs2_2> a qb:Observation ;
  my:lastupdate '2015-12-31'^^xsd:date ;
  qb:dataSet <dataSet2> .

So I have multiple qb:DataSet in my store. Now I would like to figure out the last X my:lastupdate values per dataset. Let's say I want the last 5 values, for each particular DataSet.

I can do that very easily for one particular dataset:

SELECT * WHERE {

  ?observation my:lastupdate ?datenstand ;
                 qb:dataSet <dataSet1>                                                                                                                                            
} ORDER BY DESC(?datenstand) LIMIT 5    

But I'm a bit lost if this is at all possible within a single SPARQL query, per dataset. I tried various combination with sub-selects, LIMIT & GROUP BY combinations but nothing lead to the result I am looking for.


Solution

  • This query pattern was discussed at length on the now defunct SemanticOverflow Q+A site as 'get the 3 largest cities for each country' and the general consensus was that queries in the form 'get the top n related items for each master item' are not manageable with a single SPARQL query in an efficient way.

    The core issue is that nested queries are evaluated bottom-up and GROUP/LIMIT clauses will apply to the whole result set rather than to each group.

    The only useful exception to the bottom-up rule are (not) exists filters, which have visibility on current bindings. You can take advantage of this fact to write queries like:

    select ?country ?city ?population where {
    
        ?country a :Country; :city ?city.
        ?city :population ?population.
    
        filter not exists { select * where {
    
            ?country :city ?_city.
            ?_city :population ?_population.
    
            filter ( ?_population > ?population )
    
        } offset 3 }
    
    } order by ?country desc(?population)
    

    Unfortunately this approach is not usually viable on large real-world datasets, as it involves scanning and filtering the cartesian product of each country/city group.