How to remove duplicates in a SPARQL query


I wrote this query, which returns a list of couples that satisfy a particular condition (run at http://live.dbpedia.org/sparql):

SELECT DISTINCT ?actor ?person2 ?cnt
WHERE
{
{
    select DISTINCT ?actor ?person2 (count (?film) as ?cnt) 
    where { 
        ?film    dbo:starring ?actor .
        ?actor dbo:spouse ?person2. 
        ?film    dbo:starring ?person2.
    }
    order by ?actor
}
FILTER (?cnt >9)
}

The problem is that some rows are duplicated. For example:

http://dbpedia.org/resource/George_Burns http://dbpedia.org/resource/Gracie_Allen 12

http://dbpedia.org/resource/Gracie_Allen http://dbpedia.org/resource/George_Burns 12

How can I remove these duplicates? I tried adding a gender constraint on ?actor, but that breaks the current results.


Solution

  • Natan Cox's answer shows the typical way to exclude this kind of pseudo-duplicate. The results aren't actually duplicates, because in one row George Burns is the ?actor, and in the other he is the ?person2. In many cases, you can add a filter that requires the two values to be ordered, and that removes the mirrored rows. E.g., when you have data like:

    :a :likes :b .
    :a :likes :c .
    

    and you search for

    select ?x ?y where { 
      :a :likes ?x, ?y .
    }
    

    you can add filter(?x < ?y) to enforce an ordering between ?x and ?y, which removes these pseudo-duplicates (and also the rows where ?x and ?y are the same value). However, in this case, it's a bit trickier, since ?actor and ?person2 aren't found using the same criteria. If DBpedia contains

    :PersonB dbo:spouse :PersonA
    

    but not

    :PersonA dbo:spouse :PersonB
    

    then the simple filter won't work, because you'll never find the triple where the subject PersonA is less than the object PersonB. So in this case, you also need to modify your query a bit to make the criteria symmetric:

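    # count(distinct ...) because the alternation below can match a pair
    # twice when the spouse link is declared in both directions
    select distinct ?actor ?spouse (count(distinct ?film) as ?count) {
      ?film dbo:starring ?actor, ?spouse .
      ?actor dbo:spouse|^dbo:spouse ?spouse .
      filter(?actor < ?spouse)
    }
    group by ?actor ?spouse
    having (count(distinct ?film) > 9)
    order by ?actor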
    

    (This query also shows that you don't need a subquery here; you can use having to "filter" on aggregate values.) But the important part is using the property path dbo:spouse|^dbo:spouse to find a value for ?spouse such that either ?actor dbo:spouse ?spouse or ?spouse dbo:spouse ?actor. This makes the relationship symmetric, so you're guaranteed to get all the pairs, even if the relationship is only declared in one direction.
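
    One portability caveat, offered as a sketch rather than a definitive fix: as far as I know, the SPARQL 1.1 operator table does not define < for IRIs, so filter(?actor < ?spouse) relies on endpoint-specific behavior. DBpedia's Virtuoso accepts it, but a stricter engine may treat the comparison as an error and return no rows. Comparing the IRI strings with str() avoids that. The variant below also spells out the dbo: prefix (predefined on the DBpedia endpoint, but not necessarily elsewhere) and counts distinct films, on the assumption that the alternation can match a couple twice when the spouse link is declared in both directions:

    prefix dbo: <http://dbpedia.org/ontology/>

    select ?actor ?spouse (count(distinct ?film) as ?count) {
      # both people star in the same film
      ?film dbo:starring ?actor, ?spouse .
      # spouse link in either direction
      ?actor dbo:spouse|^dbo:spouse ?spouse .
      # keep one row per couple by ordering the IRI strings
      filter(str(?actor) < str(?spouse))
    }
    group by ?actor ?spouse
    having (count(distinct ?film) > 9)
    order by ?actor

    Because the filter is a strict less-than, each couple appears once, with whichever IRI sorts first as a string in the ?actor column.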