Removing duplicates: unexpected SPARQL SAMPLE behaviour, missing result


I'm querying the IdRef SPARQL endpoint to get a researcher's co-authors. To get more complete results, I'm running a federated query against the HAL endpoint.

My query works pretty well but generates duplicates, which I aim to remove using authority identifiers (ORCID, ISNI or whatever).

So far I have arrived at the following query, but now one result is missing.

My query is:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT distinct ?aut ?auturi
WHERE {
  SELECT distinct (SAMPLE(?auturi) AS ?auturi) (SAMPLE(?aut) AS ?aut) ?ids
  WHERE {
    {
      ?uri ?rel <http://www.idref.fr/139753753/id>. #entities our author has a link with
      ?uri ?relcontrib ?auturi. #other with a link to these entities
      ?auturi a foaf:Person. #filter for persons
      ?auturi skos:prefLabel ?aut. #get authors' name
      FILTER (?auturi != <http://www.idref.fr/139753753/id>) #exclude the same author we're querying
      OPTIONAL {
        ?auturi owl:sameAs ?ids. #get authors' identifiers
      }
    } UNION {
      <http://www.idref.fr/139753753/id> owl:sameAs ?id.
      FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
      BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
      SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
        ?idHal foaf:publications ?uri. #same as above
        ?auturi foaf:publications ?uri.
        ?auturi foaf:name ?aut.
        FILTER (?idHal != ?auturi)
        OPTIONAL {
          ?auturi owl:sameAs ?ids.
        }
      }
    }
  }
}

As you can see, I'm using a subquery with SAMPLE to perform the de-duplication, but it doesn't work as I'd expect: one result is stripped away. The un-sampled subquery returns an extra result matching this URI: https://data.archives-ouvertes.fr/author/marie-masclet-de-barbarin.rdf

At first I thought it was because this result had no matching owl:sameAs object, but another result in the set has none either and yet appears in the final result set.

I'm quite puzzled by this behaviour and I suspect it is because I don't fully understand how SAMPLE works. Maybe there is a more accurate way to achieve what I'm looking for.

Edit: results (with duplicates) are as follows:

# auturi  aut
1   http://www.idref.fr/057577889/id Lantenois, Annick (1956-....)
2   http://www.idref.fr/033888760/id Cubaud, Pierre
3   http://www.idref.fr/028984838/id Suber, Peter
4   http://www.idref.fr/165836652/id Cramer, Florian (1969-....)
5   http://www.idref.fr/050447823/id Mounier, Pierre (1970-....)
6   http://www.idref.fr/174428006/id Ena, Alexandra (19..-....)
7   http://www.idref.fr/052212807/id Lebert, Marie
8   https://data.archives-ouvertes.fr/author/pierre-mounier Pierre Mounier
9   https://data.archives-ouvertes.fr/author/patrice-bellot Patrice Bellot
10 https://data.archives-ouvertes.fr/author/marlene-delhaye Marlène Delhaye
11 https://data.archives-ouvertes.fr/author/denis-bertin Denis Bertin
12 https://data.archives-ouvertes.fr/author/emma-bester Emma Bester
13 https://data.archives-ouvertes.fr/author/marie-masclet-de-barbarin Marie Masclet de Barbarin

Basically the only duplicates are #5 and #8. They can be identified as such because they share a common ?ids object (not shown in the results here for clarity).


Solution

  • Marie Masclet de Barbarin is hidden precisely because there is another person, Emma Bester, who also does not have an owl:sameAs edge. Consider this query:

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    
    SELECT DISTINCT ?auturi ?aut ?ids
    WHERE {
      <http://www.idref.fr/139753753/id> owl:sameAs ?id.
      FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
      BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
      SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
        ?idHal foaf:publications ?uri. #same as above
        ?auturi foaf:publications ?uri.
        ?auturi foaf:name ?aut.
        FILTER (?idHal != ?auturi)
        OPTIONAL {
          ?auturi owl:sameAs ?ids.
        }
      }
    }
    

    This yields 12 results.

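    Notice that many of these people have multiple owl:sameAs values, all distinct from one another. Marie and Emma, however, have none, so ?ids is left unbound (it shows up as an empty, 'null' value) for both of them.

    So, when sampling the author name and URI (grouping by ?ids), we can use the following query. It has no explicit GROUP BY: the endpoint accepts this and apparently groups on the non-aggregated variable ?ids, although standard SPARQL 1.1 would require GROUP BY ?ids here.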

    SELECT DISTINCT (SAMPLE(?auturi) AS ?auturi) (SAMPLE(?aut) AS ?aut) ?ids
    WHERE {
      <http://www.idref.fr/139753753/id> owl:sameAs ?id.
      FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
      BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
      SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
        ?idHal foaf:publications ?uri. #same as above
        ?auturi foaf:publications ?uri.
        ?auturi foaf:name ?aut.
        FILTER (?idHal != ?auturi)
        OPTIONAL {
          ?auturi owl:sameAs ?ids.
        }
      }
    }
    

    This has only 11 results, however, with Marie missing.

    Why? Because ?ids is unbound for two separate authors, both of them land in the same group. SAMPLE returns only one author per group, so the other one is dropped.
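
The grouping effect described above can be mimicked outside SPARQL. The following is a minimal Python sketch, not what the triplestore actually runs. The rows are hypothetical (the example.org identifier is a placeholder), an unbound ?ids is modelled as None, and SAMPLE is modelled as "keep the first row seen", which is only one of the choices an engine is allowed to make.

```python
# Hypothetical rows (auturi, aut, ids). None models an unbound ?ids;
# the example.org identifier is a placeholder, not real data.
ROWS = [
    ("https://data.archives-ouvertes.fr/author/denis-bertin",
     "Denis Bertin", "http://example.org/id/denis-bertin"),
    ("https://data.archives-ouvertes.fr/author/emma-bester",
     "Emma Bester", None),
    ("https://data.archives-ouvertes.fr/author/marie-masclet-de-barbarin",
     "Marie Masclet de Barbarin", None),
]


def sample_per_ids(rows):
    """Group rows by ids and keep one (auturi, aut) pair per group.

    SAMPLE may return any member of a group; keeping the first row seen
    is an assumption made here to get a repeatable result.
    """
    groups = {}
    for auturi, aut, ids in rows:
        groups.setdefault(ids, (auturi, aut))
    return groups


sampled = sample_per_ids(ROWS)
kept_names = sorted(aut for _auturi, aut in sampled.values())
print(kept_names)  # Marie is absent: she shares the None group with Emma
```

Three input rows produce only two groups, because both identifier-less authors share the same key.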

    So why is Marie skipped 100% of the time and not 50%? SAMPLE returns an arbitrary value from the group, not a random one, and the SPARQL specification leaves the choice to the implementation. Most likely the choice here depends on the order in which the triples were loaded into the store, so the result is stable for a given loading sequence. If you loaded the same data on a different machine, possibly with a different triplestore, Emma might be the one that is skipped.

    How to solve this? The hard part is that Pierre Mounier exists as two almost separate entities, with two ?auturi values and two different name strings, "Pierre Mounier" and "Mounier, Pierre (1970-....)". Thus the obvious solution of sampling ?auturi and grouping by ?aut would show Marie, but would not deduplicate Pierre.
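
To see why grouping by ?aut falls short, here is a small Python sketch of that alternative. It is an illustration only: the rows are hypothetical, the shared identifier is a placeholder, and grouping is modelled with a plain dictionary keyed on the name string.

```python
# Hypothetical rows (auturi, aut, ids); SHARED_ID is a placeholder value.
SHARED_ID = "http://example.org/id/pierre-mounier"
ROWS = [
    ("http://www.idref.fr/050447823/id",
     "Mounier, Pierre (1970-....)", SHARED_ID),
    ("https://data.archives-ouvertes.fr/author/pierre-mounier",
     "Pierre Mounier", SHARED_ID),
    ("https://data.archives-ouvertes.fr/author/emma-bester",
     "Emma Bester", None),
    ("https://data.archives-ouvertes.fr/author/marie-masclet-de-barbarin",
     "Marie Masclet de Barbarin", None),
]


def sample_per_name(rows):
    """Group rows by the name string and keep one auturi per name."""
    groups = {}
    for auturi, aut, _ids in rows:
        groups.setdefault(aut, auturi)
    return groups


by_name = sample_per_name(ROWS)
print(len(by_name))  # every name string is its own group
```

Marie survives, but the two spellings of Pierre's name form two groups, so the duplicate stays.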

    A better solution is to use COALESCE to give each author without an identifier a distinct fallback value, instead of leaving ?ids unbound for both. This is done like this:

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    
    SELECT DISTINCT ?auturi ?aut ?idsClean
    WHERE {
      <http://www.idref.fr/139753753/id> owl:sameAs ?id.
      FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
      BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
      SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
        ?idHal foaf:publications ?uri. #same as above
        ?auturi foaf:publications ?uri.
        ?auturi foaf:name ?aut.
        FILTER (?idHal != ?auturi)
        OPTIONAL {
          ?auturi owl:sameAs ?ids.
        }
        BIND(COALESCE(?ids, CONCAT("No ID: ", ?aut)) AS ?idsClean)
      }
    }
    

    In the results, Marie and Emma now each get their own ?idsClean value ("No ID: " followed by the name).
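
The effect of the COALESCE fallback on the grouping key can be sketched in Python as well. This is a rough model under stated assumptions: the rows and the shared identifier are hypothetical, None stands for an unbound ?ids, the coalesce helper is written here for illustration, and "keep the first row" stands in for SAMPLE.

```python
# Hypothetical rows (auturi, aut, ids); SHARED_ID is a placeholder value.
SHARED_ID = "http://example.org/id/pierre-mounier"
ROWS = [
    ("http://www.idref.fr/050447823/id",
     "Mounier, Pierre (1970-....)", SHARED_ID),
    ("https://data.archives-ouvertes.fr/author/pierre-mounier",
     "Pierre Mounier", SHARED_ID),
    ("https://data.archives-ouvertes.fr/author/emma-bester",
     "Emma Bester", None),
    ("https://data.archives-ouvertes.fr/author/marie-masclet-de-barbarin",
     "Marie Masclet de Barbarin", None),
]


def coalesce(*values):
    """Return the first value that is not None (a stand-in for COALESCE)."""
    for value in values:
        if value is not None:
            return value
    return None


def sample_per_clean_ids(rows):
    """Group by ids, falling back to 'No ID: <name>' when ids is missing."""
    groups = {}
    for auturi, aut, ids in rows:
        key = coalesce(ids, "No ID: " + aut)
        groups.setdefault(key, (auturi, aut))
    return groups


clean = sample_per_clean_ids(ROWS)
print(sorted(clean))  # one key per distinct identifier or fallback string
```

Pierre's two rows still collapse into one group through the shared identifier, while Emma and Marie each keep a group of their own. Note that the fallback is built from the name, so two identifier-less authors with exactly the same name would still be merged.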

    Putting this method to work in the larger query, we obtain:

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    
    SELECT distinct ?aut ?auturi
    WHERE {
      SELECT distinct (SAMPLE(?auturi) AS ?auturi) (SAMPLE(?aut) AS ?aut) ?ids_clean
      WHERE {
        {
          ?uri ?rel <http://www.idref.fr/139753753/id>. #entities our author has a link with
          ?uri ?relcontrib ?auturi. #other with a link to these entities
          ?auturi a foaf:Person. #filter for persons
          ?auturi skos:prefLabel ?aut. #get authors' name
          FILTER (?auturi != <http://www.idref.fr/139753753/id>) #exclude the same author we're querying
          OPTIONAL {
            ?auturi owl:sameAs ?ids. #get authors' identifiers
          }
          BIND(COALESCE(?ids, CONCAT("No ID: ", ?aut)) AS ?ids_clean)
        } UNION {
          <http://www.idref.fr/139753753/id> owl:sameAs ?id.
          FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
          BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
          SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
            ?idHal foaf:publications ?uri. #same as above
            ?auturi foaf:publications ?uri.
            ?auturi foaf:name ?aut.
            FILTER (?idHal != ?auturi)
            OPTIONAL {
              ?auturi owl:sameAs ?ids.
            }
            BIND(COALESCE(?ids, CONCAT("No ID: ", ?aut)) AS ?ids_clean)
          }
        }
      }
    }
    

    And this yields the correct 12 results.