I'm querying the IdRef SPARQL endpoint to get a researcher's co-authors. To get more complete results, I'm running a federated query against the HAL endpoint.
My query works pretty well but generates duplicates, which I want to de-duplicate using authority identifiers (ORCID, ISNI or whatever).
So far I have arrived at the following query, but now my problem is that one result is missing.
My query is:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT distinct ?aut ?auturi
WHERE {
  SELECT distinct (SAMPLE(?auturi) AS ?auturi) (SAMPLE(?aut) AS ?aut) ?ids
  WHERE {
    {
      ?uri ?rel <http://www.idref.fr/139753753/id>. #entities our author has a link with
      ?uri ?relcontrib ?auturi. #other with a link to these entities
      ?auturi a foaf:Person. #filter for persons
      ?auturi skos:prefLabel ?aut. #get authors' name
      FILTER (?auturi != <http://www.idref.fr/139753753/id>) #exclude the same author we're querying
      OPTIONAL {
        ?auturi owl:sameAs ?ids. #get authors' identifiers
      }
    } UNION {
      <http://www.idref.fr/139753753/id> owl:sameAs ?id.
      FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
      BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
      SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
        ?idHal foaf:publications ?uri. #same as above
        ?auturi foaf:publications ?uri.
        ?auturi foaf:name ?aut.
        FILTER (?idHal != ?auturi)
        OPTIONAL {
          ?auturi owl:sameAs ?ids.
        }
      }
    }
  }
}
As you can see, I'm using a subquery with SAMPLE to perform the "de-duplication", but it doesn't work as I expected: one result is dropped. The un-sampled subquery returns an extra result matching this URI: https://data.archives-ouvertes.fr/author/marie-masclet-de-barbarin.rdf
At first I thought it was because this result had no matching owl:sameAs object, but another result in the set has none either and yet appears in the final result set.
I'm quite puzzled by this behaviour and I suspect it is because I don't fully understand how SAMPLE works. Maybe there is a more accurate way to achieve what I'm looking for.
Edit: results (with duplicates) are as follows:
# auturi aut
1 http://www.idref.fr/057577889/id Lantenois, Annick (1956-....)
2 http://www.idref.fr/033888760/id Cubaud, Pierre
3 http://www.idref.fr/028984838/id Suber, Peter
4 http://www.idref.fr/165836652/id Cramer, Florian (1969-....)
5 http://www.idref.fr/050447823/id Mounier, Pierre (1970-....)
6 http://www.idref.fr/174428006/id Ena, Alexandra (19..-....)
7 http://www.idref.fr/052212807/id Lebert, Marie
8 https://data.archives-ouvertes.fr/author/pierre-mounier Pierre Mounier
9 https://data.archives-ouvertes.fr/author/patrice-bellot Patrice Bellot
10 https://data.archives-ouvertes.fr/author/marlene-delhaye Marlène Delhaye
11 https://data.archives-ouvertes.fr/author/denis-bertin Denis Bertin
12 https://data.archives-ouvertes.fr/author/emma-bester Emma Bester
13 https://data.archives-ouvertes.fr/author/marie-masclet-de-barbarin Marie Masclet de Barbarin
Basically the only duplicates are #5 and #8. They can be identified as such because they share a common ?ids object (omitted from the results above for clarity).
Marie Masclet de Barbarin is hidden precisely because there is another person, Emma Bester, who also does not have an owl:sameAs edge.
Consider this query:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?auturi ?aut ?ids
WHERE {
  <http://www.idref.fr/139753753/id> owl:sameAs ?id.
  FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
  BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
  SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
    ?idHal foaf:publications ?uri. #same as above
    ?auturi foaf:publications ?uri.
    ?auturi foaf:name ?aut.
    FILTER (?idHal != ?auturi)
    OPTIONAL {
      ?auturi owl:sameAs ?ids.
    }
  }
}
Notice that many of these people have multiple owl:sameAs values, and these values all differ from one person to the next.
However, Marie and Emma have no value at all, so ?ids is left unbound (effectively a 'null' value) for both of them.
So, when sampling the author name and URI (grouping by ?ids), we can use the following query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT (SAMPLE(?auturi) AS ?auturi) (SAMPLE(?aut) AS ?aut) ?ids
WHERE {
  <http://www.idref.fr/139753753/id> owl:sameAs ?id.
  FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
  BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
  SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
    ?idHal foaf:publications ?uri. #same as above
    ?auturi foaf:publications ?uri.
    ?auturi foaf:name ?aut.
    FILTER (?idHal != ?auturi)
    OPTIONAL {
      ?auturi owl:sameAs ?ids.
    }
  }
}
This only has 11 results, however, with Marie missing.
Why? Because ?ids is unbound for two separate authors, so both of them fall into the same group. By sampling we ask for only one author per group, so the second one gets skipped.
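The grouping behaviour can be reproduced without either endpoint. The following is a minimal sketch, assuming a SPARQL 1.1 engine; the inline VALUES rows and the http://example.org/ identifier are made up for illustration, with UNDEF standing in for a missing owl:sameAs:

```sparql
# Hypothetical inline data: two authors with no identifier (UNDEF)
# and two records that share one made-up identifier.
SELECT (SAMPLE(?name) AS ?anyName) ?ids
WHERE {
  VALUES (?name ?ids) {
    ("Emma Bester"               UNDEF)
    ("Marie Masclet de Barbarin" UNDEF)
    ("Pierre Mounier"            <http://example.org/id/1>)
    ("Mounier, Pierre"           <http://example.org/id/1>)
  }
}
GROUP BY ?ids
```

Under these assumptions the query should return two rows rather than three: one for the shared identifier, and a single row for the unbound group, in which SAMPLE keeps either Emma or Marie.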
So why is Marie skipped 100% of the time and not 50%? SAMPLE is not random: it may return any value from the group, and an engine will typically return the same one every time. Most likely the choice is determined by the order in which the triples were loaded into the store, so SAMPLE is deterministic given a certain loading sequence. If you loaded the data into a different machine, possibly with a different triplestore, Emma might be the one that is skipped.
How to solve this?
The hard part is that Pierre Mounier exists as almost two different entities, with two ?ids and even two text names, "Pierre Mounier" and "Mounier, Pierre (1970-....)".
Thus, the obvious solution of sampling ?auturi and grouping by ?aut will show Marie, but will not deduplicate Pierre.
A better solution is to use COALESCE to bind ?ids to a different value for each author, instead of leaving it unbound for both. This is done like this:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?auturi ?aut ?idsClean
WHERE {
  <http://www.idref.fr/139753753/id> owl:sameAs ?id.
  FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
  BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
  SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
    ?idHal foaf:publications ?uri. #same as above
    ?auturi foaf:publications ?uri.
    ?auturi foaf:name ?aut.
    FILTER (?idHal != ?auturi)
    OPTIONAL {
      ?auturi owl:sameAs ?ids.
    }
    BIND(COALESCE(?ids, CONCAT("No ID: ", ?aut)) AS ?idsClean)
  }
}
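The effect of the COALESCE fallback on grouping can be checked in isolation. This is a small sketch, assuming a SPARQL 1.1 engine; the VALUES rows, the http://example.org/ identifier and the ?key variable are hypothetical and only stand in for the real data:

```sparql
# Hypothetical inline data; UNDEF marks an author without owl:sameAs.
SELECT (SAMPLE(?name) AS ?anyName) ?key
WHERE {
  VALUES (?name ?ids) {
    ("Emma Bester"               UNDEF)
    ("Marie Masclet de Barbarin" UNDEF)
    ("Pierre Mounier"            <http://example.org/id/1>)
    ("Mounier, Pierre"           <http://example.org/id/1>)
  }
  # Fall back to a per-author string when ?ids is unbound,
  # so that authors without an identifier no longer share a group.
  BIND(COALESCE(?ids, CONCAT("No ID: ", ?name)) AS ?key)
}
GROUP BY ?key
```

With the fallback in place, Emma and Marie each get their own key, so the query should return three rows, while the two Pierre records still collapse into one.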
Putting this method to work in the larger query, we obtain:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT distinct ?aut ?auturi
WHERE {
  SELECT distinct (SAMPLE(?auturi) AS ?auturi) (SAMPLE(?aut) AS ?aut) ?ids_clean
  WHERE {
    {
      ?uri ?rel <http://www.idref.fr/139753753/id>. #entities our author has a link with
      ?uri ?relcontrib ?auturi. #other with a link to these entities
      ?auturi a foaf:Person. #filter for persons
      ?auturi skos:prefLabel ?aut. #get authors' name
      FILTER (?auturi != <http://www.idref.fr/139753753/id>) #exclude the same author we're querying
      OPTIONAL {
        ?auturi owl:sameAs ?ids. #get authors' identifiers
      }
      BIND(COALESCE(?ids, CONCAT("No ID: ", ?aut)) AS ?ids_clean)
    } UNION {
      <http://www.idref.fr/139753753/id> owl:sameAs ?id.
      FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
      BIND(URI(REPLACE(STR(?id), "#.*", "")) as ?idHal) #get an ID to query HAL
      SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
        ?idHal foaf:publications ?uri. #same as above
        ?auturi foaf:publications ?uri.
        ?auturi foaf:name ?aut.
        FILTER (?idHal != ?auturi)
        OPTIONAL {
          ?auturi owl:sameAs ?ids.
        }
        BIND(COALESCE(?ids, CONCAT("No ID: ", ?aut)) AS ?ids_clean)
      }
    }
  }
}
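The query above relies on the endpoint accepting SAMPLE next to a plain ?ids_clean with no GROUP BY, and on re-using ?auturi and ?aut as alias names. Engines that follow SPARQL 1.1 more strictly may reject both. Below is a hedged sketch of a syntactically stricter form; it has not been tested against the IdRef endpoint, and the alias names ?anyUri and ?anyName are chosen here only for illustration:

```sparql
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

SELECT DISTINCT ?anyName ?anyUri
WHERE {
  # One row per identifier (or per "No ID" fallback key).
  SELECT (SAMPLE(?auturi) AS ?anyUri) (SAMPLE(?aut) AS ?anyName) ?ids_clean
  WHERE {
    {
      ?uri ?rel <http://www.idref.fr/139753753/id> .
      ?uri ?relcontrib ?auturi .
      ?auturi a foaf:Person ;
              skos:prefLabel ?aut .
      FILTER (?auturi != <http://www.idref.fr/139753753/id>)
      OPTIONAL { ?auturi owl:sameAs ?ids . }
      BIND(COALESCE(?ids, CONCAT("No ID: ", ?aut)) AS ?ids_clean)
    } UNION {
      <http://www.idref.fr/139753753/id> owl:sameAs ?id .
      FILTER (CONTAINS(STR(?id), "archives-ouvertes.fr"))
      BIND(URI(REPLACE(STR(?id), "#.*", "")) AS ?idHal)
      SERVICE <http://sparql.archives-ouvertes.fr/sparql> {
        ?idHal foaf:publications ?uri .
        ?auturi foaf:publications ?uri ;
                foaf:name ?aut .
        FILTER (?idHal != ?auturi)
        OPTIONAL { ?auturi owl:sameAs ?ids . }
        BIND(COALESCE(?ids, CONCAT("No ID: ", ?aut)) AS ?ids_clean)
      }
    }
  }
  GROUP BY ?ids_clean
}
```

Note that the two SAMPLE calls are independent, so within a group that mixes an IdRef record and a HAL record the returned URI and name are not guaranteed to come from the same record.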