I'm just getting to grips with linked data, and of course, DBpedia, in hopes that it might be helpful in my work.
I am just trying to write a few SPARQL queries to get acquainted with the data and technology, but I am horrified by the results and am wondering if maybe I'm missing a core concept here. For example, if I want DBpedia to give me a list of all countries, I would naively imagine that every country is "of type" dbo:Country, and also that if something is "of type" dbo:Country, then that something should surely be a country.
So, I guess the naive SPARQL query to return all countries would simply be
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?concept
WHERE {?concept a dbo:Country}
Now, this query returns a lot of things I expect of it: existing countries, former countries, countries that are part of other countries, and of course, the Finland national cricket team.
Wait, WHAT?!?!?!
Why would this query return the Finland national cricket team? Surely, that can't be an entity of type Country, can it? Let me see...
http://dbpedia.org/page/Finland_national_cricket_team
Oh. It can be.
Is my understanding that this is a DBpedia mistake correct or not? Is all of linked data similarly polluted with outliers? I mean, there are more strange things in what my query returns, like Great Britain's basketball team, the Indiana Democratic Party, the United States Ambassador to Pakistan and so on. Is this pollution a given or am I simply missing a point of view here?
Is my understanding that this is a DBpedia mistake correct or not?
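Yes, I believe so. If you look closely at that page, you will notice that the Finland national cricket team is listed as the dbo:country of dbr:Jonathan_October. I don't quite understand why that is, but I think this is the source of the issue: something is being typed as a country because it appears as the value of a dbo:country property.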
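To see which resources point at the team through dbo:country, a query along these lines can be run against the public DBpedia endpoint. This is only a sketch: the dbr: prefix declaration is my addition, and the results depend on which DBpedia release the endpoint is serving, so the offending triple may or may not still be there.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# List every resource that names the cricket team as its dbo:country.
# Any hit here is a triple that treats the team as if it were a country.
SELECT ?subject
WHERE { ?subject dbo:country dbr:Finland_national_cricket_team }
```

Swapping in another suspicious result from the original query (for example, a basketball team) in place of dbr:Finland_national_cricket_team should show whether it follows the same pattern.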
Is all of linked data similarly polluted with outliers?
I don't think so; it always depends on the source of the data. But when the data is extracted automatically from something like Wikipedia, there will always be issues (though most of the time hopefully not on this scale).