sparql wdqs

Identify entity of a Wikipedia page


My question is related to a similar question/comment which unfortunately never received an answer.

Given a list of multiple Wikipedia pages, e.g.:

how can I find out what type of entity these articles refer to? Ideally, I would want something at a higher level, e.g. person, movie, animal, etc.

My best guess so far was to use the Wikidata API with SPARQL to walk up the instance_of (P31) or subclass (P279) tree. However, this did not lead to meaningful results.

SELECT ?lemma ?item ?itemLabel ?itemDescription ?instance ?instanceLabel ?subclassLabel WHERE {
  VALUES ?lemma {
    "Donald Trump"@en
    "The Matrix"@en
    "Tiger"@en
  }
  ?sitelink schema:about ?item;
    schema:isPartOf <https://en.wikipedia.org/>;
    schema:name ?lemma.
  ?item wdt:P31* ?instance.
  ?item wdt:P279* ?subclass.
  SERVICE wikibase:label { 
    bd:serviceParam wikibase:language "en,da,sv".}
}

The result can be seen here: https://w.wiki/ZmQ

One option would of course be to look at the itemDescription, but I'm afraid this is too granular to build meaningful groups from larger lists and count frequencies later on. Does anyone have a hint or idea on how to get more general entity categories? Maybe also from the MediaWiki API?

Any input would be highly appreciated!


Solution

  • Here are three possibilities, side-by-side:

    SELECT ?lemma ?item
           (GROUP_CONCAT(DISTINCT ?instanceLabel; SEPARATOR = " ") AS ?a)
           (GROUP_CONCAT(DISTINCT ?subclassLabel; SEPARATOR = " ") AS ?b)
           (GROUP_CONCAT(DISTINCT ?isaLabel; SEPARATOR = " ") AS ?c)
    WHERE {
      VALUES ?lemma {
        "Donald Trump"@en
        "The Matrix"@en
        "Tiger"@en
      }
      ?sitelink schema:about ?item;
        schema:isPartOf <https://en.wikipedia.org/>;
        schema:name ?lemma.
      OPTIONAL { ?item (wdt:P31/(wdt:P279*)) ?instance. }
      OPTIONAL { ?item wdt:P279 ?subclass. }
      OPTIONAL { ?item wdt:P31 ?isa. }
      SERVICE wikibase:label {
        bd:serviceParam wikibase:language "en,da,sv".
        ?instance rdfs:label ?instanceLabel.
        ?subclass rdfs:label ?subclassLabel.
        ?isa rdfs:label ?isaLabel.
      }
      # Here, you could add a filter on the labels you care about, e.g.:
      # FILTER(?instanceLabel IN ("mammal"@en, "film"@en, "musical"@en))
    }
    GROUP BY ?lemma ?item
    

    Live here.

    If you only care about a couple dozen labels at most, such as "film" and "mammal", you could list them explicitly in order of preference, then use the first one that occurs for each item.

    Note that you may be running into this bug: https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial#wikibase:Label_and_aggregations_bug
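
The "list them in order of preference, then use the first one that occurs" idea can be sketched client-side in Python, applied to the concatenated label column returned by the query. This is only a rough sketch under a few assumptions: the names `PREFERRED_LABELS`, `split_labels` and `pick_category` are hypothetical, the preference list is purely illustrative, and the query's `SEPARATOR` is assumed to have been changed from `" "` to `"|"`, since labels containing spaces cannot be split reliably on a space.

```python
from typing import Iterable, List, Optional

# Hypothetical preference order (most preferred first); adjust to your data.
PREFERRED_LABELS = ["human", "film", "mammal", "taxon"]


def split_labels(concatenated: str, separator: str = "|") -> List[str]:
    """Split a GROUP_CONCAT value into labels, dropping empty entries.

    Assumes the query used SEPARATOR = "|" rather than " ".
    """
    return [part.strip() for part in concatenated.split(separator) if part.strip()]


def pick_category(
    labels: Iterable[str],
    preferred: Iterable[str] = PREFERRED_LABELS,
) -> Optional[str]:
    """Return the first preferred label found among `labels`, or None."""
    # Compare case-insensitively so "Film" and "film" are treated alike.
    available = {label.strip().lower() for label in labels if label.strip()}
    for candidate in preferred:
        if candidate.lower() in available:
            return candidate
    return None


if __name__ == "__main__":
    # Made-up example value in the shape of the ?a column, not real query output.
    row_value = "taxon|mammal|organism"
    print(pick_category(split_labels(row_value)))
```

Items for which `pick_category` returns `None` match none of the listed labels; they could be grouped under a catch-all bucket before counting frequencies, for instance with `collections.Counter`.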