How to get all nouns in a certain language from Wiktionary using SPARQL


I'm trying to query Wiktionary with SPARQL to get all terms that are nouns in a given language (for example, German). The output should include:

  • the string of the noun
  • the grammatical gender (genus): masculine, feminine, or neuter

I am using the SPARQL endpoint http://wiktionary.dbpedia.org/sparql. I found an example query, but I couldn't figure out how to adapt it to get the information I want:

PREFIX terms:<http://wiktionary.dbpedia.org/terms/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc:<http://purl.org/dc/elements/1.1/>
SELECT ?sword ?slang ?spos ?ssense ?twordRes ?tword ?tlang
FROM <http://wiktionary.dbpedia.org>
WHERE {
    ?swordRes terms:hasTranslation ?twordRes .
    ?swordRes rdfs:label ?sword .
    ?swordRes dc:language ?slang .
    ?swordRes terms:hasPoS ?spos .
    OPTIONAL { ?swordRes terms:hasMeaning ?ssense . }
    OPTIONAL { 
           ?twordBaseRes terms:hasLangUsage ?twordRes . 
           ?twordBaseRes rdfs:label ?tword .
    }
    OPTIONAL { ?twordRes dc:language ?tlang . }
}

Solution

  • First of all, you want to select all term senses that are nouns. As you can see in the query result of the example query, this information is captured by the terms:hasPoS relation. So, to specifically query all nouns, we could do this:

    PREFIX terms: <http://wiktionary.dbpedia.org/terms/>
    SELECT ?term
    WHERE { 
         ?term terms:hasPoS terms:Noun . 
    }
    LIMIT 100 
    

    The next thing you want is only nouns of a certain language. This seems to be covered by the dc:language relation, so we add an additional constraint on that relation. Let's say we want all English nouns:

    PREFIX terms: <http://wiktionary.dbpedia.org/terms/>
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    
    SELECT ?term
    WHERE { 
        ?term terms:hasPoS terms:Noun ;
              dc:language terms:English . 
    }
    LIMIT 100 
    

    So, we are now selecting what you want, but we don't yet have the output in the format you want, as the above query just gives back the identifier of the term sense, not the string-value of the actual term. As we can see in the output from the example query, the string value is captured by the rdfs:label property, so we add that:

    PREFIX terms: <http://wiktionary.dbpedia.org/terms/>
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
    
    SELECT ?term ?termLabel
    WHERE { 
        ?term terms:hasPoS terms:Noun ;
              dc:language terms:English ;
              rdfs:label ?termLabel .
    }
    LIMIT 100
    

    If you look at this query's result, you'll see something odd going on with the language: even though we selected English, we also get back labels with a different language tag (e.g. '@ru'). To remove these results, we can restrict the query further and ask only for labels tagged as English:

    PREFIX terms: <http://wiktionary.dbpedia.org/terms/>
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
    
    SELECT ?term ?termLabel
    WHERE { 
        ?term terms:hasPoS terms:Noun ;
              dc:language terms:English ;
              rdfs:label ?termLabel .
        FILTER(langMatches(lang(?termLabel), "en"))
    }
    LIMIT 100
    

    Finally, the gender/genus. Here I'm not really sure. Looking at some example resources in the wiktionary data (for example, the entry for dog) I'd say this information is not actually present in the data.
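
Since the question asks about German rather than English, the last query can be parameterized by language. The following is only a sketch: `build_noun_query` and `run_query` are hypothetical helper names, and `terms:German` with the label tag `de` are assumptions made by analogy with `terms:English` and `en`; they have not been checked against the endpoint, which may also no longer be reachable.

```python
import json
import urllib.parse
import urllib.request
from string import Template

# Endpoint named in the question; it may no longer be online.
ENDPOINT = "http://wiktionary.dbpedia.org/sparql"

# Same shape as the final query in the answer, with the language
# resource, the label language tag, and the limit left as placeholders.
_NOUN_QUERY = Template("""\
PREFIX terms: <http://wiktionary.dbpedia.org/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?term ?termLabel
WHERE {
    ?term terms:hasPoS terms:Noun ;
          dc:language terms:$language ;
          rdfs:label ?termLabel .
    FILTER(langMatches(lang(?termLabel), "$tag"))
}
LIMIT $limit
""")


def build_noun_query(language="German", tag="de", limit=100):
    """Return the noun query for one language.

    `language` is the local name under the terms: namespace
    (assumed to exist, e.g. terms:German); `tag` is the label
    language tag to keep (assumed, e.g. "de").
    """
    # Reject anything that could break out of the query template.
    if not language.isalpha():
        raise ValueError("language must contain letters only")
    if not tag.replace("-", "").isalnum():
        raise ValueError("tag must be a simple language tag")
    return _NOUN_QUERY.substitute(language=language, tag=tag, limit=int(limit))


def run_query(query, endpoint=ENDPOINT):
    """Send the query with an HTTP GET and return the parsed JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `build_noun_query()` yields the German variant of the query, and `build_noun_query("English", "en")` reproduces the one shown above. `run_query` is only needed if you want to send the query from a script instead of pasting it into the endpoint's web form.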