Tags: rdf, wikidata, n-triples

Parsing Wikidata N-Triples data


I'm working with WikiData and RDF for the first time. I downloaded the WikiData 24GB "truthy" dataset (available only in N-Triples .nt format), but now I have a hard time understanding it.

Here are some lines from the .nt file related to Jack Bauer showing (subject, predicate, object) triples:

<http://www.wikidata.org/entity/Q24> <http://schema.org/description> "protagonista della serie televisiva americana ''24''"@it .

<http://www.wikidata.org/entity/Q24> <http://schema.org/name> "Jack Bauer"@en .

<http://www.wikidata.org/entity/Q24> <http://www.wikidata.org/prop/direct/P27> <http://www.wikidata.org/entity/Q30> .

<http://www.wikidata.org/entity/Q24> <http://www.wikidata.org/prop/direct/P451> <http://www.wikidata.org/entity/Q284262> .
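The lines themselves parse fine with a standard RDF library. For example, a minimal sketch using Python's rdflib (just one common choice, nothing about the dataset requires it):

from rdflib import Graph

# Two of the lines from the dump, verbatim
nt_data = """
<http://www.wikidata.org/entity/Q24> <http://schema.org/name> "Jack Bauer"@en .
<http://www.wikidata.org/entity/Q24> <http://www.wikidata.org/prop/direct/P27> <http://www.wikidata.org/entity/Q30> .
"""

g = Graph()
g.parse(data=nt_data, format="nt")  # "nt" selects rdflib's N-Triples parser

# Each parsed triple comes back as (subject, predicate, object) terms
for s, p, o in g:
    print(s.n3(), p.n3(), o.n3())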

So my questions are:

  1. Are all the URIs for the triples resolvable to English from this one giant file, or do I have to make API calls? For example, I want to resolve this triple:
<http://www.wikidata.org/entity/Q24> <http://www.wikidata.org/prop/direct/P27> <http://www.wikidata.org/entity/Q30> .

into an English human-readable form like this:

Jack Bauer, country of citizenship, United States of America

Does this file contain the needed information to get the English-readable names for Q24, P27, and Q30? Or would I have to make separate HTTP API calls to resolve them?

  2. I can also get a .json dump of this file. Am I correct in understanding that the .nt triples are simply a depth-first traversal of the JSON hierarchy to flatten all the data into triples?

Solution

  • Are all the URIs for the triples resolvable to English from this one giant file, or do I have to make API calls?

    Resolving the triples to English would need a different representation of the triples, such as http://wiki.bitplan.com/index.php/SiDIF. Most RDF serializations are not very readable for humans; https://www.w3.org/TR/turtle/ is one of the more readable ones, and https://gbv.github.io/aREF/aREF.html is also a good idea. The general RDF toolchain is not very programmer-friendly; see JSON-LD and "Why I Hate the Semantic Web".

    You might want to import the triples into a SPARQL store and then use a query frontend for it. That will simplify your life a lot. It's the kind of "API" you might have been thinking of.

    See http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData for a description of the procedure. As of 2020-05-11 I am importing into Apache Jena, for example.
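    Once the dump is in a local store you can query it over HTTP with any client. Below is a minimal sketch, assuming a local Apache Jena Fuseki server with the dump loaded into a dataset named "wikidata" (both the port and the dataset name are assumptions that depend on your setup); the schema:name predicate is the one visible in your .nt excerpt.

    import requests

    # Hypothetical local endpoint: Fuseki serves datasets at
    # http://localhost:3030/<dataset>/query by default.
    ENDPOINT = "http://localhost:3030/wikidata/query"

    QUERY = """
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX schema: <http://schema.org/>
    SELECT ?name WHERE {
      wd:Q24 schema:name ?name .
      FILTER(lang(?name) = "en")
    }
    """

    resp = requests.get(ENDPOINT, params={"query": QUERY},
                        headers={"Accept": "application/sparql-results+json"})
    for row in resp.json()["results"]["bindings"]:
        print(row["name"]["value"])  # expected: Jack Bauer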

    The https://query.wikidata.org/ query frontend might be easier to use for simple queries. Please find below a query that represents the triples you found.

    # WikiData statements about Jack Bauer
    SELECT ?pLabel ?oLabel 
    WHERE 
    {
      wd:Q24 ?p ?o.
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
    }
    

    Try it at https://query.wikidata.org/.
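    The same query can also be run programmatically against the public endpoint at https://query.wikidata.org/sparql. A minimal sketch (the User-Agent string is a made-up example; Wikimedia asks scripted clients to identify themselves):

    import requests

    # wd:, wikibase: and bd: are predefined prefixes on this endpoint,
    # so no PREFIX declarations are needed.
    QUERY = """
    SELECT ?pLabel ?oLabel WHERE {
      wd:Q24 ?p ?o .
      # [AUTO_LANGUAGE] is substituted by the web frontend, so use a
      # concrete language code when calling the endpoint directly.
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """

    resp = requests.get("https://query.wikidata.org/sparql",
                        params={"query": QUERY},
                        headers={"Accept": "application/sparql-results+json",
                                 "User-Agent": "nt-example/0.1 (demo script)"})
    for row in resp.json()["results"]["bindings"]:
        print(row.get("pLabel", {}).get("value"), "->",
              row.get("oLabel", {}).get("value"))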

  • Does this file contain the needed information to get the English-readable names for Q24, P27, and Q30? Or would I have to make separate HTTP API calls to resolve them?

    The file should contain the information, since "truthy" only means you don't get the provenance data; you still get all the fact data. Working with WikiData can be quite cumbersome, see http://wiki.bitplan.com/index.php/WikiData. There are libraries out there that will help you deal with WikiData directly via a programming language API, e.g. https://github.com/Wikidata/Wikidata-Toolkit for Java. See https://www.wikidata.org/wiki/Wikidata:Tools/For_programmers for a more comprehensive list.
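    As a sketch of doing the lookup from the dump alone: stream the decompressed .nt file line by line and collect the English schema:name triples for the IRIs you care about. The file name below is a placeholder for your local copy, and this assumes the label triples for properties are present in the truthy dump (per the point above that the file contains the needed information); note that the label of P27 hangs off the entity IRI wd:P27, not the prop/direct one.

    # Collect English names for a handful of Wikidata IRIs by streaming
    # the (decompressed) truthy dump; no SPARQL store, no API calls.
    wanted = {
        "<http://www.wikidata.org/entity/Q24>",
        "<http://www.wikidata.org/entity/P27>",  # entity IRI, not prop/direct
        "<http://www.wikidata.org/entity/Q30>",
    }
    NAME = "<http://schema.org/name>"  # the label predicate seen in the excerpt

    labels = {}
    with open("latest-truthy.nt", encoding="utf-8") as f:  # placeholder path
        for line in f:
            # Naive split on single spaces; fine for the layout Wikidata uses
            parts = line.split(" ", 2)
            if len(parts) < 3:
                continue
            s, p, rest = parts
            if s in wanted and p == NAME and rest.rstrip(" .\n").endswith("@en"):
                labels[s] = rest.rstrip(" .\n")

    print(labels)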

  • I can also get a .json dump of this file. Am I correct in understanding that the .nt triples are simply a depth-first traversal of the JSON hierarchy to flatten all the data into triples?

    The content of the triples should be the same; I am not sure what the order of the triples in the JSON dump is. The bad news is that it's not sufficient to import just a part of the dump, because you'll lose the link information.
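    To illustrate the flattening, here is a toy version of the mapping (not the real converter; the field names follow the Wikidata JSON dump format, but the claim structure is heavily trimmed):

    # Flatten a (heavily simplified) Wikidata JSON entity into triples.
    entity = {
        "id": "Q24",
        "labels": {"en": {"language": "en", "value": "Jack Bauer"}},
        "claims": {
            "P27": [{"mainsnak": {"datavalue": {"value": {"id": "Q30"}}}}],
        },
    }

    WD = "http://www.wikidata.org/entity/"
    WDT = "http://www.wikidata.org/prop/direct/"

    subject = f"<{WD}{entity['id']}>"
    triples = []
    # Labels become language-tagged literals ...
    for lang, label in entity["labels"].items():
        triples.append((subject, "<http://schema.org/name>",
                        f'"{label["value"]}"@{lang}'))
    # ... and each claim becomes one "direct" (truthy) triple.
    for pid, statements in entity["claims"].items():
        for st in statements:
            target = st["mainsnak"]["datavalue"]["value"]["id"]
            triples.append((subject, f"<{WDT}{pid}>", f"<{WD}{target}>"))

    for s, p, o in triples:
        print(s, p, o, ".")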