Tags: gremlin, wikidata, amazon-neptune, wikimedia-dumps

Sample code to convert Wikidata dumps to Gremlin format


Could you share some sample code to convert Wikidata dumps to Gremlin format, please?

I would like to load the converted Gremlin CSV data into AWS Neptune.


Solution

  • As discussed in your other question, Amazon Neptune will happily load that RDF format data directly, but you would need to query it using SPARQL. Unless you absolutely need to get the data into property graph format, loading the data as-is and using SPARQL would get you up and running very quickly.
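
For reference, loading the RDF as-is goes through the same bulk loader endpoint that is used for CSV. Here is a minimal sketch of the request, assuming the dump file is already in S3 and the cluster listens on the default port 8182. The bucket, role ARN, region, and the helper names `build_load_request` and `submit_load` are placeholders for illustration, not values from the question.

```python
import json
import urllib.request


def build_load_request(source, rdf_format, iam_role_arn, region):
    """Build the JSON body for a Neptune bulk loader job."""
    return {
        "source": source,            # S3 URI of the file or folder to load
        "format": rdf_format,        # e.g. "turtle" or "ntriples" for RDF data
        "iamRoleArn": iam_role_arn,  # role that lets Neptune read from S3
        "region": region,
        "failOnError": "FALSE",
    }


def submit_load(endpoint, body):
    """POST the job to the cluster's /loader endpoint (needs network access to it)."""
    request = urllib.request.Request(
        f"https://{endpoint}:8182/loader",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder values for illustration only.
body = build_load_request(
    source="s3://my-bucket/wikidata/sample.ttl",
    rdf_format="turtle",
    iam_role_arn="arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    region="us-east-1",
)
print(json.dumps(body, indent=2))
```

`submit_load` is not called here because it only works from inside the cluster's VPC, and a cluster with IAM database authentication enabled would also need signed requests.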

    To use Gremlin or openCypher, the data will need to be converted to an equivalent property graph form. You have a couple of options:

    1. Convert the RDF format data into equivalent CSV file format so that the Neptune bulk loader can load it for you.
    2. Convert the RDF format data into Gremlin addV and addE steps, or openCypher CREATE and MERGE clauses.

    If you have a lot of data to load, the CSV files and bulk loader will be the easier route.
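
For smaller amounts of data, option 2 could be sketched as a script that emits Gremlin text, one query per node or edge. This is only an illustration: the helper names `quote`, `add_vertex`, and `add_edge` are made up for this example, and the sample values (a `dog` vertex called `max`, and so on) match the small example further down.

```python
def quote(value):
    """Render a Python value as a Gremlin (Groovy) literal."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape backslashes and single quotes so the string literal stays valid.
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def add_vertex(vertex_id, label, properties):
    """Return an addV() query that sets a custom ID and the given properties."""
    steps = [f"g.addV({quote(label)})", f"property(id, {quote(vertex_id)})"]
    steps += [f"property({quote(k)}, {quote(v)})" for k, v in properties.items()]
    return ".".join(steps)


def add_edge(label, from_id, to_id):
    """Return an addE() query between two vertices that already exist."""
    return f"g.V({quote(from_id)}).addE({quote(label)}).to(__.V({quote(to_id)}))"


print(add_vertex("max", "dog", {"age": 3}))
print(add_vertex("fido", "dog", {"age": 6}))
print(add_edge("likes", "max", "fido"))
```

The generated strings can then be submitted to the Gremlin endpoint with whatever client you prefer. All vertices have to be created before the edges that refer to them.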

    Converting from RDF to property graph format is conceptually simple, but you will probably have to do it yourself. I'm aware of tools that go the other way (CSV to RDF), but not of one that will take, say, a TTL file and turn it into CSV.

    If you are comfortable writing a little code, a Python or Ruby script is all you really need, and converting this data is quite straightforward. You just have to convert the triple patterns into nodes and edges (with properties).

    So, imagine the RDF contains triples that are essentially in this form:

    max a dog 
    fido a dog 
    max age 3 
    fido age 6 
    max likes fido
    

    You would end up creating two nodes, two properties and an edge.
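
A rough sketch of that mapping in Python, working on the simplified triples above rather than real RDF syntax. The function name `convert` and the fallback label `vertex` are assumptions for this example. In real RDF you would decide between "edge" and "property" by checking whether the object is an IRI or a literal; here the stand-in rule is "the object is itself a typed node".

```python
TRIPLES = """\
max a dog
fido a dog
max age 3
fido age 6
max likes fido
"""


def convert(lines):
    """Turn 'subject predicate object' lines into node and edge records."""
    triples = [line.strip().split(None, 2) for line in lines if line.strip()]

    # Pass 1: every subject of an "a" (type) triple becomes a node with that label.
    nodes = {s: {"label": o, "properties": {}} for s, p, o in triples if p == "a"}

    # Pass 2: the remaining triples become edges or properties.
    edges = []
    for s, p, o in triples:
        if p == "a":
            continue
        # Subjects that never had a type triple get a fallback label.
        node = nodes.setdefault(s, {"label": "vertex", "properties": {}})
        if o in nodes:
            edges.append({"id": f"e{len(edges) + 1}", "label": p, "from": s, "to": o})
        else:
            node["properties"][p] = o  # values stay strings in this sketch
    return nodes, edges


nodes, edges = convert(TRIPLES.splitlines())
print(nodes)
print(edges)
```

Note that this keeps only one value per property. Wikidata entities often have several values for the same predicate, so a real converter would need to collect those into a list or a set.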

    In CSV form the nodes would look like:

    ~id, ~label, age
    max,dog,3
    fido,dog,6
    

    and the edge would be:

    ~id,~label,~from,~to
    e1,likes,max,fido
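
Producing those two files from a script might look like the following sketch, which uses Python's `csv` module and writes to strings so it can be run as-is. The helper name `to_csv` is made up for this example. The `age:Int` header is an assumption on my part: it uses the loader's optional type suffix so that the ages load as integers instead of the default string type.

```python
import csv
import io

# Node and edge records for the dog example, keyed by the CSV column headers.
nodes = [
    {"~id": "max", "~label": "dog", "age:Int": 3},
    {"~id": "fido", "~label": "dog", "age:Int": 6},
]
edges = [
    {"~id": "e1", "~label": "likes", "~from": "max", "~to": "fido"},
]


def to_csv(rows):
    """Serialize dict rows to CSV text, using the keys of the first row as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(nodes))
print(to_csv(edges))
```

In practice you would write the node and edge rows to separate files, upload them to S3, and start a bulk load job with the format set to `csv`.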
    

    If you plan on converting all the data, and it is just too much for a script-based approach, a big data ETL approach, such as Spark, is likely the way to go. There are many ways to approach this, and it is not a super hard task. I'm just not aware of a tool that will do it for you (there may be one).