
Convert triple store to Accumulo store


I have RDF files in XML format. I am building out a graph database, but was hoping everything could be stored in Accumulo format instead. Basically, I would just need to add an ID, visibility, and datetime to each triple entry.

See this link for the Accumulo format: https://accumulo.apache.org/docs/2.x/getting-started/design

Is it possible to take the existing triple format and add these values, or do I need to start from scratch?


Solution

  • Accumulo is said to be "schema-less" because it does not enforce any particular application schema. In order for you to use Accumulo, you must define your own schema based on your particular application's requirements, and impose that on how you build your keys.

    Different key components serve different roles in Accumulo:

    • The row portion of the key serves as a primary ordering (like a primary key in a relational database).
    • Column families serve the function of logical grouping of columns and provide a mechanism to define locality groups for fast lookups of common subsets of columns (like a columnar database whose columnar-ness is tunable).
    • Column qualifiers serve the purpose of traditional database column names, as a granular description.
    • Column visibilities are used for label-based access controls.
    • Timestamps are used for versioning data stored in Accumulo.
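
    As a rough illustration of how these components combine, here is a minimal Python sketch of the key sort order. The `Key` tuple and `sort_order` function are hypothetical stand-ins, not Accumulo's client API, and plain string comparison is assumed to approximate Accumulo's byte-wise ordering for ASCII data.

```python
from collections import namedtuple

# Hypothetical stand-in for an Accumulo key; the field names mirror the
# key components listed above. This is not the real client API.
Key = namedtuple("Key", "row family qualifier visibility timestamp")


def sort_order(key):
    # Row, family, qualifier and visibility sort ascending; the timestamp
    # sorts descending, so the newest version of a cell comes first.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


entries = [
    Key("bob", "knows", "alice", "", 1),
    Key("alice", "knows", "bob", "", 1),
    Key("alice", "knows", "bob", "", 2),
    Key("alice", "age", "30", "private", 1),
]

ordered = sorted(entries, key=sort_order)
for key in ordered:
    print(key)
```

    Everything for the row `alice` ends up adjacent, which is why the choice of row drives which lookups are cheap.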

    Depending on how you want your application to work, you could store something like subject:predicate:object as the row. Or you could store the subject in the row (assuming low-cardinality subjects, or appended with some random bits to allow tablets to split if cardinality is high), something like the type in the column family, and predicate:object in the column qualifier. Or you could use the column family for the predicate. Or, if you primarily search by relationship instead of subject, you could store the type or predicate information in the row, store the subject in the column qualifier, and put the object in the value rather than the key.
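
    The layouts described above could be sketched like this. The `Entry` tuple, the function names, and the `":"` delimiter are assumptions made for illustration. Real IRIs contain `":"`, so an actual schema would need a delimiter that cannot occur in the data.

```python
from collections import namedtuple

# Hypothetical entry shape: three key components plus the value.
Entry = namedtuple("Entry", "row family qualifier value")

SEP = ":"  # assumed delimiter; unsafe for real IRIs, which contain ":"


def triple_in_row(s, p, o):
    # Whole triple in the row; the other components stay empty.
    return Entry(SEP.join((s, p, o)), "", "", "")


def subject_in_row(s, p, o, kind="triple"):
    # Subject in the row, a type in the family, predicate:object in the qualifier.
    return Entry(s, kind, SEP.join((p, o)), "")


def predicate_in_family(s, p, o):
    # Subject in the row, predicate as the column family, object as the qualifier.
    return Entry(s, p, o, "")


def predicate_in_row(s, p, o):
    # Relationship-first: predicate in the row, subject in the qualifier,
    # object in the value instead of the key.
    return Entry(p, "", s, o)


triple = ("alice", "knows", "bob")
for layout in (triple_in_row, subject_in_row, predicate_in_family, predicate_in_row):
    print(layout.__name__, layout(*triple))
```

    Which layout fits depends on the scans you need: the first three group entries by subject, while the last groups them by predicate.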

    Column visibilities are optional, and you could add them if a relationship is private/restricted to specific users. You don't need to specify timestamps at all, unless you want to override the default Accumulo behavior that ensures newly written entries have timestamps that are newer than previously written entries.
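
    To give a feel for what a visibility label does at scan time, here is a simplified Python model of the check. The `visible` function is hypothetical and is not Accumulo's implementation: it handles plain labels, `&`, `|`, and parentheses, evaluates left to right, and does no validation of the expression.

```python
import re


def visible(expression, authorizations):
    """Simplified model: may a reader holding `authorizations` see this entry?"""
    if not expression:
        return True  # an empty visibility is readable by everyone
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)
    pos = 0

    def parse_term():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expr()
            pos += 1  # skip the closing ")"
            return result
        return token in authorizations

    def parse_expr():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            op = tokens[pos]
            pos += 1
            right = parse_term()
            result = (result and right) if op == "&" else (result or right)
        return result

    return parse_expr()


print(visible("", set()))
print(visible("admin&audit", {"admin"}))
print(visible("(admin&audit)|owner", {"owner"}))
```

    A triple written with an empty visibility behaves like ordinary unrestricted data, which matches the point that visibilities are optional.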

    Ultimately, your schema is up to you, and these are just a few ideas. You will need to decide how to structure your data based on your requirements. If you are uncertain, I recommend experimenting with a few options, and finding what works best for you.
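
    For experimenting, a minimal Python sketch along these lines could turn a small RDF/XML file into candidate entries. It only handles flat `rdf:Description` elements with `rdf:about`, and properties that carry either `rdf:resource` or literal text. Real RDF/XML has many more forms (blank nodes, typed nodes, nesting, datatypes), so a proper RDF parser would be the safer choice for real data. The sample document, the `Entry` shape, and the function names are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Made-up sample data for the sketch.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/terms#">
  <rdf:Description rdf:about="http://example.org/alice">
    <ex:knows rdf:resource="http://example.org/bob"/>
    <ex:name>Alice</ex:name>
  </rdf:Description>
</rdf:RDF>
"""

# Hypothetical entry shape; timestamp None means "let the server assign it".
Entry = namedtuple("Entry", "row family qualifier visibility timestamp")


def read_triples(xml_text):
    root = ET.fromstring(xml_text)
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree spells qualified names as "{namespace}local".
            namespace, _, local = prop.tag[1:].partition("}")
            predicate = namespace + local
            obj = prop.get(f"{{{RDF_NS}}}resource") or (prop.text or "").strip()
            yield subject, predicate, obj


def to_entry(triple, visibility=""):
    # One possible layout: subject as row, predicate as family, object as qualifier.
    subject, predicate, obj = triple
    return Entry(subject, predicate, obj, visibility, None)


triples = list(read_triples(SAMPLE))
entries = [to_entry(t) for t in triples]
for entry in entries:
    print(entry)
```

    Swapping `to_entry` for a different layout is then a one-function change, which makes it easy to compare options against the same input.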

    Also, as previously suggested, consider using Apache Rya if you're looking for an existing RDF triple store application that builds on Accumulo.