Search code examples
javahashrdfhashcodetriples

How to partition RDF triples in Java based on subject


For partitioning RDF triples by subject, I use String.hashCode() of the subject and put the triple in the corresponding partition. The goal is to be able to process the partitioned files in memory(processing large file may not be possible).

Now in order to have a restricted number of partitions, i do the following. Assuming that we want to have 10 partitions, out of a large RDF file:

    String subject; 
    partitionFileName = subject.hashCode / (Integer.MAX_VALUE/10)

Therefore all the triples with the same subjects will be in one partition and overall we will have 10 partitions.

Now the problem is, when the triples have different distributions, it can lead to very big or very small partitions which are undesired.

Does anybody have any suggestion?

Thank you in advance.


Solution

  • Algorithm:

    • Create a partition for each subject (this can be done on the fly during RDF processing)
    • For each triple, assign it to a partition according to the subject and memoize the subject-partition mapping
    • While number of partitions > 10, merge the two smallest partitions and update the map

    Pros:

    • Ensures all triples with the same subject are in the same partition
    • Balanced as long as your data isn't terribly lopsided
    • You don't have to use the hashcode if you don't want to

    Cons:

    • Additional processing time, though not an onerous amount; this is O(n * m) where n is the number of triples and m is the number of distinct subjects
    • Disparate partition sizes if your data is greatly uneven, but this is unavoidable given that you want all triples with the same subject in the same partition
    • You have to maintain the mapping to perform lookups against, but this is ultimately trivial to do and a constant time operation

    If you don't care about keeping same-subject triples within a single partition, then just create ten buckets and fill them round-robin. O(n) and as balanced as possible.