For partitioning RDF triples by subject, I use String.hashCode() of the subject and put each triple in the corresponding partition. The goal is to be able to process the partitioned files in memory (processing one large file may not be possible).
Now, in order to have a restricted number of partitions, I do the following. Assuming we want 10 partitions out of a large RDF file:
String subject;
// hashCode() can be negative, so clear the sign bit before dividing;
// the "+ 1" keeps the index in the range 0..9 rather than 0..10
int partition = (subject.hashCode() & Integer.MAX_VALUE) / (Integer.MAX_VALUE / 10 + 1);
Therefore all triples with the same subject end up in one partition, and overall we get 10 partitions.
The problem is that when subjects are unevenly distributed, this can produce very large or very small partitions, which is undesirable.
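As a sketch of the scheme described above (the class name and the mask-and-mod variant of the index calculation are my own, not from the question), subject-based partitioning might look like:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SubjectPartitioner {
    private final int numPartitions;

    public SubjectPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Map a subject to a partition index in [0, numPartitions).
    // hashCode() may be negative, so mask off the sign bit before taking the remainder.
    public int partitionOf(String subject) {
        return (subject.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Group triples (given as {subject, predicate, object} arrays) by partition index.
    public Map<Integer, List<String>> partition(List<String[]> triples) {
        Map<Integer, List<String>> parts = new HashMap<>();
        for (String[] t : triples) {
            parts.computeIfAbsent(partitionOf(t[0]), k -> new ArrayList<>())
                 .add(String.join(" ", t));
        }
        return parts;
    }
}
```

Since the index depends only on the subject, all triples sharing a subject land in the same partition, but nothing here bounds how large any single partition can grow, which is exactly the skew problem described.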
Does anybody have any suggestion?
Thank you in advance.
Algorithm: if you don't care about keeping same-subject triples within a single partition, just create ten buckets and fill them round-robin.
Pros: O(n), and the partitions are as balanced as possible.
Cons: triples with the same subject are scattered across partitions.
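A minimal sketch of the round-robin approach (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class RoundRobinPartitioner {
    // Distribute items into k buckets in arrival order;
    // bucket sizes differ by at most one, regardless of subject distribution.
    public static List<List<String>> partition(List<String> triples, int k) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int i = 0; i < triples.size(); i++) {
            buckets.get(i % k).add(triples.get(i));
        }
        return buckets;
    }
}
```

The trade-off is the one stated above: balance is perfect by construction, but a query that needs all triples for one subject must now scan every partition.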