Search code examples
pythonstringbucketdata-partitioning

Grouping strings lexicographically (python)


I have N strings that I want to divide lexicographic into M even-sized buckets (+/- 1 string). Also, N>>M.

The direct way would be to sort all the strings and split the resulting list into the M buckets.

I would like to instead approximate this by routing each string as it is created to a bucket, before the full list is available.

Is there a fast and pythonic way to assign strings to buckets? I'm essentially looking for a string-equivalent of the integer modulo operator. Perhaps a hash that preserves lexicographic order? Is that even possible?


Solution

  • You can sort by first two chars of a string, or something of this sort.

    Let's say that M=100, so you should divide the characters into sqrt(M) regions, and each should point to another sqrt(M) regions, then for each string you get, you can compare the first char to decide which region to direct the string to and again for the second char, something like a tree with buckets as leaves and comparisons as nodes.