
Consistently determine a "1" or a "2" based on a random 16-character ASCII string in Python


Using Python 3, I'd like to distribute files across two hard drives based on their filenames.

/mnt/disk1/
/mnt/disk2/

All filenames are case-sensitive 16-character ASCII strings plus an extension (e.g. I38A2NPp0OeyMiw9.jpg).

Given a filename, how can I deterministically choose /mnt/disk1 or /mnt/disk2 so that files are split evenly between the two? Ideally the approach would generalize to N paths.


Solution

  • Function to map a string (the filename) to an integer between 1 and n:

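    import hashlib

    def map_dir(s, n=2):
        # Hash the name with SHA-256 so the result is stable across runs and machines
        m = hashlib.sha256(s.encode('utf-8'))
        # Interpret the hex digest as an integer and reduce it to a bucket in 1..n
        return int(m.hexdigest(), 16) % n + 1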
    

    Example:

    >>> map_dir('example.txt')
    1
    
    >>> map_dir('file.csv')
    2
    

    Checking that it works on 100k random strings and 10 buckets:

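    import random, string
    from collections import Counter

    def randfname(N=8):
        # Random test name: N characters drawn from uppercase letters and digits
        return ''.join(random.choices(string.ascii_uppercase + string.digits, k=N))

    # Count how many of 100,000 random names land in each of 10 buckets
    Counter(map_dir(randfname(), n=10) for i in range(100000))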
    

    Output:

    Counter({9: 9994,
             2: 10091,
             10: 10078,
             4: 10014,
             3: 9897,
             6: 10143,
             8: 10021,
             7: 9891,
             1: 9919,
             5: 9952})
    

    Each of the 10 buckets receives roughly 10,000 of the 100,000 names, so the distribution is close to uniform.
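
To get from the bucket number to an actual destination, the result of map_dir can be used as an index into a list of mount points. The following is only a sketch under a few assumptions: the DISK_ROOTS list and the path_for helper are names made up for illustration, the mount points are the two from the question, and paths are POSIX-style.

```python
import hashlib
from pathlib import PurePosixPath

# Assumed mount points, taken from the question; extend the list to use N disks.
DISK_ROOTS = ["/mnt/disk1", "/mnt/disk2"]

def map_dir(s, n=2):
    """Map the string s to a stable bucket number in 1..n using SHA-256."""
    digest = hashlib.sha256(s.encode("utf-8")).hexdigest()
    return int(digest, 16) % n + 1

def path_for(filename, roots=DISK_ROOTS):
    """Hypothetical helper: full path for filename under the root picked by its hash."""
    bucket = map_dir(filename, n=len(roots))  # 1-based bucket number
    return PurePosixPath(roots[bucket - 1]) / filename

# The same filename always resolves to the same disk.
print(path_for("I38A2NPp0OeyMiw9.jpg"))
```

A cryptographic hash is used here rather than the built-in hash(), because hash() on strings is salted per process by default and would not give the same answer on the next run. Note also that with a plain modulo, changing the number of roots reassigns many existing files, so the list is best fixed up front.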