Search code examples
multithreadingalgorithmperformanceparallel-processingshuffle

Parallel Computing - Shuffle


I am looking to shuffle an array in parallel. I have found that doing an algorithm similar to bitonic sort but with a random (50/50) re-order results in an equal distribution but only if the array is a power of 2. I've considered the Yates Fisher Shuffle but I can't see how I could parallel-ize it in order to avoid O(N) computations.

Any advice?

Thanks!


Solution

  • There's a good clear recent paper on this here and the references, especially Shun et al 2015 are worth a read.

    But basically you can do this using the same sort of approach that's used in sort -R: shuffle by giving each row a random key value and sorting on that key. And there are lots of ways to do good parallel distributed sort.

    Here's a basic version in python + MPI using an odd-even sort; it goes through P communication steps if P is the number of processors. You can do better than that, but this is pretty simple to understand; it's discussed in this question.

    from __future__ import print_function
    import sys
    import random
    from mpi4py import MPI
    
    comm = MPI.COMM_WORLD
    
    def exchange(localdata, sendrank, recvrank):
        """
        Perform a merge-exchange with a neighbour;
        sendrank sends local data to recvrank,
        which merge-sorts it, and then sends lower
        data back to the lower-ranked process and
        keeps upper data
        """
        rank = comm.Get_rank()
        assert rank == sendrank or rank == recvrank
        assert sendrank < recvrank
    
        if rank == sendrank:
            comm.send(localdata, dest=recvrank)
            newdata = comm.recv(source=recvrank)
        else:
            bothdata = list(localdata)
            otherdata = comm.recv(source=sendrank)
            bothdata = bothdata + otherdata
            bothdata.sort()
            comm.send(bothdata[:len(otherdata)], dest=sendrank)
            newdata = bothdata[len(otherdata):]
        return newdata
    
    def print_by_rank(data, rank, nprocs):
        """ crudely attempt to print data coherently """
        for proc in range(nprocs):
            if proc == rank:
                print(str(rank)+": "+str(data))
                comm.barrier()
        return
    
    def odd_even_sort(data):
        rank = comm.Get_rank()
        nprocs = comm.Get_size()
        data.sort()
        for step in range(1, nprocs+1):
            if ((rank + step) % 2) == 0:
                if rank < nprocs - 1:
                    data = exchange(data, rank, rank+1)
            elif rank > 0:
                data = exchange(data, rank-1, rank)
        return data
    
    def main():
        # everyone get their data
        rank = comm.Get_rank()
        nprocs = comm.Get_size()
        n_per_proc = 5
        data = list(range(n_per_proc*rank, n_per_proc*(rank+1)))
    
        if rank == 0:
            print("Original:")
        print_by_rank(data, rank, nprocs)
    
        # tag your data with random values
        data = [(random.random(), item) for item in data]
    
        # now sort it by these random tags
        data = odd_even_sort(data)
    
        if rank == 0:
            print("Shuffled:")
        print_by_rank([x for _, x in data], rank, nprocs)
    
        return 0
    
    
    if __name__ == "__main__":
        sys.exit(main())
    

    Running gives:

    $ mpirun -np 5 python mergesort_shuffle.py
    Original:
    0: [0, 1, 2, 3, 4]
    1: [5, 6, 7, 8, 9]
    2: [10, 11, 12, 13, 14]
    3: [15, 16, 17, 18, 19]
    4: [20, 21, 22, 23, 24]
    
    Shuffled:
    0: [19, 17, 4, 20, 9]
    1: [23, 12, 3, 2, 8]
    2: [14, 6, 13, 15, 1]
    3: [11, 0, 22, 16, 18]
    4: [5, 10, 21, 7, 24]