Tags: python, algorithm, shuffle, approximation

How to shuffle a 2d binary matrix, preserving marginal distributions


Suppose I have an (n*m) binary matrix df similar to the following:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.binomial(1, .3, size=(6,8)))

    0   1   2   3   4   5   6   7
   ------------------------------
0 | 0   0   0   0   0   1   1   0
1 | 0   1   0   0   0   0   0   0
2 | 0   0   0   0   1   0   0   0
3 | 0   0   0   0   0   1   0   1
4 | 0   1   1   0   1   0   0   0
5 | 1   0   1   1   1   0   0   1

I want to shuffle the values in the matrix to create a new_df of the same shape, such that both marginal distributions are preserved, for example:

    0   1   2   3   4   5   6   7
   ------------------------------
0 | 0   0   0   0   1   0   0   1
1 | 0   0   0   0   1   0   0   0
2 | 0   0   0   0   0   0   0   1
3 | 0   1   1   0   0   0   0   0
4 | 1   0   0   0   1   1   0   0
5 | 0   1   1   1   0   1   1   0

In the new matrix, the sum of each row is equal to the sum of the corresponding row in the original matrix, and likewise, columns in the new matrix have the same sum as the corresponding column in the original matrix.

The solution is pretty easy to check:

# rows have the same marginal distribution
assert(all(df.sum(axis=1) == new_df.sum(axis=1)))  

# columns have the same marginal distribution
assert(all(df.sum(axis=0) == new_df.sum(axis=0)))

If n*m is small, I can use a brute-force approach to the shuffle:

def shuffle_2d(df):
    """Shuffles a 2-D binary DataFrame, preserving both marginal distributions"""
    # get a list of (row, col) indices where the df is 1
    rowlist = []
    collist = []
    for i_row, row in df.iterrows():
        for i_col, val in row.items():  # .iteritems() was removed in pandas 2.0
            if val == 1:
                rowlist.append(i_row)
                collist.append(i_col)

    # create an empty df of the same shape
    new_df = pd.DataFrame(index=df.index, columns=df.columns, data=0)

    # shuffle until you get no repeat coordinates,
    # so you don't increment the same cell in the matrix twice
    repeats = 999
    while repeats > 1:
        pairs = list(zip(np.random.permutation(rowlist),
                         np.random.permutation(collist)))
        repeats = pd.Series(pairs).value_counts().max()  # pd.value_counts(...) is deprecated

    # populate the new data frame at the indicated points
    for i_row, i_col in pairs:
        new_df.at[i_row, i_col] += 1

    return new_df

The problem is that the brute force approach scales poorly. (As in that line from Indiana Jones and the Last Crusade: https://youtu.be/Ubw5N8iVDHI?t=3)

As a quick demo, here is the number of attempts an n*n matrix needed to get an acceptable shuffle in one run:

n   attempts
2   1
3   2
4   4
5   1
6   1
7   11
8   9
9   22
10  4416
11  800
12  66
13  234
14  5329
15  26501
16  27555
17  5932
18  668902
...
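The counts above came from a single run; they can be collected by counting iterations of the brute-force while-loop. A sketch (the `count_attempts` helper is my own wrapper, not part of the code above):

```python
import numpy as np
import pandas as pd

def count_attempts(df):
    """Count how many permutations the brute-force shuffle needs
    before every (row, col) coordinate pair is unique."""
    rowlist, collist = np.nonzero(df.values)  # indices where df is 1
    attempts = 0
    repeats = 999
    while repeats > 1:
        attempts += 1
        pairs = list(zip(np.random.permutation(rowlist),
                         np.random.permutation(collist)))
        repeats = pd.Series(pairs).value_counts().max()
    return attempts

for n in range(2, 10):
    df = pd.DataFrame(np.random.binomial(1, .3, size=(n, n)))
    print(n, count_attempts(df))
```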

Is there a straightforward solution that preserves the exact marginal distributions (or tells you when no other pattern preserving them is possible)?

As a fallback, I could also use an approximation algorithm that could minimize the sum of squared errors on each row.

Thanks! =)


EDIT: For some reason I wasn't finding existing answers before I wrote this question, but after posting it they all show up in the sidebar:

Is it possible to shuffle a 2D matrix while preserving row AND column frequencies?

Randomize matrix in perl, keeping row and column totals the same

Sometimes all you need to do is ask...


Solution

  • Thanks mostly to https://stackoverflow.com/a/2137012/6361632 for inspiration, here's a solution that seems to work:

    
    def flip1(m):
        """
        Chooses a single (i0, j0) location in the matrix to 'flip'
        Then randomly selects a different (i, j) location that creates
        a quad [(i0, j0), (i0, j), (i, j0), (i, j)] in which flipping every
        element leaves the marginal distributions unaltered.  
        Changes those elements, and returns 1.
    
        If such a quad cannot be completed from the original position, 
        does nothing and returns 0.
        """
        i0 = np.random.randint(m.shape[0])
        j0 = np.random.randint(m.shape[1])
    
        level = m[i0, j0]
        flip = 0 if level == 1 else 1  # the opposite value
    
        for i in np.random.permutation(range(m.shape[0])):  # try in random order
            if (i != i0 and  # don't swap with self
                m[i, j0] != level):  # maybe swap with a cell that holds opposite value
                for j in np.random.permutation(range(m.shape[1])):
                    if (j != j0 and  # don't swap with self
                        m[i, j] == level and  # check that other swaps work
                        m[i0, j] != level):
                        # make the swaps
                        m[i0, j0] = flip
                        m[i0, j] = level
                        m[i, j0] = level
                        m[i, j] = flip
                        return 1
    
        return 0
    
    def shuffle(m1, n=100):
        m2 = m1.copy()
        f_success = np.mean([flip1(m2) for _ in range(n)])
    
        # f_success is the fraction of flip attempts that succeed, for diagnostics
        #print(f_success)
    
        # check the answer
        assert(all(m1.sum(axis=1) == m2.sum(axis=1)))
        assert(all(m1.sum(axis=0) == m2.sum(axis=0)))
    
        return m2
    

    We can call this as:

    m1 = np.random.binomial(1, .3, size=(6,8))
    
    array([[0, 0, 0, 1, 1, 0, 0, 1],
           [1, 0, 0, 0, 0, 0, 1, 0],
           [0, 0, 0, 1, 0, 1, 0, 1],
           [1, 1, 0, 0, 0, 1, 0, 1],
           [0, 0, 0, 0, 0, 1, 0, 0],
           [1, 0, 1, 0, 1, 0, 0, 0]])
    
    m2 = shuffle(m1)
    
    array([[0, 0, 0, 0, 1, 1, 0, 1],
           [1, 0, 0, 0, 0, 1, 0, 0],
           [0, 0, 0, 1, 0, 0, 1, 1],
           [1, 1, 1, 0, 1, 0, 0, 0],
           [0, 0, 0, 0, 0, 1, 0, 0],
           [1, 0, 0, 1, 0, 0, 0, 1]])
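The quad-flip invariant that flip1 relies on can be checked directly on a tiny example: flipping the four corners of a rectangle holding the pattern [[1, 0], [0, 1]] changes individual cells but leaves every row and column sum intact (the 3x3 matrix below is an illustration of mine, not output from the run above):

```python
import numpy as np

m = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0]])

quad = np.ix_([0, 1], [0, 1])   # rows 0,1 x cols 0,1 hold [[1, 0], [0, 1]]
flipped = m.copy()
flipped[quad] = 1 - flipped[quad]  # flip all four corners of the quad

# exactly four cells changed, yet both marginals are untouched
assert (m != flipped).sum() == 4
assert (m.sum(axis=0) == flipped.sum(axis=0)).all()
assert (m.sum(axis=1) == flipped.sum(axis=1)).all()
```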
    

    How many iterations do we need to get to a steady-state distribution? I've set a default of 100 here, which is sufficient for these small matrices.

    Below I plot the distribution of correlations between the original and shuffled matrices over 500 runs, for various numbers of iterations.

    import matplotlib.pyplot as plt

    for n_iters in (1, 5, 25, 100, 250):  # iteration counts to compare
        corrs = []
        for _ in range(500):
            m1 = np.random.binomial(1, .3, size=(9,9)) # create starting matrix
            m2 = shuffle(m1, n_iters)
            corrs.append(np.corrcoef(m1.flatten(), m2.flatten())[1,0])
        plt.hist(corrs, bins=40, alpha=.4, label=n_iters)
    plt.legend()
    

    [figure: correlation distributions for a 9x9 starting matrix]

    For a 9x9 matrix, we see improvements up until about 25 iterations, beyond which we're in a steady state.

    [figure: correlation distributions for an 18x18 starting matrix]

    For an 18x18 matrix, we see small gains going from 100 to 250 iterations, but not much beyond.

    Note that the correlation between the starting and ending distributions is lower for larger matrices, but it takes more iterations to get there.
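    Rather than eyeballing the histograms, the steady state can also be estimated numerically by comparing mean correlations across iteration counts. A self-contained sketch (the iteration grid and the 50 repeats per setting are arbitrary choices of mine; flip1 and shuffle are condensed from the solution above):

```python
import numpy as np

def flip1(m):
    """Attempt one marginal-preserving quad flip; return 1 on success, 0 otherwise."""
    i0 = np.random.randint(m.shape[0])
    j0 = np.random.randint(m.shape[1])
    level = m[i0, j0]
    flip = 1 - level  # the opposite value
    for i in np.random.permutation(m.shape[0]):
        if i != i0 and m[i, j0] != level:
            for j in np.random.permutation(m.shape[1]):
                if j != j0 and m[i, j] == level and m[i0, j] != level:
                    m[i0, j0], m[i0, j], m[i, j0], m[i, j] = flip, level, level, flip
                    return 1
    return 0

def shuffle(m1, n=100):
    m2 = m1.copy()
    for _ in range(n):
        flip1(m2)
    return m2

m1 = np.random.binomial(1, .3, size=(9, 9))
for n_iters in (1, 5, 25, 100):  # arbitrary grid of iteration counts
    corrs = [np.corrcoef(m1.flatten(), shuffle(m1, n_iters).flatten())[1, 0]
             for _ in range(50)]
    print(n_iters, round(np.mean(corrs), 3))
```

    The mean correlation should stop shrinking once n passes the mixing point, matching what the histograms show.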