Search code examples
pythonpandasmatrixsparse-matrix

Pandas: How to summarize pandas cross table / frequency matrix


I'd like to summarize the results of a cross table/Frequency Matrix which looks at frequency of users having the same session, results is 4044 rows × 4044 columns

UserID  10  50  30  2488  9416 23197            ... 
UserID                                                                                  
10      4   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
50      0   48  2   9   4   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
30      0   2   2   2   2   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
2488    0   9   2   32  4   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
9416    0   4   2   4   4   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   

Is there a way to summarize it to have user-to-user number of matches, something like:

UserID  UserID Occurance
10       50      2
30       2488    5
23197    10      3
30       50      1

Solution

  • Solution:

    1. Create a boolean mask that selects the upper triangular, non-diagonal values from the starting matrix
    2. Reshape both this mask (with .reshape()) and the original matrix (with .stack()) into column vectors of equal size
    3. Use boolean indexing to pick out the rows that you need.

    Example:

    import pandas as pd
    import numpy as np
    
    np.random.seed(1)
    
    # Example data
    df = pd.DataFrame(pd.np.random.randint(0, 4, size=(5, 5)), 
                      index=[10, 50, 30, 2488, 9416], 
                      columns=[10, 50, 30, 2488, 9416])
    
    # Quick and dirty method to make the example data symmetric
    df = df + df.T
    
    df
          10    50    30    2488  9416
    10       2     4     0     0     5
    50       4     6     2     5     1
    30       0     2     0     4     3
    2488     0     5     4     4     0
    9416     5     1     3     0     6
    
    # To select the upper-triangular, non-diagonal entries,
    # take a *lower*-triangular mask, np.tril, 
    # and negate it with ~.
    mask = (~np.tril(np.ones(df.shape)).astype('bool'))
    mask
    array([[False,  True,  True,  True,  True],
           [False, False,  True,  True,  True],
           [False, False, False,  True,  True],
           [False, False, False, False,  True],
           [False, False, False, False, False]])
    
    # Prepare to select rows from the stacked df
    mask = mask.reshape(df.size)
    
    # Stack the columns of the starting matrix into a MultiIndex, 
    # which results in a MultiIndexed Series;
    # select the upper-triangular off-diagonal rows;
    # reset the MultiIndex levels into columns
    df.stack()[mask].reset_index().rename({'level_0': 'UserID_row', 
                                           'level_1': 'UserID_col', 
                                           0: 'Occurrence'}, axis=1)
       UserID_row  UserID_col  Occurrence
    0          10          50           4
    1          10          30           0
    2          10        2488           0
    3          10        9416           5
    4          50          30           2
    5          50        2488           5
    6          50        9416           1
    7          30        2488           4
    8          30        9416           3
    9        2488        9416           0