Tags: pandas, sparse-matrix, boolean-operations

fast row-wise boolean operations in sparse matrices


I have a dataframe with ~4.4M purchase orders. I'm interested in a column that indicates the presence of certain items in each purchase order. It is structured like this:

df['item_arr'].head()
1   [a1, a2, a5]
2   [b1, b2, c3...
3   [b3]
4   [a2] 

There are 4k different items, and each row always contains at least one. I have generated another 4.4M x 4k dataframe, df_te_sub, with a sparse structure indicating the same information as booleans, i.e.

c = df_te_sub.columns[[10, 20, 30]]
df_te_sub[c].head()
>>a10   b8  c1
0   0   0   0
1   0   0   0
2   0   0   0
3   0   0   0
4   0   0   0

The names of the columns are not important, although they are in alphabetical order, for what it's worth.
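For reference, one way such a sparse indicator frame can be built from the item lists is scikit-learn's MultiLabelBinarizer; this is a sketch of an assumed construction, since the question does not show how df_te_sub was actually generated:

from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

# Hypothetical construction of df_te_sub from the item lists
mlb = MultiLabelBinarizer(sparse_output=True)
onehot = mlb.fit_transform(df['item_arr'])          # scipy CSR matrix, shape (n_rows, n_items)
df_te_sub = pd.DataFrame.sparse.from_spmatrix(
    onehot, index=df.index, columns=mlb.classes_)   # columns come out in sorted order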

Given a subset g of items, I am trying to extract the orders (rows) for two different cases:

  1. At least one of the items is present in the row

  2. The items present in the row are all contained in g (i.e. the row is a subset of g)
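As a toy illustration of the two rules (made-up rows, not the real data), with g = {'a1', 'a2', 'a5'}:

g = {'a1', 'a2', 'a5'}
toy_rows = [['a1', 'a2', 'a5'], ['a1', 'b3'], ['b3']]

for items in toy_rows:
    rule1 = any(item in g for item in items)   # at least one item of the row is in g
    rule2 = set(items) <= g                    # every item of the row is in g
    print(items, rule1, rule2)

# ['a1', 'a2', 'a5'] True True
# ['a1', 'b3'] True False
# ['b3'] False False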

For the first rule, the best way I found was to do:

c = df_te_sub[g].sparse.to_coo()
rows = pd.unique(c.row)

The second rule presented a challenge. I tried different things but they are all slow:

# using set
s = set(g)
df['item_arr'].apply(s.issuperset)

# using the "non selected items"
c = df_te_sub[df_te_sub.columns[~df_te_sub.columns.isin(g)]].sparse.to_coo()
x = np.ones(len(df_te_sub), dtype='bool')
x[c.row] = False

# mix
s = set(g)
c = df_te_sub[g].sparse.to_coo()
rows = pd.unique(c.row)
df['item_arr'].iloc[rows].apply(s.issuperset)

Any ideas to improve performance? I need to do this for several subsets.

The output can be given either as row indices (e.g. [0, 2, 3]) or as a boolean mask (e.g. True False True True ...), as both will work to slice the order dataframe.
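For completeness, converting between the two output forms is cheap. A minimal sketch, assuming rows holds positional row indices as returned by pd.unique(c.row):

import numpy as np

mask = np.zeros(len(df_te_sub), dtype=bool)
mask[rows] = True                  # row indices -> boolean mask

rows_back = np.flatnonzero(mask)   # boolean mask -> row indices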


Solution

  • I feel like you're overthinking this. If you have a boolean array of membership, you've already done 90% of the work.

    from scipy.sparse import csc_matrix
    
    # Turn your sparse dataframe into a scipy CSC matrix
    arr = csc_matrix(df_te_sub.sparse.to_coo())
    
    # Get the number of items per row
    row_len = arr.sum(axis=1).A.flatten()
    
    # Get the column indices of the items in g
    arr_col_idx = [df_te_sub.columns.get_loc(g_val) for g_val in g]
    
    # Sum the number of items in g in the slice per row
    arr_g = arr[:, arr_col_idx].sum(axis=1).A.flatten()
    
    # Find all the rows with at least one thing in g
    arr_one_g = arr_g > 0
    
    # Find all the rows whose items are all contained in g
    # This assumes row_len is always greater than 0; if it isn't, add a test for that
    arr_subset_g = (row_len - arr_g) == 0
    

    arr_one_g and arr_subset_g are 1d boolean arrays that should index the rows you want.
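
    Since the question mentions repeating this for several subsets, note that arr and row_len only need to be computed once; per subset, only the column slice changes. A sketch of how the masks might be used and the per-subset step wrapped up (the helper masks_for is hypothetical, not part of the answer):

    # Slice the original orders with either mask
    orders_any_g    = df[arr_one_g]       # rows with at least one item from g
    orders_subset_g = df[arr_subset_g]    # rows whose items are all in g

    # Hypothetical helper to repeat the computation for several subsets,
    # reusing arr and row_len, which do not depend on g
    def masks_for(g, arr, row_len, columns):
        col_idx = [columns.get_loc(item) for item in g]
        in_g = arr[:, col_idx].sum(axis=1).A.flatten()
        return in_g > 0, (row_len - in_g) == 0

    one_g, subset_g = masks_for(g, arr, row_len, df_te_sub.columns)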