I have a dataframe containing 6.3 million records and 111 columns. For this example I've limited the dataframe to 26 columns (A-Z). On this dataframe I am trying to run an analysis in which I take different combinations of the columns (5 columns per combination), subset the dataframe on each combination, count the number of occurrences for that combination, and finally check whether the count exceeds a certain threshold, in which case I store the combination. The code is already optimized with an efficient way of running the individual subsets, using Numba. Still, the overall script takes quite some time (7-8 hours). This is because if you use, for example, 90 columns (my actual number) to make combinations of 5, you get 43,949,268 different combinations. In my case I also use shifted versions of some columns (the value of the day before). So for this example I've limited it to 20 queries (A-J twice, including the shifted versions).
The columns used are stored in a list, which is converted to index numbers because the combinations would otherwise get too big when built from the long strings. The names in the list correspond to a dictionary containing the subset variables.
Here is the full code example:
import pandas as pd
import numpy as np
import numba as nb
import time
from itertools import combinations
# Numba preparation
@nb.njit('int64(bool_[::1],bool_[::1],bool_[::1],bool_[::1],bool_[::1])', parallel=True)
def computeSubsetLength5(cond1, cond2, cond3, cond4, cond5):
    n = len(cond1)
    assert len(cond2) == n and len(cond3) == n and len(cond4) == n and len(cond5) == n
    subsetLength = 0
    for i in nb.prange(n):
        subsetLength += cond1[i] & cond2[i] & cond3[i] & cond4[i] & cond5[i]
    return subsetLength
# Example Dataframe
np.random.seed(101)
bigDF = pd.DataFrame(np.random.randint(0,11,size=(6300000, 26)), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))
# Example query list
queryList = ['A_shift0','B_shift0','C_shift0','D_shift0','E_shift0','F_shift0','G_shift0','H_shift0','I_shift0','J_shift0','A_shift1','B_shift1','C_shift1','D_shift1','E_shift1','F_shift1','G_shift1','H_shift1','I_shift1','J_shift1']
# Convert list to index numbers for creating the combinations
listToNum = list(range(len(queryList)))
# Generate 15504 combinations of the 20 queries without repetition
queryCombinations = combinations(listToNum,5)
# Example query dict
queryDict = {
    'query_A_shift0': ((bigDF.A >= 1) & (bigDF.A < 3)),
    'query_B_shift0': ((bigDF.B >= 3) & (bigDF.B < 5)),
    'query_C_shift0': ((bigDF.C >= 5) & (bigDF.C < 7)),
    'query_D_shift0': ((bigDF.D >= 7) & (bigDF.D < 9)),
    'query_E_shift0': ((bigDF.E >= 9) & (bigDF.E < 11)),
    'query_F_shift0': ((bigDF.F >= 1) & (bigDF.F < 3)),
    'query_G_shift0': ((bigDF.G >= 3) & (bigDF.G < 5)),
    'query_H_shift0': ((bigDF.H >= 5) & (bigDF.H < 7)),
    'query_I_shift0': ((bigDF.I >= 7) & (bigDF.I < 9)),
    'query_J_shift0': ((bigDF.J >= 7) & (bigDF.J < 11)),
    'query_A_shift1': ((bigDF.A.shift(1) >= 1) & (bigDF.A.shift(1) < 3)),
    'query_B_shift1': ((bigDF.B.shift(1) >= 3) & (bigDF.B.shift(1) < 5)),
    'query_C_shift1': ((bigDF.C.shift(1) >= 5) & (bigDF.C.shift(1) < 7)),
    'query_D_shift1': ((bigDF.D.shift(1) >= 7) & (bigDF.D.shift(1) < 9)),
    'query_E_shift1': ((bigDF.E.shift(1) >= 9) & (bigDF.E.shift(1) < 11)),
    'query_F_shift1': ((bigDF.F.shift(1) >= 1) & (bigDF.F.shift(1) < 3)),
    'query_G_shift1': ((bigDF.G.shift(1) >= 3) & (bigDF.G.shift(1) < 5)),
    'query_H_shift1': ((bigDF.H.shift(1) >= 5) & (bigDF.H.shift(1) < 7)),
    'query_I_shift1': ((bigDF.I.shift(1) >= 7) & (bigDF.I.shift(1) < 9)),
    'query_J_shift1': ((bigDF.J.shift(1) >= 7) & (bigDF.J.shift(1) < 11))
}
totalCountDict = {'queryStrings': [],'totalCounts': []}
# Loop through all query combinations and count subset lengths
start = time.time()
for combi in list(queryCombinations):
    tempList = list(combi)
    queryOne = str(queryList[tempList[0]])
    queryTwo = str(queryList[tempList[1]])
    queryThree = str(queryList[tempList[2]])
    queryFour = str(queryList[tempList[3]])
    queryFive = str(queryList[tempList[4]])
    queryString = '-'.join(map(str, tempList))
    count = computeSubsetLength5(queryDict["query_" + queryOne].to_numpy(),
                                 queryDict["query_" + queryTwo].to_numpy(),
                                 queryDict["query_" + queryThree].to_numpy(),
                                 queryDict["query_" + queryFour].to_numpy(),
                                 queryDict["query_" + queryFive].to_numpy())
    if count > 1300:
        totalCountDict['queryStrings'].append(queryString)
        totalCountDict['totalCounts'].append(count)
print(len(totalCountDict['totalCounts']))
stop = time.time()
print("Loop time:", stop - start)
This currently takes about 20 seconds for the 15,504 combinations on my 2020 MacBook Pro (Intel). Any thoughts on how this could be improved? I have tried multiprocessing, but since I am already using Numba for the individual subsets the two did not work well together. Is my approach of using a list, a dictionary and a for loop to subset all the combinations inefficient, or is 7-8 hours realistic for doing 44 million subsets on a dataframe of 6.3 million records?
One solution to speed up this code by a large factor is to pack the bits of the boolean arrays stored in queryDict. Indeed, computeSubsetLength5 is likely memory bound (I thought the speed-up provided in my previous answer would be sufficient for your needs).
Here is the function to pack the bits of a boolean array:
@nb.njit('uint64[::1](bool_[::1])')
def toPackedArray(cond):
    n = len(cond)
    res = np.empty((n+63)//64, dtype=np.uint64)
    for i in range(n//64):
        tmp = np.uint64(0)
        for j in range(64):
            tmp |= nb.types.uint64(cond[i*64+j]) << j
        res[i] = tmp
    # Remainder: pack the last, incomplete 64-bit chunk
    if n % 64 > 0:
        tmp = np.uint64(0)
        for j in range(n - (n % 64), n):
            tmp |= nb.types.uint64(cond[j]) << (j % 64)
        res[len(res)-1] = tmp
    return res
Note that the end of each array is padded with zeros, which does not affect the following computation (this may not hold if you plan to use the packed arrays in another context). This function is called once for each array, like this:
'query_A_shift0': toPackedArray((((bigDF.A >= 1) & (bigDF.A < 3))).to_numpy()),
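Instead of editing every entry by hand, the original (unpacked) queryDict could also be packed in a single pass. This is only a sketch; the name packedQueryDict does not appear in the original code and is used here (and in the sketches further down) purely for illustration:
# Pack every boolean query once, reusing the existing (unpacked) queryDict entries.
# 'packedQueryDict' is a hypothetical name used only for these sketches.
packedQueryDict = {name: toPackedArray(mask.to_numpy()) for name, mask in queryDict.items()}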
Once packed, the arrays can be processed much more efficiently by working directly on 64-bit integers (64 boolean values are handled per integer at once). Here is the resulting code:
# See: https://en.wikipedia.org/wiki/Hamming_weight
@nb.njit('uint64(uint64)', inline='always')
def popcount64c(x):
    m1 = 0x5555555555555555
    m2 = 0x3333333333333333
    m4 = 0x0f0f0f0f0f0f0f0f
    h01 = 0x0101010101010101
    x -= (x >> 1) & m1
    x = (x & m2) + ((x >> 2) & m2)
    x = (x + (x >> 4)) & m4
    return (x * h01) >> 56
# Numba preparation
@nb.njit('uint64(uint64[::1],uint64[::1],uint64[::1],uint64[::1],uint64[::1])', parallel=True)
def computeSubsetLength5(cond1, cond2, cond3, cond4, cond5):
    n = len(cond1)
    assert len(cond2) == n and len(cond3) == n and len(cond4) == n and len(cond5) == n
    subsetLength = 0
    for i in nb.prange(n):
        subsetLength += popcount64c(cond1[i] & cond2[i] & cond3[i] & cond4[i] & cond5[i])
    return subsetLength
popcount64c counts the number of bits set to 1 in each 64-bit chunk.
Here are results on my 6-core i5-9600KF machine:
Reference implementation: 13.41 s
Proposed implementation: 0.38 s
The proposed implementation is 35 times faster than the (already optimized) reference Numba implementation. The reason it is faster than the 8x you would expect from packing alone is that the data should now fit in the last-level cache of your processor, which is often much faster than RAM (about 5 times faster on my machine).
If you want to optimize this code further, you can work on smaller chunks that fit in the L2 cache and use threads in the combinatoric loop rather than in the still memory-bound computeSubsetLength5 function. This should give you a significant speed-up (I expect at least 2x). A sketch of the threading part follows.
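To illustrate just the threading part (not the L2 blocking), here is a rough sketch. It assumes the packed arrays live in the hypothetical packedQueryDict introduced above; the names computeSubsetLength5Serial and countOneCombination are also invented for this example:
from concurrent.futures import ThreadPoolExecutor
# Serial kernel: the prange parallelism is removed so that the parallelism can move
# to the combination loop instead. nogil=True lets several Python threads run this
# compiled function concurrently.
@nb.njit('uint64(uint64[::1],uint64[::1],uint64[::1],uint64[::1],uint64[::1])', nogil=True)
def computeSubsetLength5Serial(cond1, cond2, cond3, cond4, cond5):
    subsetLength = 0
    for i in range(len(cond1)):
        subsetLength += popcount64c(cond1[i] & cond2[i] & cond3[i] & cond4[i] & cond5[i])
    return subsetLength
def countOneCombination(combi):
    # packedQueryDict is the hypothetical dict of packed arrays sketched earlier
    arrs = [packedQueryDict["query_" + queryList[i]] for i in combi]
    return combi, computeSubsetLength5Serial(arrs[0], arrs[1], arrs[2], arrs[3], arrs[4])
with ThreadPoolExecutor() as pool:
    for combi, count in pool.map(countOneCombination, combinations(listToNum, 5)):
        if count > 1300:
            totalCountDict['queryStrings'].append('-'.join(map(str, combi)))
            totalCountDict['totalCounts'].append(count)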
The biggest optimisation to apply then probably comes from the overall algorithm. The same logical AND operations are computed over and over. Pre-computing some of them on the fly and reusing the most useful ones should significantly speed the algorithm up (I expect another 2x); a rough sketch follows.
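One way to exploit this: combinations() emits tuples grouped by their first two indices, so the AND of those two packed arrays can be computed once and reused for every combination sharing that pair. The helpers packedAnd and computeSubsetLength4 below are invented for this sketch, which again assumes the packed arrays are stored in packedQueryDict:
# AND two packed arrays once so the result can be reused across combinations.
@nb.njit('uint64[::1](uint64[::1],uint64[::1])')
def packedAnd(a, b):
    res = np.empty(len(a), dtype=np.uint64)
    for i in range(len(a)):
        res[i] = a[i] & b[i]
    return res
# Same kernel as above, but taking the precomputed pair plus the 3 remaining arrays.
@nb.njit('uint64(uint64[::1],uint64[::1],uint64[::1],uint64[::1])', parallel=True)
def computeSubsetLength4(pairCond, cond3, cond4, cond5):
    subsetLength = 0
    for i in nb.prange(len(pairCond)):
        subsetLength += popcount64c(pairCond[i] & cond3[i] & cond4[i] & cond5[i])
    return subsetLength
currentPair, pairCond = None, None
for combi in combinations(listToNum, 5):
    if combi[:2] != currentPair:
        currentPair = combi[:2]
        pairCond = packedAnd(packedQueryDict["query_" + queryList[combi[0]]],
                             packedQueryDict["query_" + queryList[combi[1]]])
    count = computeSubsetLength4(pairCond,
                                 packedQueryDict["query_" + queryList[combi[2]]],
                                 packedQueryDict["query_" + queryList[combi[3]]],
                                 packedQueryDict["query_" + queryList[combi[4]]])
    if count > 1300:
        totalCountDict['queryStrings'].append('-'.join(map(str, combi)))
        totalCountDict['totalCounts'].append(count)
This removes one of the four AND operations from the inner loop at the cost of a much smaller number of pair precomputations, and the same idea extends to precomputed triples if memory allows.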
I am pretty sure there are many other optimisations that can be applied to the overall algorithm. A brute-force search is often sufficient to solve a problem, but it is hardly a requirement.