Search code examples
pythonnumpyoptimizationuniquerows

Removing non-triple rows en masse in numpy array


I have an array A that has 1 million rows and 3 columns. In the last column there are unique integers that help identify data in the other two columns. I would only like to keep data that has three of the same unique integer occurrences, and delete all other rows that have other amounts of unique integer occurrences (i.e. for unique integers that are only appearing once, twice, or four times for example). Below is a function remove_loose_ends that I wrote to handle this. However, this function is being called many times and is the bottleneck of the entire program. Are there any possible enhancements that could remove the loop from this operation or decrease its runtime in other ways?

import numpy as np
import time


def remove_loose_ends(A):
    # get unique counts
    unique_id, unique_counter = np.unique(A[:, 2], return_counts=True)
    # initialize outgoing indice mask
    good_index = np.array([[True] * (A.shape[0])])
    # loop through all indices and flip them to false if they match the not triplet entries
    for i in range(0, len(unique_id)):
        if unique_counter[i] != 3:
            good_index = good_index ^ (A[:, 2] == unique_id[i])
    # return incoming array with mask applied
    return A[np.squeeze(good_index), :]

# example array A
A = np.random.rand(1000000,3)
# making last column "unique" integers
A[:,2] = (A[:,2] * 1e6).astype(np.int)

# timing function call
start = time.time()
B = remove_loose_ends(A)
print(time.time() - start)


Solution

  • So, the main problem is that you loop over all the values twice basically, making it roughly an operation. What you could do instead, is create an array of booleans directly from the output of the numpy.unique function to do the indexing for you.

    For example, something like this:

    import numpy as np
    import time
    
    
    def remove_loose_ends(A):
        # get unique counts
        _, unique_inverse, unique_counter = np.unique(A[:, 2], return_inverse=True, return_counts=True)
        # Obtain boolean array of which integers occurred 3 times
        idx = unique_counter == 3
    
        # Obtain boolean array of which rows have integers that occurred 3 times
        row_idx = idx[unique_inverse]
    
        # return incoming array with mask applied
        return A[row_idx, :]
    
    # example array A
    A = np.random.rand(1000000,3)
    # making last column "unique" integers
    A[:,2] = (A[:,2] * 1e6).astype(np.int)
    
    # timing function call
    start = time.time()
    B = remove_loose_ends(A)
    print(time.time() - start)
    

    I tried timing both versions. The function you posted I stopped after 15 minutes, whereas the one I give takes around 0.15s on my PC.