Search code examples
pythonperformancenumpyunique

Efficiently counting number of unique elements - NumPy / Python


When running np.unique(), it first flattens the array, sorts the array, then finds the unique values. When I have arrays with shape (10, 3000, 3000), it takes about a second to find the uniques, but this quickly adds up as I need to call np.unique() multiple times. Since I only care about the total number of unique numbers in an array, sorting seems like a waste of time.

Is there a faster method of find the total number of unique values in a large array other than np.unique()?


Solution

  • Here's a method that works for an array with dtype np.uint8 that is faster than np.unique.

    First, create an array to work with:

    In [128]: a = np.random.randint(1, 128, size=(10, 3000, 3000)).astype(np.uint8)
    

    For later comparison, find the unique values using np.unique:

    In [129]: u = np.unique(a)
    

    Here's the faster method; v will contain the result:

    In [130]: q = np.zeros(256, dtype=int)
    
    In [131]: q[a.ravel()] = 1
    
    In [132]: v = np.nonzero(q)[0]
    

    Verify that we got the same result:

    In [133]: np.array_equal(u, v)
    Out[133]: True
    

    Timing:

    In [134]: %timeit u = np.unique(a)
    2.86 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [135]: %timeit q = np.zeros(256, dtype=int); q[a.ravel()] = 1; v = np.nonzero(q)
    300 ms ± 5.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    So 2.86 seconds for np.unique(), and 0.3 seconds for the alternative method.