Search code examples
pythonarraysperformancenumpyscientific-computing

Fast Numpy calculation of a slight variation of the repeat and choose functions


I would like to use numpy to solve a problem that is very similar, but not quite identical, to the problem solved by the numpy.repeat function. I do not see how to solve this problem using any of the numpy functions I am familiar with, so I'm looking for help to see whether this can be done with numpy. My arrays are large (>1e6 elements), and high-performance is critical, so I cannot afford the performance hit of a python for loop.

Minimal example

I have a length-num_pts sorted integer array objID that stores (possibly repeated) object identifiers.

objID = np.array([0, 0, 5, 5, 5, 7, 8, 8])

I determine the unique entries of objID, and the indices of their appearance in objID, using numpy.unique.

unique_objIDs, idx_unique_objIDs = np.unique(objID, return_index=True)
num_unique_objIDs = len(unique_objIDs)

I also have a length-num_unique_objIDs array occupations that specifies how many times I want to select each entry of unique_objIDs from objID.

occupations = np.array([0, 2, 1, 2])

I want to determine the array of indices that I can use to retrieve elements of objID according to occupations. I give a specific example below.

desired_array_of_indices = np.array([2, 3, 5, 6, 7])

The array desired_array_of_indices is what I want to use numpy to calculate. The entries of desired_array_of_indices are calculated as follows.

Explicit explanation of desired_array_of_indices

Element-i of the occupations array specifies how many times unique_objID[i] will be selected. The desired_array_of_indices array stores the indices of objID of these selections. For values of objID that are selected more than once, successive indices of objID are chosen, so that no indices stored in desired_array_of_indices are repeated.

For concreteness, consider the first element of occupations. The value is zero, telling us that we do not want to select any indices of objID that store unique_objIDs[0]=0, so all such indices are left out of desired_array_of_indices.

The next element of occupations is 2, telling us that we want to select the indices of the first 2 appearances of unique_objIDs[1]=5 in objID. This is why the first two entries of desired_array_of_indices are 2 and 3.

The next element of occupations is 1, telling us that we want to select the index of the first appearance of unique_objIDs[2]=7 in objID. So the next entry of desired_array_of_indices is 5.

The last element of occupations is 2, telling us that we want to select the indices of the first 2 appearances of unique_objIDs[3]=8 in objID. This is why the last two entries of desired_array_of_indices are 6 and 7.

Difference from np.repeat

Note the subtle difference between this calculation and numpy.repeat. For numpy.repeat, the returned indices pertain to the array of unique entries, unique_objIDs. Here I need the indices of objID, and I also need to select successive indices for cases of repeated entries. Each entry of occupations can be assumed to be less than or equal to the total number of times the corresponding entry appears in objID, so there is no danger of an indexing error.

Does anyone see how to formulate this problem in terms of (possibly some collection of) available vectorized Numpy functions?


Solution

  • Here's one way.

    First, your example code:

    In [102]: objID = np.array([0, 0, 5, 5, 5, 7, 8, 8])
    
    In [103]: unique_objIDs, idx_unique_objIDs = np.unique(objID, return_index=True)
    

    [[Note: unique() sorts its argument. You know your input is already sorted, so a more efficient way to get idx_unique_objIDs is:

    idx_unique_objIDs = np.concatenate(([0], np.nonzero(np.diff(objID))[0] + 1))
    

    This operation is O(n) instead of the O(n*log(n)) required by unique. Then you can use

    unique_objIDs = objID[idx_unique_objIDs]
    

    if you need the array of the unique object IDs.]]

    In [104]: occupations = np.array([0, 2, 1, 2])
    

    Now find the desired indices. The result is in line Out[107]:

    In [105]: csum = occupations.cumsum()
    
    In [106]: n = csum[-1]
    
    In [107]: np.arange(n) + np.repeat(idx_unique_objIDs - csum + occupations, occupations)
    Out[107]: array([2, 3, 5, 6, 7])
    

    A closer look:

    csum is the cumulative sum of occupations, and n is the total of occupations:

    In [114]: csum
    Out[114]: array([0, 2, 3, 5])
    
    In [115]: n
    Out[115]: 5
    

    csum can be interpreted as the index of the end of the range of indices (pythonic "end", that is) associated with each occupation. Then csum - occupations holds the indices of the beginnings of the ranges:

    In [116]: csum - occupations
    Out[116]: array([0, 0, 2, 3])
    

    Repeat those starting indices according to the values in occupations:

    In [117]: np.repeat(csum - occupations, occupations)
    Out[117]: array([0, 0, 2, 3, 3])
    

    If that is subtracted from np.arange(n), we have, for each occupation k, a range from 0 to occupation[k]-1 concatenated in an array:

    In [118]: np.arange(n) - np.repeat(csum - occupations, occupations)
    Out[118]: array([0, 1, 0, 0, 1])
    

    That's not quite the desired result. We have to add the (repeated) idx_unique_objIDs so that the values are indices into the array objID:

    In [119]: np.arange(n) - np.repeat(csum - occupations, occupations) + np.repeat(idx_unique_objIDs, occupations)
    Out[119]: array([2, 3, 5, 6, 7])
    

    Now combine those two repeat() calls to get the final expression:

    In [120]: np.arange(n) + np.repeat(idx_unique_objIDs - csum + occupations, occupations)
    Out[120]: array([2, 3, 5, 6, 7])