I would like to use numpy to solve a problem that is very similar, but not quite identical, to the problem solved by the numpy.repeat function. I do not see how to solve this problem using any of the numpy functions I am familiar with, so I'm looking for help to see whether this can be done with numpy. My arrays are large (>1e6 elements), and high-performance is critical, so I cannot afford the performance hit of a python for loop.
I have a length-num_pts sorted integer array objID
that stores (possibly repeated) object identifiers.
objID = np.array([0, 0, 5, 5, 5, 7, 8, 8])
I determine the unique entries of objID
, and the indices of their appearance in objID
, using numpy.unique.
unique_objIDs, idx_unique_objIDs = np.unique(objID, return_index=True)
num_unique_objIDs = len(unique_objIDs)
I also have a length-num_unique_objIDs array occupations
that specifies how many times I want to select each entry of unique_objIDs
from objID
.
occupations = np.array([0, 2, 1, 2])
I want to determine the array of indices that I can use to retrieve elements of objID
according to occupations
. I give a specific example below.
desired_array_of_indices = np.array([2, 3, 5, 6, 7])
The array desired_array_of_indices
is what I want to use numpy to calculate. The entries of desired_array_of_indices
are calculated as follows.
desired_array_of_indices
Element-i of the occupations
array specifies how many times unique_objID[i]
will be selected. The desired_array_of_indices
array stores the indices of objID
of these selections. For values of objID
that are selected more than once, successive indices of objID
are chosen, so that no indices stored in desired_array_of_indices
are repeated.
For concreteness, consider the first element of occupations
. The value is zero, telling us that we do not want to select any indices of objID
that store unique_objIDs[0]=0
, so all such indices are left out of desired_array_of_indices
.
The next element of occupations
is 2, telling us that we want to select the indices of the first 2 appearances of unique_objIDs[1]=5
in objID
. This is why the first two entries of desired_array_of_indices
are 2 and 3.
The next element of occupations
is 1, telling us that we want to select the index of the first appearance of unique_objIDs[2]=7
in objID
. So the next entry of desired_array_of_indices
is 5.
The last element of occupations
is 2, telling us that we want to select the indices of the first 2 appearances of unique_objIDs[3]=8
in objID
. This is why the last two entries of desired_array_of_indices
are 6 and 7.
Note the subtle difference between this calculation and numpy.repeat
. For numpy.repeat
, the returned indices pertain to the array of unique entries, unique_objIDs
. Here I need the indices of objID
, and I also need to select successive indices for cases of repeated entries. Each entry of occupations
can be assumed to be less than or equal to the total number of times the corresponding entry appears in objID
, so there is no danger of an indexing error.
Does anyone see how to formulate this problem in terms of (possibly some collection of) available vectorized Numpy functions?
Here's one way.
First, your example code:
In [102]: objID = np.array([0, 0, 5, 5, 5, 7, 8, 8])
In [103]: unique_objIDs, idx_unique_objIDs = np.unique(objID, return_index=True)
[[Note: unique()
sorts its argument. You know your input is already sorted, so a more efficient way to get idx_unique_objIDs
is:
idx_unique_objIDs = np.concatenate(([0], np.nonzero(np.diff(objID))[0] + 1))
This operation is O(n) instead of the O(n*log(n)) required by unique
. Then you can use
unique_objIDs = objID[idx_unique_objIDs]
if you need the array of the unique object IDs.]]
In [104]: occupations = np.array([0, 2, 1, 2])
Now find the desired indices. The result is in line Out[107]
:
In [105]: csum = occupations.cumsum()
In [106]: n = csum[-1]
In [107]: np.arange(n) + np.repeat(idx_unique_objIDs - csum + occupations, occupations)
Out[107]: array([2, 3, 5, 6, 7])
A closer look:
csum
is the cumulative sum of occupations
, and n
is the total of occupations
:
In [114]: csum
Out[114]: array([0, 2, 3, 5])
In [115]: n
Out[115]: 5
csum
can be interpreted as the index of the end of the range of indices (pythonic "end", that is) associated with each occupation. Then csum - occupations
holds the indices of the beginnings of the ranges:
In [116]: csum - occupations
Out[116]: array([0, 0, 2, 3])
Repeat those starting indices according to the values in occupations
:
In [117]: np.repeat(csum - occupations, occupations)
Out[117]: array([0, 0, 2, 3, 3])
If that is subtracted from np.arange(n)
, we have, for each occupation k
, a range from 0 to occupation[k]-1
concatenated in an array:
In [118]: np.arange(n) - np.repeat(csum - occupations, occupations)
Out[118]: array([0, 1, 0, 0, 1])
That's not quite the desired result. We have to add the (repeated) idx_unique_objIDs
so that the values are indices into the array objID
:
In [119]: np.arange(n) - np.repeat(csum - occupations, occupations) + np.repeat(idx_unique_objIDs, occupations)
Out[119]: array([2, 3, 5, 6, 7])
Now combine those two repeat()
calls to get the final expression:
In [120]: np.arange(n) + np.repeat(idx_unique_objIDs - csum + occupations, occupations)
Out[120]: array([2, 3, 5, 6, 7])