I have the following four arrays, and I want the indices at which a value in A equals a value in X and, at those same positions, the value in B equals the value in Y. In other words, the pair (A[i], B[i]) has to match the pair (X[j], Y[j]). So for the following example,
import numpy as np
A = np.asarray([400.5, 100, 700, 200, 15, 900])
B = np.asarray([500.5, 200, 500, 600.5, 8, 999])
X = np.asarray([400.5, 700, 100, 300, 15, 555, 900])
Y = np.asarray([500.5, 500,600.5, 100, 8, 555, 999])
I want to get two arrays with the indices:
indAB = [0 2 4 5]
indXY = [0 1 4 6]
where indAB are the indices of the pairs in A and B that also occur in X and Y, and indXY are the indices of the pairs in X and Y that also occur in A and B.
This is my attempt so far:
def indices(a,b):
    setb = set(b)
    ind = [i for i, x in enumerate(a) if x in setb]
    return ind
iA = np.asarray(indices(A,X))
iB = np.asarray(indices(X,A))
iX = np.asarray(indices(B,Y))
iY = np.asarray(indices(Y,B))
def CommonIndices(a,b):
    return np.asarray(list(set(a) & set(b)))
indAB = CommonIndices(iA,iX)
indXY = CommonIndices(iB,iY)
print(indAB) # returns = [0 2 4 5]
print(indXY) # returns = [0 1 2 4 6]
I keep getting [0 1 2 4 6] for indXY, which is incorrect. Index 2 should not be included: even though 600.5 is in both Y and B, the values 200 and 100 at the corresponding positions in A and X (respectively) are not equal.
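To make the problem concrete, here is a small check in plain Python (lists instead of NumPy arrays, and the names pairs_AB and pairs_XY are only for illustration). It pairs the values up by position and shows that the pair at index 2 of X and Y never occurs in A and B, even though each of its values occurs on its own:

```python
A = [400.5, 100, 700, 200, 15, 900]
B = [500.5, 200, 500, 600.5, 8, 999]
X = [400.5, 700, 100, 300, 15, 555, 900]
Y = [500.5, 500, 600.5, 100, 8, 555, 999]

# Pair the values up by position so that whole pairs can be compared.
pairs_AB = list(zip(A, B))
pairs_XY = list(zip(X, Y))

# The pair at index 2 of (X, Y).
print(pairs_XY[2])                   # (100, 600.5)
# 100 is in A and 600.5 is in B, but at different positions.
print(A.index(100), B.index(600.5))  # 1 3
# So the pair as a whole does not occur in (A, B).
print(pairs_XY[2] in pairs_AB)       # False
```

So testing each array separately, as my attempt does, is probably where the extra index comes from.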
I would be very grateful if someone could offer a solution to this. Many thanks!
The numpy_indexed package (disclaimer: I am its author) contains functionality to do this kind of thing efficiently and elegantly. Memory requirements are linear, and the computational cost is O(N log N) for this method. For the substantial arrays you are considering, the speed benefit relative to the currently accepted brute-force method could easily be orders of magnitude:
import numpy as np
import numpy_indexed as npi
A = np.asarray([400.5, 100, 700, 200, 15, 900])
B = np.asarray([500.5, 200, 500, 600.5, 8, 999])
X = np.asarray([400.5, 700, 100, 300, 15, 555, 900])
Y = np.asarray([500.5, 500,600.5, 100, 8, 555, 999])
AB = np.stack([A, B], axis=-1)
XY = np.stack([X, Y], axis=-1)
# converting AB and XY with npi.as_index first is not required, but it is a
# performance optimization; without it, each call to npi.indices would have
# to re-index the arrays, which is the expensive part
AB = npi.as_index(AB)
XY = npi.as_index(XY)
# npi.indices(list, items) is a vectorized nd-equivalent of list.index(item)
indAB = npi.indices(AB, XY, missing='mask').compressed()
indXY = npi.indices(XY, AB, missing='mask').compressed()
Note that you can also choose how missing values are handled, through the missing argument used above. Also take a look at the set operations, such as npi.intersection(XY, AB); they might provide a simpler route to what you are aiming to achieve at a higher level.
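If adding a dependency is not an option, the same pairwise matching can be sketched with a dictionary lookup in plain Python. This is only a minimal sketch, not what numpy_indexed does internally: matching_pair_indices is a hypothetical helper name, and it assumes exact equality of the values and that no pair is repeated within A and B. Lists are used to keep it dependency-free; NumPy arrays can be passed as well, since zip iterates over them element by element.

```python
def matching_pair_indices(a, b, x, y):
    """Return (ind_ab, ind_xy) for positions where (a[i], b[i]) == (x[j], y[j]).

    Hypothetical helper, not part of numpy_indexed. Assumes exact equality
    and that no pair occurs more than once in (a, b).
    """
    # Map each (a[i], b[i]) pair to its position i.
    position_in_ab = {pair: i for i, pair in enumerate(zip(a, b))}
    ind_ab, ind_xy = [], []
    # Look every (x[j], y[j]) pair up in that mapping.
    for j, pair in enumerate(zip(x, y)):
        i = position_in_ab.get(pair)
        if i is not None:
            ind_ab.append(i)
            ind_xy.append(j)
    return ind_ab, ind_xy


A = [400.5, 100, 700, 200, 15, 900]
B = [500.5, 200, 500, 600.5, 8, 999]
X = [400.5, 700, 100, 300, 15, 555, 900]
Y = [500.5, 500, 600.5, 100, 8, 555, 999]

indAB, indXY = matching_pair_indices(A, B, X, Y)
print(indAB)  # [0, 2, 4, 5]
print(indXY)  # [0, 1, 4, 6]
```

The lookup takes expected linear time thanks to hashing, but it runs in the Python interpreter, so for large arrays the vectorized route above is likely to be faster.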