I'm trying to write a join operation between two NumPy arrays and was surprised to find that NumPy's recfunctions.join_by doesn't handle duplicate values.
The approach I'm taking is to use the columns being joined and find an index mapping between them. From looking online, the majority of NumPy-only solutions suffer from the same problem of not being able to handle duplicates (you'll see what I mean in the code section below).
I'm looking to stay entirely within the NumPy library, if at all possible, to take advantage of vectorized operations. So ideally no native Python code, Pandas (for other reasons), or numpy-indexed.
Below are a few questions I've looked at:
A way to map one array onto another in numpy?
Find index mapping between two numpy arrays
Numpy: For every element in one array, find the index in another array
Index mapping between two sorted partially overlapping numpy arrays
For example, take arrays X and Y, which are to be joined using a column from each of them, x and y respectively.
The mapping is defined as follows, where f is the function I'm after:
mapping = f(x, y)
x = y[mapping]
So for example,
x = np.array([1,1,2,100])
y = np.array([1,2,3,4,5,6,7])
mapping = [0, 0, 1, -] # '-' indicates masked
x = y[mapping]
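To make the target concrete, here is one possible way to express the '-' entry with a NumPy masked array. This is only a sketch of the desired result, with the mapping written out by hand; the names idx, missing and joined are placeholders introduced for illustration, and the function f that would produce them is still the open question.

```python
import numpy as np

x = np.array([1, 1, 2, 100])
y = np.array([1, 2, 3, 4, 5, 6, 7])

# Hand-written mapping: position in y for each element of x.
# The last slot has no match, so its index is a dummy 0 that gets masked out.
idx = np.array([0, 0, 1, 0])
missing = np.array([False, False, False, True])

# Gather from y, then hide the entries that had no match.
joined = np.ma.masked_array(y[idx], mask=missing)

print(joined)             # masked entry is shown as --
print(joined.filled(-1))  # replace the masked entry with a sentinel
```

The unmasked entries of joined equal the matching entries of x, which is the property the mapping is meant to satisfy.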
From looking at similar questions online, to find the mapping from x to y there is np.where(np.isin(x,y)), which deduplicates values. There is also np.searchsorted(x,y), which doesn't handle duplicates in x at all. I'm wondering if there is something else that can be done.
The following is not a correct mapping, due to the duplicates in x:
import numpy as np
x = np.array([1,1,2,100])
y = np.array([1,2,3,4,5,6,7])
mapping = np.searchsorted(x, y)
# [0 2 3 3 3 3 3]
This is also not a correct mapping, because mapping needs to be the same length as x:
import numpy as np
x = np.array([1,1,2,100])
y = np.array([1,2,3,4,5,6,7])
mapping = np.where(np.isin(x, y))[0]
# [0, 1, 2]
Using np.isin() we can create a mask that shows which values of x are present in y. Once you have that, you only need to figure out the indices.
import numpy as np
# Arrays to be joined
x = np.array([1, 1, 2, 100, 4, 5, 3, 75])
y = np.array([1, 2, 3, 4, 5, 6, 7])
# Get mask with True and False values
mask = np.isin(x, y)
# [ True True True False True True True False]
# Get indices of every element
indices = np.searchsorted(y, x)
# [0 0 1 7 3 4 2 7]
# Match indices with mask
mapping = np.where(mask, indices, -1)
# [ 0 0 1 -1 3 4 2 -1]
We create the mask, get the indices, and then combine the indices with the mask. Values that are not in y get the value -1.
This solution stays fully inside the NumPy library.
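One caveat: np.searchsorted assumes that y is sorted in ascending order, which happens to be true in the example. If y may be unsorted, the same idea can be extended with np.argsort and the sorter argument of np.searchsorted. The following is a rough sketch under that assumption, not a tested drop-in; index_mapping and its fill parameter are hypothetical names introduced here.

```python
import numpy as np

def index_mapping(x, y, fill=-1):
    """Hypothetical helper: for each x[i], return j with y[j] == x[i], else fill."""
    x = np.asarray(x)
    y = np.asarray(y)
    if y.size == 0:
        return np.full(x.shape, fill, dtype=np.intp)
    # Stable sort order of y, so y itself does not need to be sorted.
    order = np.argsort(y, kind="stable")
    # Insertion positions of x within the sorted view of y.
    pos = np.searchsorted(y, x, sorter=order)
    # Values larger than max(y) land at len(y); clip to keep indexing valid.
    pos = np.clip(pos, 0, y.size - 1)
    # Translate positions in the sorted view back to indices into y.
    candidate = order[pos]
    # Keep a candidate only if it is an exact match.
    found = y[candidate] == x
    return np.where(found, candidate, fill)

x = np.array([1, 1, 2, 100, 4, 5, 3, 75])
y = np.array([1, 2, 3, 4, 5, 6, 7])
print(index_mapping(x, y))
# [ 0  0  1 -1  3  4  2 -1]

print(index_mapping([1, 1, 2, 5], [5, 3, 1]))
# [ 2  2 -1  0]
```

Duplicates in x are handled because every element of x is looked up independently. If y itself contains duplicates, this sketch returns the first matching index in y, which may or may not be what a full join needs.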