In Python, I have two arrays:
import numpy as np
a = np.array([
[1, 0.2],
[2, 0.5],
[3, 0.8]])
b = np.array([
[2, 0.4],
[3, 0.7],
[4, 1.3],
[5, 2]])
I need to do an "outer join" based on the values of the first column, and fill with 0 in case there is no match, i.e., end up with something like this:
c = np.array([
[1, 0.2, 0],
[2, 0.5, 0.4],
[3, 0.8, 1.7],
[4, 0, 1.3],
[5, 0, 2]])
I was first thinking about converting to DataFrames and then doing an outer join, but I need this to be very fast, since I need to do it 18 million times. I don't necessarily need the entire array, two vectors in the same order would also suffice. Having the two arrays as DataFrames in the first place is not an option, because I use the arrays for several operations where NumPy has considerable speed advantage.
How can I do this very fast?
How about something like this?
keys = np.union1d(a[:, 0], b[:, 0])
# Initialize result array with zeros - assume the 3 columns as output
c = np.zeros((keys.shape[0], 3))
c[:, 0] = keys
idx_a = np.searchsorted(keys, a[:, 0])
idx_b = np.searchsorted(keys, b[:, 0])
# Assign values where keys match
c[idx_a, 1] = a[:, 1]
c[idx_b, 2] = b[:, 1]
print(c)
"""
[[1. 0.2 0. ]
[2. 0.5 0.4]
[3. 0.8 0.7]
[4. 0. 1.3]
[5. 0. 2. ]]
"""