Search code examples
pythonarraysnumpyouter-join

Outer join numpy arrays, fill with 0


In Python, I have two arrays:

import numpy as np

a = np.array([
  [1, 0.2],
  [2, 0.5],
  [3, 0.8]])

b = np.array([
  [2, 0.4],
  [3, 0.7],
  [4, 1.3],
  [5, 2]])

I need to do an "outer join" based on the values of the first column, and fill with 0 in case there is no match, i.e., end up with something like this:

c = np.array([
  [1, 0.2, 0],
  [2, 0.5, 0.4],
  [3, 0.8, 1.7],
  [4, 0, 1.3],
  [5, 0, 2]])

I was first thinking about converting to DataFrames and then doing an outer join, but I need this to be very fast, since I need to do it 18 million times. I don't necessarily need the entire array, two vectors in the same order would also suffice. Having the two arrays as DataFrames in the first place is not an option, because I use the arrays for several operations where NumPy has considerable speed advantage.

How can I do this very fast?


Solution

  • How about something like this?

    keys = np.union1d(a[:, 0], b[:, 0])
    
    # Initialize result array with zeros - assume the 3 columns as output
    c = np.zeros((keys.shape[0], 3))
    c[:, 0] = keys
    
    idx_a = np.searchsorted(keys, a[:, 0])
    idx_b = np.searchsorted(keys, b[:, 0])
    
    # Assign values where keys match
    c[idx_a, 1] = a[:, 1]
    c[idx_b, 2] = b[:, 1]
    
    print(c)
    
    """
    [[1.  0.2 0. ]
     [2.  0.5 0.4]
     [3.  0.8 0.7]
     [4.  0.  1.3]
     [5.  0.  2. ]]
    """