I have two 3D numpy arrays of large size (~1.2b elements, e.g. 80k x 500 x 30)
.
These two arrays have the same size on 2 dimension but differ on the third, which is related to timestamps. I also have two arrays containing the timestamp values corresponding to the planes along the size differing axis. I would like to stretch/broadcast and fill the shorter array with the last plane that contained values in itself prior to the gap, in order to get to the same size as the larger array such that I can then sum them up. The "gaps" and "last plane" are determined by how the timestamp arrays match that relate to the data arrays.
Here is the logic illustrated (has black font, for dark mode version see further below):
Along the axis in which the arrays differ in size, the corresponding values in the timestamp arrays are dates. However, the granularity of these timestamps can vary. For example, the timestamps relating to the larger array may be a series of consecutive days, and the timestamps relating to the shorter array may be every Friday within the same period, but ending short or beginning late, i.e. missing the first 5 Fridays or the last 10 Fridays when comparing to the start and end of the larger array.
When there is such a "gap", I would like the (temproally speaking) last available plane of the 3D array to stretched over that gap until the next available plane has values.
Note, the Fridays example was just one case. It could also be that the larger array corresponds to every Monday in a certain period and the shorter array is the first day of every month in that period, but ending short or starting late.
Another important note might be that the axis of the 3D arrays relating to the timestamps has the smallest size of all the axis. Meaning for example, array 1 might be (80k, 500, 180) and array 2 might be (80k, 500, 50). So looping over that axis and then repeating planes of size (80k, 500, 1) might work. But I don't know how.
Also, the shorter array is fully contained in the larger.
Let's say there are two arrays A and B
m, k, n, v = 3, 4, 10, 4
A = np.random.randint(10, size=(m, n, k))
B = np.random.randint(10, size=(m, v, k))
A_timestamps = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
B_timestamps = [2, 4, 7, 8]
However, although B_timestamps is fully contained, its lowest timestamp is greater than the lowest timestamp in A. In other words, B start late.
Then the reindexing with this method
equals = np.equal.outer(A_timestamps, B_timestamps)
filled = np.maximum.accumulate(equals, axis=0)
reindex = len(B_timestamps) - np.argmax(filled[:, ::-1], axis=1) - 1
leads to
# array([4, 1, 1, 2, 2, 2, 3, 4, 4, 4], dtype=int64)
starting with a 4.
Ideally, when reindexing the array B, I would want the gap in the front, due to B starting late, to be filled with zeros.
I guess I could check for the first matching timestamp first, only reindex B from there on, and then maybe use np.pad to add zeros at B's front to make up for the excluded timestamps in A.
Repeating image with white font for dark mode users:
This method seems to be a bit more performant than @Chrysophylaxs answer, maybe ~2x for larger input.
First find the indices where B_timestamps
would fit into A_timestamps
. The gaps between these indices represent the number of times each element of B
should be used:
repeats = np.diff(np.searchsorted(A_timestamps, B_timestamps), append=A_timestamps[-1])
B_expanded = np.repeat(B, repeats, axis=1)
For the left padding with 0s:
left_pad_width = A.shape[1] - B_expanded.shape[1]
B_expanded = np.pad(
B_expanded,
pad_width=((0, 0), (left_pad_width, 0), (0, 0)),
constant_values=0,
)