Search code examples
awkward-array

Efficiently sorting and filtering a JaggedArray by another one


I have a JaggedArray (awkward.array.jagged.JaggedArray) that contains indices that point to positions in another JaggedArray. Both arrays have the same length, but each of the numpy.ndarrays that the JaggedArrays contain can be of different length. I would like to sort the second array using the indices of the first array, at the same time dropping the elements from the second array that are not indexed from the first array. The first array can additionally contain values of -1 (could also be replaced by None if needed, but this is currently not that case) that mean that there is no match in the second array. In such a case, the corresponding position in the first array should be set to a default value (e.g. 0).

Here's a practical example and how I solve this at the moment:

import uproot
import numpy as np
import awkward

def good_index(my_indices, my_values):
    my_list = []
    for index in my_indices:
        if index > -1:
            my_list.append(my_values[index])
        else:
            my_list.append(0)
    return my_list

indices = awkward.fromiter([[0, -1], [3,1,-1], [-1,0,-1]])
values = awkward.fromiter([[1.1, 1.2, 1.3], [2.1,2.2,2.3,2.4], [3.1]])

new_map = awkward.fromiter(map(good_index, indices, values))

The resulting new_map is: [[1.1 0.0] [2.4 2.2 0.0] [0.0 3.1 0.0]].

Is there a more efficient/faster way achieving this? I was thinking that one could use numpy functionality such as numpy.where, but due to the different lengths of the ndarrays this fails at least for the ways that I tried.


Solution

  • If all of the subarrays in values are guaranteed to be non-empty (so that indexing with -1 returns the last subelement, not an error), then you can do this:

    >>> almost = values[indices]       # almost what you want; uses -1 as a real index
    >>> almost.content = awkward.MaskedArray(indices.content < 0, almost.content)
    >>> almost.fillna(0.0)
    <JaggedArray [[1.1 0.0] [2.4 2.2 0.0] [0.0 3.1 0.0]] at 0x7fe54c713c88>
    

    The last step is optional because without it, the missing elements are None, rather than 0.0.

    If some of the subarrays in values are empty, you can pad them to ensure they have at least one subelement. All of the original subelements are indexed the same way they were before, since pad only increases the length, if need be.

    >>> values = awkward.fromiter([[1.1, 1.2, 1.3], [], [2.1, 2.2, 2.3, 2.4], [], [3.1]])
    >>> values.pad(1)
    <JaggedArray [[1.1 1.2 1.3] [None] [2.1 2.2 2.3 2.4] [None] [3.1]] at 0x7fe54c713978>