Error when using an awkward array with an index array


I currently have a flat array of values and an awkward array of integers. I want an awkward array with the same shape as the integer array, where each integer is replaced by the element of the "values" array at that index. For instance:

values = ak.Array(np.random.rand(100))
arr = ak.Array((np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)))

I want something like values[arr], but that gives the following error:

>>> values[arr]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3\lib\site-packages\awkward\highlevel.py", line 943, in __getitem__
    return ak._util.wrap(self._layout[where], self._behavior)
ValueError: cannot fit jagged slice with length 2 into RegularArray of size 100

If I run it with a loop, I get back what I want:

>>> result = [values[i] for i in arr]
>>> result
[<Array [0.842, 0.578, 0.159, ... 0.726, 0.702] type='33 * float64'>, <Array [0.509, 0.45, 0.202, ... 0.906, 0.367] type='125 * float64'>]

Is there another way to do this, or is this it? I'm afraid it'll be too slow for my application.

Thanks!


Solution

  • If you're trying to avoid Python for loops for performance, note that the first line casts a NumPy array as Awkward with ak.from_numpy (no loop, very fast):

    >>> values = ak.Array(np.random.rand(100))
    

    but the second line iterates over data in Python (has a slow loop):

    >>> arr = ak.Array((np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)))
    

    because a tuple of two NumPy arrays is not a NumPy array. It's a generic iterable, and the constructor falls back to ak.from_iter.

    On your main question, arr can't slice values because arr is a jagged array and values is not:

    >>> values
    <Array [0.272, 0.121, 0.167, ... 0.152, 0.514] type='100 * float64'>
    >>> arr
    <Array [[15, 24, 9, 42, ... 35, 75, 20, 10]] type='2 * var * int64'>
    

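    Note the types: values has type 100 * float64 and arr has type 2 * var * int64. A jagged slice has to line up with the lists of the array it slices, and values (length 100, no nested lists) doesn't line up with arr (length 2), so there's no rule for values[arr].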

    Since it looks like you want to slice values with arr[0] and then arr[1] (from your list comprehension), it could be done in a vectorized way by duplicating values for each element of arr, then slicing.

    >>> # The np.newaxis is to give values a length-1 dimension before concatenating.
    >>> duplicated = ak.concatenate([values[np.newaxis]] * 2)
    >>> duplicated
    <Array [[0.272, 0.121, ... 0.152, 0.514]] type='2 * 100 * float64'>
    

    Now duplicated has length 2 and one level of nesting, just like arr, so arr can slice it. The resulting array also has length 2, but the length of each sublist is the length of each sublist in arr, rather than 100.

    >>> duplicated[arr]
    <Array [[0.225, 0.812, ... 0.779, 0.665]] type='2 * var * float64'>
    >>> ak.num(duplicated[arr])
    <Array [33, 125] type='2 * int64'>
    

    If you're scaling up from 2 such lists to a large number, then this would eat up a lot of memory: duplicated scales as "length of values" × "length of arr", even though the output only has as many elements as arr itself. If this "2" is not going to scale up (if it will be at most thousands, not millions or more), then I wouldn't worry about the speed of the Python for loop. Python scales well for thousands, but not billions (depending, of course, on the size of the things being scaled!).