Efficient method to replace values in awkward array according to a dictionary?


I have a dictionary with integer keys and float values. I also have a 2D awkward array with integer entries (I'm using awkward1). I want to replace these integers with the corresponding float according to the dictionary, keeping the awkward array format.

Assuming the keys run from 0 to 999, my solution so far is something like this:

resultArray = ak.where(myArray == 0, myDict.get(0), 0)
for key in range(1,1000):
    resultArray = resultArray + ak.where(myArray == key, myDict.get(key), 0)

Is there a faster way to do this?

Update

Minimal reproducible example of my working code:

import awkward as ak # Awkward 1

myArray = ak.from_iter([[0, 1], [2, 1, 0]]) # Creating example array
myDict = {0: 19.5, 1: 34.1, 2: 10.9}

resultArray = ak.where(myArray == 0, myDict.get(0), 0)
for key in range(1,3):
    resultArray = resultArray + ak.where(myArray == key, myDict.get(key), 0)

myArray:

<Array [[0, 1], [2, 1, 0]] type='2 * var * int64'>

resultArray:

<Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>

Solution

  • When I mentioned in a comment that np.searchsorted is where you should be looking, I hadn't noticed that myDict includes every consecutive integer as a key. Having a dense lookup table like this would allow faster algorithms, which also happen to be simpler in Awkward Array.

    So, assuming that there's a key in myDict for each integer from 0 up to some value, you can equally well represent the lookup table as

    >>> lookup = ak.Array([myDict[i] for i in range(len(myDict))])
    >>> lookup
    <Array [19.5, 34.1, 10.9] type='3 * float64'>
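
    The dense-key assumption is easy to get wrong silently, so it may be worth checking before building the table. The following is a minimal sketch, not part of the original answer: dense_lookup is a hypothetical helper name, and it returns a plain NumPy array rather than an ak.Array.

```python
import numpy as np


def dense_lookup(mapping):
    """Return the values of `mapping` as a NumPy array ordered by key.

    Hypothetical helper: raises ValueError unless the keys are exactly
    0, 1, ..., len(mapping) - 1, which is the assumption made above.
    """
    n = len(mapping)
    if sorted(mapping) != list(range(n)):
        raise ValueError("keys must be exactly 0, 1, ..., len(mapping) - 1")
    # Position i of the table holds the value for key i.
    return np.array([mapping[i] for i in range(n)], dtype=np.float64)


myDict = {0: 19.5, 1: 34.1, 2: 10.9}
lookup = dense_lookup(myDict)
```

    If the check fails, the keys are sparse and the np.searchsorted route is the one to use.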
    

    The problem of picking values at 0, 1, and 2 becomes just an array-slice. (This array-slice is an O(n) algorithm for an array of length n, unlike np.searchsorted, which does a binary search per element and so costs O(n log k) for k lookup keys. That's the cost of having sparse lookup keys.)

    The problem, however, is that myArray is nested and lookup is not. We can give lookup the same depth as myArray by slicing it up:

    >>> import numpy as np
    >>> multilookup = lookup[np.newaxis][np.zeros(len(myArray), np.int64)]
    >>> multilookup
    <Array [[19.5, 34.1, 10.9, ... 34.1, 10.9]] type='2 * 3 * float64'>
    >>> multilookup.tolist()
    [[19.5, 34.1, 10.9], [19.5, 34.1, 10.9]]
    

    And then multilookup[myArray] is exactly what you want:

    >>> multilookup[myArray]
    <Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>
    

    The lookup had to be duplicated because each list within myArray uses global indexes in the whole lookup. If the memory involved in creating multilookup is prohibitive, you could instead break myArray down to match it:

    >>> flattened, num = ak.flatten(myArray), ak.num(myArray)
    >>> flattened
    <Array [0, 1, 2, 1, 0] type='5 * int64'>
    >>> num
    <Array [2, 3] type='2 * int64'>
    >>> lookup[flattened]
    <Array [19.5, 34.1, 10.9, 34.1, 19.5] type='5 * float64'>
    >>> ak.unflatten(lookup[flattened], num)
    <Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>
    

    If your keys are not dense from 0 up to some integer, then you'll have to use np.searchsorted:

    >>> keys = ak.Array(sorted(myDict))  # np.searchsorted needs the keys in sorted order
    >>> values = ak.Array([myDict[key] for key in keys])
    >>> keys
    <Array [0, 1, 2] type='3 * int64'>
    >>> values
    <Array [19.5, 34.1, 10.9] type='3 * float64'>
    

    In this case, the keys are trivial because they are dense. When using np.searchsorted, you have to explicitly cast the flat Awkward Arrays as NumPy (for now; we're looking to fix that).

    >>> lookup_index = np.searchsorted(np.asarray(keys), np.asarray(flattened), side="left")
    >>> lookup_index
    array([0, 1, 2, 1, 0])
    

    Then keys[lookup_index] recovers the original entries (a check that every entry matched a key exactly), and values[lookup_index] is the looked-up result. Note that values is indexed by lookup_index, a position in the sorted keys, not by the key itself; the two coincide here only because the keys are dense.

    >>> keys[lookup_index]
    <Array [0, 1, 2, 1, 0] type='5 * int64'>
    >>> values[lookup_index]
    <Array [19.5, 34.1, 10.9, 34.1, 19.5] type='5 * float64'>
    >>> ak.unflatten(values[lookup_index], num)
    <Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>
    

    But the thing I was waffling about in yesterday's comment was that you have to do this on the flattened form of myArray (flattened) and reintroduce the structure later with ak.unflatten, as above. Perhaps we should wrap np.searchsorted as ak.searchsorted so that it recognizes a fully structured Awkward Array in the second argument, at least. (The first argument has to be unstructured.)