Search code examples
pythonawkward-array

Creating a new Awkward array from indices


The problem I am facing is to create a new array from a set of indices. Namely, I have a set of particles and jets, for each jet there is a list of indices of which particles belong to the given jet.

    import awkward as ak
    import vector

    particles = ak.Array({
        "mass": ak.Array([1, 2, 3, 4, 5]),
        "px": ak.Array([1, 2, 3, 4, 5]),
        "py": ak.Array([1, 2, 3, 4, 5]),
        "pz": ak.Array([1, 2, 3, 4, 5]),
    })

    particles_p4 = vector.awk(ak.zip({
        "mass": particles["mass"],
        "x": particles["px"],
        "y": particles["py"],
        "z": particles["pz"]
    }))

    jet = ak.Array({
        "constituents": ak.Array([[1, 2, 4, 5], [3]]),
        "energy": ak.Array([1.2, 3.4])
    })

What I would like to get is for each jet the particle_p4 values in a new array like so:

<Array [[{x: 1, y: 1, z: 1, ... z: 4, tau: 4}]] type='2 * var * {"x": int64, "y"...'>

where the first element of that would be:

<Array [{x: 1, y: 1, z: 1, ... z: 5, tau: 5}] type='4 * {"x": int64, "y": int64,...'>

Doing this with a for loop is trivial, however I have the notion that this can be done in a more efficient way with the tools available in Awkward array.

Bonus: What about even more nested arrays, for example where each event has multiple jets?


Solution

  • First, I think you mean you have

    jet = ak.Array({
        "constituents": ak.Array([[0, 1, 3, 4], [2]]),
        "energy": ak.Array([1.2, 3.4])
    })
    

    because I'd expect the "constituents" indexes to be 0-based, not 1-based. But even if it is 1-based, just start by subtracting 1.

    >>> jet.constituents - 1
    <Array [[0, 1, 3, 4], [2]] type='2 * var * int64'>
    

    The biggest problem here is that these indexes are nested one level deeper than the particles_p4 that you want to slice. You want the 0, 1, 3, 4, and also the 2, in your jet.constituents to be indexes in the not-nested list, particles_p4.

    If we just arbitrarily flatten them (axis=-1 means to squash the last/deepest dimension):

    >>> ak.flatten(jet.constituents, axis=-1)
    <Array [0, 1, 3, 4, 2] type='5 * int64'>
    

    these indexes are exactly what you'd need to apply to particles_p4. Here, I'm using the current (2.x) version of Awkward Array, so that I can use .show(), but the integer-array slice works in any version of Awkward Array.

    >>> particles_p4[ak.flatten(jet.constituents, axis=-1)].show(type=True)
    type: 5 * Momentum4D[
        x: int64,
        y: int64,
        z: int64,
        tau: int64
    ]
    [{x: 1, y: 1, z: 1, tau: 1},
     {x: 2, y: 2, z: 2, tau: 2},
     {x: 4, y: 4, z: 4, tau: 4},
     {x: 5, y: 5, z: 5, tau: 5},
     {x: 3, y: 3, z: 3, tau: 3}]
    

    If we take that as a partial solution, all we need to do now is put the nested structure back into the result.

    ak.flatten has an opposite, ak.unflatten, which takes a flat array and adds nestedness from an array of list lengths. You can get the list lengths from the original jet.constituents with ak.num. Again, I'll use axis=-1 so that this answer will generalize to deeper nestings.

    >>> lengths = ak.num(jet.constituents, axis=-1)
    >>> lengths
    <Array [4, 1] type='2 * int64'>
    
    >>> rearranged = particles_p4[ak.flatten(jet.constituents, axis=-1)]
    >>> rearranged
    <MomentumArray4D [{x: 1, y: 1, z: 1, tau: 1}, ..., {...}] type='5 * Momentu...'>
    
    >>> result = ak.unflatten(rearranged, lengths, axis=-1)
    >>> result.show(type=True)
    type: 2 * var * Momentum4D[
        x: int64,
        y: int64,
        z: int64,
        tau: int64
    ]
    [[{x: 1, y: 1, z: 1, tau: 1}, {x: 2, ...}, ..., {x: 5, y: 5, z: 5, tau: 5}],
     [{x: 3, y: 3, z: 3, tau: 3}]]
    

    For the bonus round, if all of the above arrays (particles_p4 and jet) were arrays of lists, where each list represents one event, rather than an array representing one event, then the above would hold. I'm taking it as a given that the length of the particles_p4_by_event is equal to the length of the jet_by_event arrays, and the values of jet_by_event.constituents are indexes within each event in particles_p4_by_event (not global indexes; each event should restart at zero). That is, all of your arrays agree on how many events there are, and each event is handled individually, with no cross-over between events.