Get all attributes with common name on different levels in Awkward array


Does the awkward library provide a way to slice out all attributes of a given name, regardless of the level? I was thinking something like this:

import awkward as ak

obj = {
    'resource_id': 'abc',
    'events': [
        {'resource_id': '123', 'value': 12, 'picks':
            [{'resource_id': 'asd', 'value': 1},
             {'resource_id': 'dll', 'value': 12}
            ]
         },
         {'resource_id': '456', 'value': 12, 'picks':
            [{'resource_id': 'cvf', 'value': 23},
             {'resource_id': 'ggf', 'value': 34},
             ]
         },
    ]
}


ar = ak.from_iter(obj)

rid = ar[..., 'resource_id']

The value of rid is simply the string 'abc', but I was expecting something more like the following:

[
   ['abc'],
   ['events':[
       [['123'], 'picks':[['asd'], ['dll']]], 
       [['456'], 'picks':[['cvf'], ['ggf']]],
   ]
]       

However, I am still trying to get my head around awkward arrays so I could be completely off here.


Solution

  • It doesn't, and I'm not sure how the output of such an operation should be shaped. For instance, if you pick the outer "resource_id", you get

    >>> ar["events", "resource_id"]
    <Array ['123', '456'] type='2 * string'>
    

    but if you pick the inner "resource_id", you get

    >>> ar["events", "picks", "resource_id"]
    <Array [['asd', 'dll'], ['cvf', 'ggf']] type='2 * var * string'>
    

    Note that ... (the Ellipsis) does have a meaning, but it slices through rows (nested lists), not columns (record field names):

    >>> ar["events", "picks", "value"]
    <Array [[1, 12], [23, 34]] type='2 * var * int64'>
    >>> ar["events", "picks", "value", ..., 0]
    <Array [1, 23] type='2 * int64'>
    

    Also, it might help to know that you can project with strings and lists of strings (nested projection):

    >>> print(ar["events", "picks", ["resource_id", "value"]])
    [[{resource_id: 'asd', value: 1}, ... {resource_id: 'ggf', value: 34}]]
    

    in case that helps with your slicing problem. The solution will likely be to pick out "resource_id" manually at each level and combine the results in a way that makes sense for your data, though that may not generalize.
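
As a rough illustration of the manual approach, here is a minimal sketch that works on the plain Python object before it is passed to ak.from_iter, not on the Awkward array itself. The helper name collect_field and the path-tuple output are assumptions made for this example; they are not part of the Awkward API. Each match is returned with the path that leads to it, so the level it came from is not lost.

```python
def collect_field(node, name, path=()):
    """Yield (path, value) for every dict key equal to `name`.

    Hypothetical helper: walks nested dicts and lists in plain Python.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            if key == name:
                yield path + (key,), value
            else:
                # Descend into other fields, remembering how we got here.
                yield from collect_field(value, name, path + (key,))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from collect_field(item, name, path + (index,))
    # Scalars (str, int, ...) have no fields, so nothing is yielded.


obj = {
    'resource_id': 'abc',
    'events': [
        {'resource_id': '123', 'value': 12, 'picks': [
            {'resource_id': 'asd', 'value': 1},
            {'resource_id': 'dll', 'value': 12},
        ]},
        {'resource_id': '456', 'value': 12, 'picks': [
            {'resource_id': 'cvf', 'value': 23},
            {'resource_id': 'ggf', 'value': 34},
        ]},
    ],
}

found = list(collect_field(obj, 'resource_id'))
for path, value in found:
    print(path, value)
```

The paths (for example ('events', 0, 'picks', 1, 'resource_id')) are one possible way to keep the results from different levels apart; how to combine them afterwards depends on your data.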