
Arrays of strings from uproot


I have a tree with one branch that stores a string. When I open the file with uproot.open() and read the tree with the arrays() method, I get the following:

>>> array_train['backtracked_end_process']
<ObjectArray [b'FastScintillation' b'FastScintillation' b'FastScintillation' ... b'FastScintillation' b'FastScintillation' b'FastScintillation'] at 0x7f48936e6c90>

I would like to use this branch to create masks with comparisons such as array_train['backtracked_end_process'] == b'FastScintillation' (or the same with !=), but unfortunately this raises an error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-97-a28f3706c5b5> in <module>
----> 1 array_train['backtracked_end_process'] == b'FastScintillation'

~/.local/lib/python3.7/site-packages/numpy/lib/mixins.py in func(self, other)
     23         if _disables_array_ufunc(other):
     24             return NotImplemented
---> 25         return ufunc(self, other)
     26     func.__name__ = '__{}__'.format(name)
     27     return func

~/.local/lib/python3.7/site-packages/awkward/array/objects.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
    216                 contents.append(x)
    217 
--> 218         result = getattr(ufunc, method)(*contents, **kwargs)
    219 
    220         if self._util_iscomparison(ufunc):

~/.local/lib/python3.7/site-packages/awkward/array/jagged.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
    987                 data = self._util_toarray(inputs[i], inputs[i].dtype)
    988                 if starts.shape != data.shape:
--> 989                     raise ValueError("cannot broadcast JaggedArray of shape {0} with array of shape {1}".format(starts.shape, data.shape))
    990 
    991                 if parents is None:

ValueError: cannot broadcast JaggedArray of shape (24035,) with array of shape ()

Does anyone have any suggestions on how to proceed? Converting the branch to a numpy.chararray would already solve the problem, but I don't know how to do that.


Solution

  • String handling is a weak point in uproot. It uses a custom ObjectArray (not even the StringArray in awkward-array), which generates bytes objects on demand. What you'd like is an array-of-strings class with == overloaded to mean "compare each variable-length string, broadcasting a single string to an array if necessary." Unfortunately, neither the uproot ObjectArray of strings nor the StringArray class in awkward-array does that yet.

    So here's how you can do it, admittedly through an implicit Python for loop.

    >>> import uproot, numpy
    >>> f = uproot.open("http://scikit-hep.org/uproot/examples/sample-6.10.05-zlib.root")
    >>> t = f["sample"]
    
    >>> t["str"].array()
    <ObjectArray [b'hey-0' b'hey-1' b'hey-2' ... b'hey-27' b'hey-28' b'hey-29'] at 0x7fe835b54588>
    
    >>> numpy.array(list(t["str"].array()))
    array([b'hey-0', b'hey-1', b'hey-2', b'hey-3', b'hey-4', b'hey-5',
           b'hey-6', b'hey-7', b'hey-8', b'hey-9', b'hey-10', b'hey-11',
           b'hey-12', b'hey-13', b'hey-14', b'hey-15', b'hey-16', b'hey-17',
           b'hey-18', b'hey-19', b'hey-20', b'hey-21', b'hey-22', b'hey-23',
           b'hey-24', b'hey-25', b'hey-26', b'hey-27', b'hey-28', b'hey-29'],
          dtype='|S6')
    
    >>> numpy.array(list(t["str"].array())) == b"hey-0"
    array([ True, False, False, False, False, False, False, False, False,
           False, False, False, False, False, False, False, False, False,
           False, False, False, False, False, False, False, False, False,
           False, False, False])
    

    The loop is implicit in the list constructor, which iterates over the ObjectArray and turns each element into a bytes object. A Python list is not suited to array-at-a-time operations, so we then wrap it in a NumPy array, which is, at the cost of padding every string to the length of the longest one (hence the fixed-width dtype '|S6' above).
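
    To make the padding cost concrete, here is a minimal sketch that uses only NumPy. The list `values` is a made-up stand-in for `list(t["str"].array())`, not data from the sample file; the point is that every element of the resulting array takes as many bytes as the longest string.

    ```python
    import numpy

    # Hypothetical stand-in for list(t["str"].array()): a plain list of bytes objects
    values = [b"a", b"bb", b"a-much-longer-string"]

    # NumPy picks a fixed-width bytes dtype wide enough for the longest element
    fixed = numpy.array(values)

    # Bytes reserved per element, and for the whole array
    width = fixed.dtype.itemsize
    total = fixed.nbytes

    # Bytes actually needed by the strings themselves
    needed = sum(len(v) for v in values)

    print(fixed.dtype, width, total, needed)
    ```

    If one string in the branch is much longer than the rest, `total` can be far larger than `needed`, which is worth keeping in mind for large trees.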

    Alternative, probably better:

    While writing this, I remembered that uproot's ObjectArray is implemented using an awkward JaggedArray, so the transformation above can be performed with JaggedArray's regular() method, which is probably much faster (no intermediate Python bytes objects, no Python for loop).

    >>> t["str"].array().regular()
    array([b'hey-0', b'hey-1', b'hey-2', b'hey-3', b'hey-4', b'hey-5',
           b'hey-6', b'hey-7', b'hey-8', b'hey-9', b'hey-10', b'hey-11',
           b'hey-12', b'hey-13', b'hey-14', b'hey-15', b'hey-16', b'hey-17',
           b'hey-18', b'hey-19', b'hey-20', b'hey-21', b'hey-22', b'hey-23',
           b'hey-24', b'hey-25', b'hey-26', b'hey-27', b'hey-28', b'hey-29'],
          dtype=object)
    
    >>> t["str"].array().regular() == b"hey-0"
    array([ True, False, False, False, False, False, False, False, False,
           False, False, False, False, False, False, False, False, False,
           False, False, False, False, False, False, False, False, False,
           False, False, False])
    

    (The functionality described above wasn't created intentionally, but it works because the right pieces compose in a fortuitous way.)
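
Once the branch is an ordinary NumPy array (by either route above), the mask from the question works as usual. The following is a sketch under stated assumptions: `process` and `energy` are made-up stand-ins for two branches of equal length, not data from any real file, and the decoding step at the end assumes the strings are UTF-8.

```python
import numpy

# Hypothetical stand-ins for two branches with one entry per event
process = numpy.array([b"FastScintillation", b"Decay", b"FastScintillation", b"conv"])
energy = numpy.array([1.5, 0.2, 3.1, 0.7])

# Comparing against a single bytes object broadcasts over the whole array
mask = process != b"FastScintillation"

# The boolean mask can then select entries from any array of the same length
selected_energy = energy[mask]

# Optional: turn bytes into str elementwise (assumes UTF-8 content)
as_text = numpy.char.decode(process, "utf-8")

print(mask)
print(selected_energy)
print(as_text)
```

Note that the comparison value has to be a bytes object (b"...") as long as the array holds bytes; after decoding, compare against a regular str instead.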