uproot

How to read a large set of data as lazyarrays


I am trying to read a large set of data as lazyarrays doing the following:

import uproot
import numpy as np

file_path = "~/data.root"
data = uproot.lazyarrays(file_path, "E")
hits = data['hits']
>>> <ChunkedArray [176 125 318 ... 76 85 51] at 0x7fb8612a8390>
np.array(hits)
>>> array([176, 125, 318, ...,  76,  85,  51], dtype=int32)

As you can see, I can read the 'hits' data both as a lazy array and as a NumPy array without issues. But when I try the same steps on a different branch, I get a ValueError. Here is how I proceed:

data['hits.dom_id']
>>> ValueError: value too large

However, when I open the file with uproot.open() and call array() on the 'hits.dom_id' branch directly, I get my data. Here is how I proceed:

data2 = uproot.open(file_path)['E']['Evt']['hits']
data2['hits.dom_id'].array()
>>> <JaggedArray [[806451572 806451572 806451572 ... 809544061 809544061 809544061] [806451572 806451572 806451572 ... 809524432 809526097 809544061] [806451572 806451572 806451572 ... 809544061 809544061 809544061] ... [806451572 806451572 806451572 ... 809006037 809524432 809544061] [806451572 806451572 806451572 ... 809503416 809503416 809544058] [806451572 806465101 806465101 ... 809544058 809544058 809544061]] at 0x7fb886cbbbd0>

I have noticed, though it may be a coincidence, that uproot.lazyarrays() raises the same ValueError whenever my data is in JaggedArray format.

I might be doing something wrong here, could you please help?

Note: I don't think it's a RAM issue. I tried playing with the cache size, by using a cache size bigger than my data set and uproot.lazyarrays() still raised the ValueError.

Thank you!


Solution

  • ValueError: value too large is the error message that cachetools emits when it can't put one array into cache. People hit this so often that I think I'll need to catch it and reemit it with a more informative message or maybe even enlarge the cache to make it fit. (Is that a terrible idea? I need to find a good default policy for caches.)

    See the recent GitHub Issues: lazy arrays have an implicit basketcache (which is different from the cache). You might need to provide an explicit basketcache if any of your baskets are bigger than 1 MB (the default limit).
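
To make the failure mode concrete, here is a minimal sketch of why a byte-limited cache rejects a single oversized item, followed by a hedged version of the suggested remedy. ByteLimitedCache, demo and lazy_with_big_basketcache are hypothetical names written for this illustration; the toy class is a simplified stand-in and not cachetools' actual implementation. The last function assumes the uproot 3 API used in the question, where uproot.lazyarrays accepts a basketcache keyword and uproot.ArrayCache takes a limit in bytes; the 1 GB default is an arbitrary choice, so check both against your installed version.

```python
from collections import OrderedDict


class ByteLimitedCache:
    """Toy LRU cache with a byte budget (a stand-in, not cachetools itself)."""

    def __init__(self, limitbytes):
        self.limitbytes = limitbytes
        self.usedbytes = 0
        self._items = OrderedDict()

    def __setitem__(self, key, value):
        size = len(value)  # values are bytes objects, so len() is the size in bytes
        if size > self.limitbytes:
            # A single value bigger than the whole budget can never fit,
            # no matter how many older entries are evicted.
            raise ValueError("value too large")
        if key in self._items:
            self.usedbytes -= len(self._items.pop(key))
        # Evict least recently used entries until the new value fits.
        while self.usedbytes + size > self.limitbytes:
            _, evicted = self._items.popitem(last=False)
            self.usedbytes -= len(evicted)
        self._items[key] = value
        self.usedbytes += size

    def __getitem__(self, key):
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def __contains__(self, key):
        return key in self._items


def demo():
    one_mb = 1024 * 1024
    small = ByteLimitedCache(one_mb)           # mimics a 1 MB limit
    small["basket-a"] = bytes(512 * 1024)      # fits within the budget
    try:
        small["basket-b"] = bytes(2 * one_mb)  # one item larger than the limit
    except ValueError as err:
        message = str(err)
    else:
        message = None
    large = ByteLimitedCache(4 * one_mb)       # a bigger budget accepts it
    large["basket-b"] = bytes(2 * one_mb)
    return message, "basket-b" in large


def lazy_with_big_basketcache(file_path, tree_name, limitbytes=1024**3):
    # ASSUMPTION: uproot 3, where lazyarrays takes a basketcache keyword and
    # uproot.ArrayCache takes a byte limit. Imported here so the toy example
    # above still runs without uproot installed.
    import uproot

    return uproot.lazyarrays(
        file_path, tree_name, basketcache=uproot.ArrayCache(limitbytes)
    )
```

Calling demo() shows the small cache raising the same "value too large" message while the larger one stores the item. With the question's file, lazy_with_big_basketcache("~/data.root", "E") would be the corresponding attempt; whether 1 GB is enough depends on the size of your largest basket.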