python root-framework uproot

Copy TTree to Other File


I'm trying to extract cycles/revisions ("TreeName;3", etc.) from one ROOT file and write each of them as its own tree in a new file. I tried creating a new file and assigning each tree to a new name, but I get an error telling me that TTree is not writable:

with uproot.open("old_file.root") as in_file:
    with uproot.recreate("new_file.root") as out_file:
        for key in in_file.keys():
            ttree = in_file[key]
            new_name = key.replace(";","_")
            out_file[new_name] = ttree

This resulted in

NotImplementedError: this ROOT type is not writable: TTree

I'm confused, because printing out_file shows that it is a <WritableDirectory '/' ...>, so I expected out_file[new_name] = ttree to assign by value. However, the documentation for uproot.writing.identify.add_to_directory says it raises this error if the object to be added is not writable, so I guess it doesn't just make a copy in memory as I expected.

Next I tried to make a new tree first and then move the data in chunk by chunk. However this also didn't work because the tree creation failed:

out_file[new_name] = ttree.typenames()

ValueError: 'extend' must fill every branch with the same number of entries; 'name2' has 7 entries

with the typenames being something like {'name1': 'double', 'name2': 'int32_t', 'name3': 'double[]', 'name4': 'int32_t[]', 'name5': 'bool[]'}

While trying to debug it, I noticed some very strange behavior:

out_file[new_name] = {'name1': 'double', 'name2': 'float32'}

yields the exact same error, while

out_file[new_name] = {'name1': 'float64', 'name2': 'float32'}
out_file[new_name].show()

gives

name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
name1                | uint8_t                  | AsDtype('uint8')
name2                | uint8_t                  | AsDtype('uint8')

So at this point I don't know what a datatype is anymore.

Finally, I tried writing the arrays directly, but this failed too:

arrays = ttree.arrays(ttree.keys(),library='np')
out_file[key.replace(";","_")] = arrays

giving

TypeError: cannot write Awkward Array type to ROOT file: unknown

Similar issues arise when using Awkward Array or pandas.


Solution

  • I decided to give a complete working example (following up on the comments above), but found that there are a lot of choices to be made. All you want to do is copy the input TTree, without making any choices, so what you really want is a high-level "copy whole TTree" function, but no such function exists. (It would be a good addition to Uproot, or to a new module that uses Uproot to do hadd-type work. A good project if anyone is interested!)

    I'm starting with this file, which may be obtained in a variety of ways:

    file_path = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
    
    file_path = "http://opendata.cern.ch/record/12341/files/Run2012BC_DoubleMuParked_Muons.root"
    
    file_path = "/tmp/Run2012BC_DoubleMuParked_Muons.root"
    

    It's big enough that it should be copied in chunks, not all at once. The first chunk sets the types, so it can be written by assigning a dict of new branch names to arrays, but subsequent chunks have to call WritableTree.extend because you don't want to replace the new TTree, you want to add to it. Neither of these deals with types explicitly; the types are picked up from the arrays.

    Here's a first attempt, using "100 MB" as a chunk size. (This will be the sum of TBasket sizes across TBranches in the output TTree. What we're doing here is more than copying; it's repartitioning the data into a new chunk size.)

    import uproot
    
    with uproot.recreate("/tmp/output.root") as output_file:
        first_chunk = True
    
        with uproot.open(file_path) as input_file:
            input_ttree = input_file["Events"]
    
            for arrays_chunk in input_ttree.iterate(step_size="100 MB"):
                if first_chunk:
                    output_file["Events"] = arrays_chunk
                    first_chunk = False
                else:
                    output_file["Events"].extend(arrays_chunk)
    

    However, it fails because assignment and extend expect a dict of arrays, not a single array.

    So we could ask TTree.iterate to give us a dict of Awkward Arrays, one for each TBranch, rather than a single Awkward Array that represents all of the TBranches. That would look like this:

    with uproot.recreate("/tmp/output.root") as output_file:
        first_chunk = True
    
        with uproot.open(file_path) as input_file:
            input_ttree = input_file["Events"]
    
            for dict_of_arrays in input_ttree.iterate(step_size="100 MB", how=dict):
                if first_chunk:
                    output_file["Events"] = dict_of_arrays
                    first_chunk = False
                else:
                    output_file["Events"].extend(dict_of_arrays)
    

    It copies the file, but whereas the original file had TBranches like

    name                 | typename                 | interpretation                
    ---------------------+--------------------------+-------------------------------
    nMuon                | uint32_t                 | AsDtype('>u4')
    Muon_pt              | float[]                  | AsJagged(AsDtype('>f4'))
    Muon_eta             | float[]                  | AsJagged(AsDtype('>f4'))
    Muon_phi             | float[]                  | AsJagged(AsDtype('>f4'))
    Muon_mass            | float[]                  | AsJagged(AsDtype('>f4'))
    Muon_charge          | int32_t[]                | AsJagged(AsDtype('>i4'))
    

    the new file has TBranches like

    name                 | typename                 | interpretation                
    ---------------------+--------------------------+-------------------------------
    nMuon                | uint32_t                 | AsDtype('>u4')
    nMuon_pt             | int32_t                  | AsDtype('>i4')
    Muon_pt              | float[]                  | AsJagged(AsDtype('>f4'))
    nMuon_eta            | int32_t                  | AsDtype('>i4')
    Muon_eta             | float[]                  | AsJagged(AsDtype('>f4'))
    nMuon_phi            | int32_t                  | AsDtype('>i4')
    Muon_phi             | float[]                  | AsJagged(AsDtype('>f4'))
    nMuon_mass           | int32_t                  | AsDtype('>i4')
    Muon_mass            | float[]                  | AsJagged(AsDtype('>f4'))
    nMuon_charge         | int32_t                  | AsDtype('>i4')
    Muon_charge          | int32_t[]                | AsJagged(AsDtype('>i4'))
    

    What happened is that Uproot didn't know that all of the Awkward Arrays have the same number of items per entry (that the number of pt values in one event is the same as the number of eta values in that event). If the TBranches hadn't all been muons, but some were muons and some were electrons or jets, that wouldn't be true.

    These nMuon_pt, nMuon_eta, etc. TBranches are included at all because ROOT needs them. The Muon_pt, Muon_eta, etc. TBranches are read, in ROOT, as C++ arrays of variable length, and a C++ user needs to know how big an array to preallocate and after which array entry the contents are uninitialized junk. They are not needed in Python (Awkward Array prevents users from seeing uninitialized junk).

    So you could ignore them. But if you really need to/want to get rid of them, here's a way: build exactly the array you want to write. Now that we're dealing with types, we'll use WritableDirectory.mktree and specify types explicitly. Since every write is an extend, we won't have to keep track of whether we're writing the first_chunk or a subsequent chunk anymore.

    For the Muon_pt, Muon_eta, etc. TBranches to share a single counter TBranch, nMuon, you want a Muon field to be an array of variable-length lists of muon objects with pt, eta, etc. fields. That type can be constructed from a string:

    import awkward as ak
    
    muons_type = ak.types.from_datashape("""var * {
        pt: float32,
        eta: float32,
        phi: float32,
        mass: float32,
        charge: int32
    }""", highlevel=False)
    

    Given a chunk of separated arrays with type var * float32, you can make a single array with type var * {pt: float32, eta: float32, ...} with ak.zip.

    muons = ak.zip({
        "pt": chunk["Muon_pt"],
        "eta": chunk["Muon_eta"],
        "phi": chunk["Muon_phi"],
        "mass": chunk["Muon_mass"],
        "charge": chunk["Muon_charge"],
    })
    

    (Printing muons.type gives you the type string back.) This is the form you're likely to be using for a data analysis. The assumption was that users would be analyzing data as objects between a read and a write, not reading from one file and writing to another without any modifications.

    Here's a reader-writer, using muons_type:

    with uproot.recreate("/tmp/output.root") as output_file:
        output_ttree = output_file.mktree("Events", {"Muon": muons_type})
    
        with uproot.open(file_path) as input_file:
            input_ttree = input_file["Events"]
    
            for chunk in input_ttree.iterate(step_size="100 MB"):
                muons = ak.zip({
                    "pt": chunk["Muon_pt"],
                    "eta": chunk["Muon_eta"],
                    "phi": chunk["Muon_phi"],
                    "mass": chunk["Muon_mass"],
                    "charge": chunk["Muon_charge"],
                })
    
                output_ttree.extend({"Muon": muons})
    

    Or you could have done it without explicitly constructing the muons_type by keeping track of the first_chunk again:

    with uproot.recreate("/tmp/output.root") as output_file:
        first_chunk = True
    
        with uproot.open(file_path) as input_file:
            input_ttree = input_file["Events"]
    
            for chunk in input_ttree.iterate(step_size="100 MB"):
                muons = ak.zip({
                    "pt": chunk["Muon_pt"],
                    "eta": chunk["Muon_eta"],
                    "phi": chunk["Muon_phi"],
                    "mass": chunk["Muon_mass"],
                    "charge": chunk["Muon_charge"],
                })
    
                if first_chunk:
                    output_file["Events"] = {"Muon": muons}
                    first_chunk = False
                else:
                    output_file["Events"].extend({"Muon": muons})
    

    It is admittedly complex (because I'm showing many alternatives, with different pros and cons), but that's because copying TTrees without modification wasn't a foreseen use-case for the TTree-writing functions. Since it is an important use-case, a specialized function that hides these details would be a welcome addition.