Using awkward-array with zip/unzip with two different physics objects

I'm trying to reproduce parts of the Higgs discovery in the Higgs --> 4 leptons channel with open data and making use of awkward. I can do it when the leptons are the same (e.g. 4 muons) with zip/unzip, but is there a way to do it in the 2 muon/2 electron channel? I started with the example in the HSF tutorial

https://hsf-training.github.io/hsf-training-scikit-hep-webpage/04-awkward/index.html

So I now have the following. First get the input file

curl http://opendata.cern.ch/record/12361/files/SMHiggsToZZTo4L.root --output SMHiggsToZZTo4L.root

Then I do the following

import numpy as np
import matplotlib.pylab as plt

import uproot
import awkward as ak

# Helper functions

def energy(m, px, py, pz):
    E = np.sqrt( (m**2) + (px**2 + py**2 + pz**2))
    return E

def invmass(E, px, py, pz):

    m2 = (E**2) - (px**2 + py**2 + pz**2)
    
    if m2 < 0:
        m = -np.sqrt(-m2)
    else:
        m = np.sqrt(m2)
    return m


def convert(pt, eta, phi):
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    return px, py, pz

####

# Read in the file
infile_name = 'SMHiggsToZZTo4L.root'
infile = uproot.open(infile_name)

# Convert to Cartesian
muon_pt = infile_signal['Events/Muon_pt'].array()
muon_eta = infile_signal['Events/Muon_eta'].array()
muon_phi = infile_signal['Events/Muon_phi'].array()
muon_q = infile_signal['Events/Muon_charge'].array()
muon_mass = infile_signal['Events/Muon_mass'].array()

muon_px,muon_py,muon_pz = convert(muon_pt, muon_eta, muon_phi)
muon_energy = energy(muon_mass, muon_px, muon_py, muon_pz)

# Do the magic
nevents = len(infile['Events/Muon_pt'].array())
# nevents = 1000 # For testing

max_entries = nevents
muons = ak.zip({
    "px": muon_px[0:max_entries],
    "py": muon_py[0:max_entries],
    "pz": muon_pz[0:max_entries],
    "e": muon_energy[0:max_entries],
    "q": muon_q[0:max_entries],
})

quads = ak.combinations(muons, 4)

mu1, mu2, mu3, mu4 = ak.unzip(quads)

mass_try = (mu1.e + mu2.e + mu3.e + mu4.e)**2 - ((mu1.px + mu2.px + mu3.px + mu4.px)**2 + (mu1.py + mu2.py + mu3.py + mu4.py)**2 + (mu1.pz + mu2.pz + mu3.pz + mu4.pz)**2)

mass_try = np.sqrt(mass_try)

qtot = mu1.q + mu2.q + mu3.q + mu4.q

plt.hist(ak.flatten(mass_try[qtot==0]), bins=100,range=(0,300));

And the histogram looks good!

So how would I do this for 2-electron + 2-muon combinations? I would guess there's a way to make lepton_xxx arrays? But I'm not sure how to do this elegantly (quickly) such that I could also create a flag to keep track of what the lepton combinations are?

Thanks!

Matt

Solution

This could be answered in a variety of ways:

make a union array (mixed data types) of electrons and muons
make an array of electrons and muons that are the same type, but have a flag to indicate flavor (electron vs muon)
use ak.combinations with n=2 for the muons, ak.combinations with n=2 again for the electrons, and then combine them with ak.cartesian (and deal with tuples of tuples, rather than one level of tuples, which would mean two calls to ak.unzip)
break the electron and muon collections down into single-charge collections. You'll want exactly 1 positive muon, 1 negative muon, 1 positive electron, and 1 negative electron, so that would be an ak.cartesian of the four collections.

I'll go with the last method because I've decided that it's easiest.

Another thing you probably want to know about is the Vector library. After

import vector
vector.register_awkward()

you don't have to do explicit coordinate transformations or mass calculations. I'll be using that. Here's how I read in the data:

infile = uproot.open("/tmp/SMHiggsToZZTo4L.root")

muon_branch_arrays = infile["Events"].arrays(filter_name="Muon_*")
electron_branch_arrays = infile["Events"].arrays(filter_name="Electron_*")

muons = ak.zip({
    "pt": muon_branch_arrays["Muon_pt"],
    "phi": muon_branch_arrays["Muon_phi"],
    "eta": muon_branch_arrays["Muon_eta"],
    "mass": muon_branch_arrays["Muon_mass"],
    "charge": muon_branch_arrays["Muon_charge"],
}, with_name="Momentum4D")
electrons = ak.zip({
    "pt": electron_branch_arrays["Electron_pt"],
    "phi": electron_branch_arrays["Electron_phi"],
    "eta": electron_branch_arrays["Electron_eta"],
    "mass": electron_branch_arrays["Electron_mass"],
    "charge": electron_branch_arrays["Electron_charge"],
}, with_name="Momentum4D")

And this reproduces your plot:

quads = ak.combinations(muons, 4)
quad_charge = quads["0"].charge + quads["1"].charge + quads["2"].charge + quads["3"].charge
mu1, mu2, mu3, mu4 = ak.unzip(quads[quad_charge == 0])
plt.hist(ak.flatten((mu1 + mu2 + mu3 + mu4).mass), bins=100, range=(0, 200));

The quoted number slices (e.g. "0" and "1") are picking tuple fields, rather than array entries; it's a manual ak.unzip. (The fields could have had real names if I had given a fields argument to ak.combinations.)

For the two muons, two electrons case, let's make four distinct collections.

muons_plus = muons[muons.charge > 0]
muons_minus = muons[muons.charge < 0]
electrons_plus = electrons[electrons.charge > 0]
electrons_minus = electrons[electrons.charge < 0]

The ak.combinations function (with default axis=1) returns the Cartesian product of each list in the array of lists with itself, excluding an item with itself, (x, x) (unless you want that, specify replacement=True), and excluding one of each symmetric pair (x, y)/(y, x).

If you want just a plain Cartesian product of lists from one array with lists from another array, that's ak.cartesian. The four collections muons_plus, muons_minus, electrons_plus, electrons_minus are non-overlapping and we want each four-lepton group to have exactly one from each, so that's a plain Cartesian product.

mu1, mu2, e1, e2 = ak.unzip(ak.cartesian([muons_plus, muons_minus, electrons_plus, electrons_minus]))
plt.hist(ak.flatten((mu1 + mu2 + e1 + e2).mass), bins=100, range=(0, 200));

Separating particles by flavor (electron vs muon) but not by charge is an artifact of the typing constraints imposed by C++. Electrons are measured in different ways from muons, so they have different attributes in the dataset. In a statically typed language like C++, we couldn't put them in the same collection because they differ by type (have different attributes). But charges only differ by value (an integer), so there was no reason they had to be in different collections.

But now, the only thing distinguishing a type is (a) what attributes the objects have and (b) what objects with that set of attributes are named. Here, I named them "Momentum4D" because that allowed Vector to recognize them as Lorentz vectors and give them Lorentz vector methods. But they could have had other attributes (e.g. e_over_p for electrons, but not for muons). charge is a non-Momentum4D attribute.

So if we're going to the trouble to put different-flavor leptons in different arrays, why not put different-charge leptons in different arrays, too? Physically, flavor and charge are both particle properties at the same level of distinction, so we probably want to do the analysis that way. No reason to let a C++ type distinction get in the way of the analysis logic!

Also, you probably want

nevents = infile["Events"].num_entries

to get the number of entries from a TTree without reading the array data from it.