
Saving TLorentzVector info in DataFrames for future analysis


I'm wondering about the recommended protocol for converting TLorentzVector information from .root files into Pandas DataFrames. So far, my strategy has been to save pT, eta, and phi information for each particle I care about. I then write my own functions (based on TLorentzVector definitions) to compute any other quantities I might occasionally need, like DeltaR, mT, etc.
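
For example, a helper of that kind might look roughly like this (just a sketch; the column names such as lep0_eta are placeholders):

import numpy as np

def delta_r(eta1, phi1, eta2, phi2):
    # DeltaR from eta/phi columns; wrap the phi difference into [-pi, pi)
    dphi = np.mod(phi1 - phi2 + np.pi, 2 * np.pi) - np.pi
    deta = eta1 - eta2
    return np.sqrt(deta**2 + dphi**2)

# df["dR_01"] = delta_r(df["lep0_eta"], df["lep0_phi"],
#                       df["lep1_eta"], df["lep1_phi"])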

I then wondered if I could save only the TLorentzVector itself to my DataFrame and use uproot to get quantities like pT, eta, and phi on the fly with something like this (which works on a DataFrame I have just converted from a .root file):

for row in df.index:
    print(df.at[row,"leptons_p4_0"].pt)

I quickly realized, though, that Pandas alone doesn't understand what a TLorentzVector is, so this doesn't work when I reload the file later on using pd.read_csv.

My question, then, is how do others recommend I save TLorentzVector information in a DataFrame that I'll open later in pandas, not uproot? It seems like my options are either to save (pT, eta, phi) columns for each particle and then write my own functions, or to save the TLorentzVector components (E, px, py, pz) and use uproot_methods to convert those components back into a TLorentzVector each time I re-load the DataFrame. Or, hopefully, there's another, easier solution I haven't come across yet!
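
For the second option, I picture something like this on each reload (again just a sketch with made-up file and column names; I'm using the (pT, eta, phi, mass) constructor here, but presumably whichever constructor matches the saved components would do):

import pandas as pd
import uproot_methods

df = pd.read_csv("my_events.csv")

# rebuild the Lorentz vectors from plain component columns
lep0_p4 = uproot_methods.TLorentzVectorArray.from_ptetaphim(
    df["lep0_pt"].values,
    df["lep0_eta"].values,
    df["lep0_phi"].values,
    df["lep0_mass"].values,
)

# now lep0_p4.pt, lep0_p4.eta, etc. are available again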

Thanks very much for any suggestions.


Solution

  • Since Pandas does not have any facilities for dealing with Lorentz vectors, expressing them in terms of their components (pT, eta, phi, mass) and writing your own functions for transforming them is the only way to go, especially if you want to save to and from CSV.

    That said, it is possible to create Lorentz vector objects that retain their "Lorentziness" inside of Pandas, but there are limitations. You can create structured data as Awkward Arrays:

    >>> import awkward1 as ak
    >>> import pandas as pd
    >>> import numpy as np
    >>> class Lorentz:
    ...     @property
    ...     def p(self):
    ...         return self.pt * np.cosh(self.eta)
    ... 
    >>> class LorentzRecord(Lorentz, ak.Record): pass
    ... 
    >>> class LorentzArray(Lorentz, ak.Array): pass
    ... 
    >>> ak.behavior["lorentz"] = LorentzRecord
    >>> ak.behavior["*", "lorentz"] = LorentzArray
    >>> array = ak.Array([{"pt": 1.1, "eta": 2.2},
    ...                   {"pt": 3.3, "eta": 4.4},
    ...                   {"pt": 5.5, "eta": -2.2}],
    ...                  with_name="lorentz")
    >>> array
    <LorentzArray [{pt: 1.1, eta: 2.2}, ... eta: -2.2}] type='3 * lorentz["pt": floa...'>
    

    The above defined an array of records with fields pt and eta and gave both the single-record and the array-of-records views a new property p, which is derived from pt and eta.

    >>> # Each record has a pt, eta, and p.
    >>> array[0].pt
    1.1
    >>> array[0].eta
    2.2
    >>> array[0].p
    5.024699161788051
    >>> # The whole array has a pt, eta, and p (columns).
    >>> array.pt
    <Array [1.1, 3.3, 5.5] type='3 * float64'>
    >>> array.eta
    <Array [2.2, 4.4, -2.2] type='3 * float64'>
    >>> array.p
    <Array [5.02, 134, 25.1] type='3 * float64'>
    

    You can put an array of Lorentz records into a Pandas DataFrame:

    >>> df = pd.DataFrame({"column": array})
    >>> df
                     column
    0   {pt: 1.1, eta: 2.2}
    1   {pt: 3.3, eta: 4.4}
    2  {pt: 5.5, eta: -2.2}
    

    and do the same things with it:

    >>> df.column.values.pt
    <Array [1.1, 3.3, 5.5] type='3 * float64'>
    >>> df.column.values.eta
    <Array [2.2, 4.4, -2.2] type='3 * float64'>
    >>> df.column.values.p
    <Array [5.02, 134, 25.1] type='3 * float64'>
    

    but that's because we're pulling the Awkward Array back out to apply these operations.

    >>> df.column.values
    <LorentzArray [{pt: 1.1, eta: 2.2}, ... eta: -2.2}] type='3 * lorentz["pt": floa...'>
    

    Any NumPy function applied to the DataFrame, such as negation (which implicitly calls np.negative), gets passed through to the Awkward Array without having to unpack it.

    >>> -df
                      column
    0  {pt: -1.1, eta: -2.2}
    1  {pt: -3.3, eta: -4.4}
    2   {pt: -5.5, eta: 2.2}
    

    but at present, it's the wrong operation: it shouldn't negate the pt. It's possible to further overload that:

    >>> def negative_Lorentz(x):
    ...     return ak.zip({"pt": x.pt, "eta": -x.eta})
    ... 
    >>> ak.behavior[np.negative, "lorentz"] = negative_Lorentz
    >>> -df
                     column
    0  {pt: 1.1, eta: -2.2}
    1  {pt: 3.3, eta: -4.4}
    2   {pt: 5.5, eta: 2.2}
    

    We're still building up a suite of functions for Lorentz arrays, but now they work in the array-at-a-time mode that Pandas operates in. There's a project called vector that aims to define all of these functions for 2D, 3D, and Lorentz vectors, but it's in the early stages of development.
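
    For example, extending the same mixin with another derived quantity is just a matter of adding one more property to the Lorentz class (a sketch; the printed values are rounded):

    >>> Lorentz.theta = property(lambda self: 2 * np.arctan(np.exp(-self.eta)))
    >>> array.theta  # polar angle derived from eta, computed array-at-a-time
    <Array [0.221, 0.0246, 2.92] type='3 * float64'>
    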

    Getting back to the issue of saving: none of the above helps with that, because Pandas "saves" these data by printing them out:

    >>> df.to_csv("whatever.csv")
    

    writes

    ,column
    0,"{pt: 1.1, eta: 2.2}"
    1,"{pt: 3.3, eta: 4.4}"
    2,"{pt: 5.5, eta: -2.2}"
    

    which is not something that can be read back. We can try,

    >>> df2 = pd.read_csv("whatever.csv")
    >>> df2
       Unnamed: 0                column
    0           0   {pt: 1.1, eta: 2.2}
    1           1   {pt: 3.3, eta: 4.4}
    2           2  {pt: 5.5, eta: -2.2}
    

    and so far it looks good, but it isn't:

    >>> df2.column.values
    array(['{pt: 1.1, eta: 2.2}', '{pt: 3.3, eta: 4.4}',
           '{pt: 5.5, eta: -2.2}'], dtype=object)
    

    They're strings. They are no longer computable. So if you want to save to a file, break the vectors down into their components.
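
    A round trip through component columns might look like this (a sketch: "components.csv" is a placeholder name, and I'm assuming ak.zip's with_name argument re-attaches the same "lorentz" behavior registered above):

    >>> pd.DataFrame({"pt": np.asarray(array.pt),
    ...               "eta": np.asarray(array.eta)}).to_csv("components.csv", index=False)
    >>> df3 = pd.read_csv("components.csv")
    >>> restored = ak.zip({"pt": df3["pt"].values, "eta": df3["eta"].values},
    ...                   with_name="lorentz")
    >>> restored.p
    <Array [5.02, 134, 25.1] type='3 * float64'>
    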

    Maybe all of this can be pulled together into a usable system, but some aspects, like saving these arrays with their "Lorentziness" intact, are not ready yet.