Search code examples
pythonpandasdataframecudf

Extracting specific rows from a multi-indexed Pandas Dataframe to form new DataFrame


I have a data set that I am loading onto a Pandas dataframe that is a Jagged 3-D array called: Waveform. The dataframe is multi-indexed by three levels: Events (Entry), Photons (Subentry) generated by each event, data points (subsubentry) per photon.

The number of data points and Photons varies randomly per each event, hence why it's jagged. I want to extract the Photons (subentries) for each Event (entry) that contain at minimum an "2*n" number of data points, take the average of the first n data points of the selected photons and save them in a new dataframe that contains their respective event and photon index.

I can't put my actual data here because it's too large and jagged so I will create an example that is the same, just scaled down in terms of data.

import awkward as ak
import pandas as pd

#I'm gonna generate an example awkward array that I want to convert to a Pandas DF
wf = ak.to_pandas(ak.Array([ [[1,2,5,6,8,3,21,3],[5986.472,0,6,1,2,3],[0]],[[1]],[[0.1,23,534,21,53,12],[0]],[[1],[2],[0],[12,12,12,12,125,34]],[[76],[23,23,43],],[[0],[12,12,12,12]] ]))

print(wf)
                              values
entry subentry subsubentry          
0     0        0               1.000
               1               2.000
               2               5.000
               3               6.000
               4               8.000
               5               3.000
               6              21.000
               7               3.000
      1        0            5986.472
               1               0.000
               2               6.000
               3               1.000
               4               2.000
               5               3.000
      2        0               0.000
1     0        0               1.000
2     0        0               0.100
               1              23.000
               2             534.000
               3              21.000
               4              53.000
               5              12.000
      1        0               0.000
3     0        0               1.000
      1        0               2.000
      2        0               0.000
      3        0              12.000
               1              12.000
               2              12.000
               3              12.000
               4             125.000
               5              34.000
4     0        0              76.000
      1        0              23.000
               1              23.000
               2              43.000
5     0        0               0.000
      1        0              12.000
               1              12.000
               2              12.000
               3              12.000

#This is what I want the filter/Extraction to produce

wf_pF = ak.to_pandas(ak.Array([[[1,2,5,6,8,3,21,3],[5986.472,0,6,1,2,3,5]],[[0.1,23,534,21,53,12]],[[12,12,12,12,125,34]] ]))

print(wf_pF)
                              values
entry subentry subsubentry          
0     0        0               1.000
               1               2.000
               2               5.000
               3               6.000
               4               8.000
               5               3.000
               6              21.000
               7               3.000
      1        0            5986.472
               1               0.000
               2               6.000
               3               1.000
               4               2.000
               5               3.000
               6               5.000
1     0        0               0.100
               1              23.000
               2             534.000
               3              21.000
               4              53.000
               5              12.000
2     0        0              12.000
               1              12.000
               2              12.000
               3              12.000
               4             125.000
               5              34.000


#I then want to take the average of the first n datapoints and place them into a new dataframe as such

averages = ak.to_pandas(ak.Array([[2.666,1997.333],[185.7],[12]]))
print(averages)
                  values
entry subentry          
0     0            2.666
      1         1997.333
1     0          185.700
2     0           12.000

I used query to look for the 2n -1 datapoint first (in this case I used n = 3 so "5") in the level subsubentry Wf_n = wf.query('subsubentry == 5'). I took the index of this new dataframe Wf_n and converted the indices of the Entry and Subentry into their respective NumPy arrays

nQuery  = wf.query('subsubentry == 5')
indices = nQuery.index.to_frame()["entry"]
indices2 = nQuery.index.to_frame()["subentry"]
ind = pd.Series.to_numpy(indices)
ind2 = pd.Series.to_numpy(indices2)

Then I used query to extract the Entries with their respective subentries with the following:

wf_AF = wf.query("entry in @ind and subentry in @ind2")
print(wf_AF)

which results in this dataframe wf_AF

                              values
entry subentry subsubentry          
0     0        0               1.000
               1               2.000
               2               5.000
               3               6.000
               4               8.000
               5               3.000
               6              21.000
               7               3.000
      1        0            5986.472
               1               0.000
               2               6.000
               3               1.000
               4               2.000
               5               3.000
2     0        0               0.100
               1              23.000
               2             534.000
               3              21.000
               4              53.000
               5              12.000
      1        0               0.000
3     0        0               1.000
      1        0               2.000
      3        0              12.000
               1              12.000
               2              12.000
               3              12.000
               4             125.000
               5              34.000

It's still keeping subentries (Photons) that contain less than the desired subsubentries (datapoints) number threshold, 2*n. What am I doing wrong? Is there something I am not understanding? What can I do to achieve this specific method of filtering and can it be implemented in CuDF? Because there's so much data it would be ideal if I could replicate this as well in CuDF.


Solution

  • UPDATE: I deserve a dang medal...

    Reading through the documentation of awkward, particularly check out help(ak.Array.__getitem__) as it's very informative. Also see help(ak.num), help(ak.mean) etc.

    These use awkwards' own methods to filter, slice, and perform calculations on the array... which I'm sure is significantly faster than pandas. Pandas are only used here for visualization.

    array = ak.Array([ [[1,2,5,6,8,3,21,3],[5986.472,0,6,1,2,3],[0]],[[1]],[[0.1,23,534,21,53,12],[0]],[[1],[2],[0],[12,12,12,12,125,34]],[[76],[23,23,43],],[[0],[12,12,12,12]] ])
    filtered = array[ak.num(array, -1)>5]
    df = ak.to_pandas(filtered)
    print(df)
    

    Output:

                                  values
    entry subentry subsubentry
    0     0        0               1.000
                   1               2.000
                   2               5.000
                   3               6.000
                   4               8.000
                   5               3.000
                   6              21.000
                   7               3.000
          1        0            5986.472
                   1               0.000
                   2               6.000
                   3               1.000
                   4               2.000
                   5               3.000
    2     0        0               0.100
                   1              23.000
                   2             534.000
                   3              21.000
                   4              53.000
                   5              12.000
    3     0        0              12.000
                   1              12.000
                   2              12.000
                   3              12.000
                   4             125.000
                   5              34.000
    

    Say we only want the first three values from each of those...

    first_three = filtered[:,:,:3]
    df2 = ak.to_pandas(first_three)
    print(df2)
    

    Output:

                                  values
    entry subentry subsubentry
    0     0        0               1.000
                   1               2.000
                   2               5.000
          1        0            5986.472
                   1               0.000
                   2               6.000
    2     0        0               0.100
                   1              23.000
                   2             534.000
    3     0        0              12.000
                   1              12.000
                   2              12.000
    

    Now say we want the average of those...

    averages = ak.mean(first_three, axis=2)
    df3 = ak.to_pandas(averages)
    print(df3)
    

    Output:

                         values
    entry subentry
    0     0            2.666667
          1         1997.490667
    2     0          185.700000
    3     0           12.000000
    

    All in one step:

    ak.to_pandas(ak.mean(array[ak.num(array, -1)>5,:3], axis=2))
    


    Pandas Only:

    n = 3
    wf_AF = wf.reset_index(-1).groupby(level=[-2, -1]).filter(lambda x: x.subsubentry.max() >= 2*n-1).set_index('subsubentry', append=True)
    # OR
    wf_AF = wf.groupby(level=[-3, -2]).filter(lambda x: x.index.to_series().str[2].max() >= 2*n-1)
    print(wf_AF)
    

    Output:

                                  values
    entry subentry subsubentry
    0     0        0               1.000
                   1               2.000
                   2               5.000
                   3               6.000
                   4               8.000
                   5               3.000
                   6              21.000
                   7               3.000
          1        0            5986.472
                   1               0.000
                   2               6.000
                   3               1.000
                   4               2.000
                   5               3.000
    2     0        0               0.100
                   1              23.000
                   2             534.000
                   3              21.000
                   4              53.000
                   5              12.000
    3     3        0              12.000
                   1              12.000
                   2              12.000
                   3              12.000
                   4             125.000
                   5              34.000
    

    averages = wf_AF.groupby(level=[-3, -2]).agg(lambda x: x.head(n).mean())
    print(averages)
    

    Output:

                         values
    entry subentry
    0     0            2.666667
          1         1997.490667
    2     0          185.700000
    3     3           12.000000