
Pandas DataFrame to multi-dimensional list of lists (MxNxD)


Problem:

I am attempting to group the sample dataset below into two groups, Case and Run, then turn the remaining columns into lists. Why a list of lists?

  1. The number of time samples can vary from run to run.
  2. The number of runs per case can vary.

Note: I am trying to do this without direct for loops/enumeration.

Case,Run,Timestep,Value0,Value1,ValueN
0,0,0,2.626105446,-0.340988725,1.459768689
0,0,1,-0.200319404,-1.709593368,-1.011965009
0,0,2,-2.433554966,-1.558093989,0.44943261
0,1,0,0.02353679,0.280785593,-0.238524231
0,1,1,0.203162019,-1.502525578,1.40292514
0,1,2,-0.037211304,-0.319558867,-1.322124136
0,2,0,-1.45331218,-0.20914855,0.276561536
0,2,1,-0.304656195,-1.594559675,0.264025112
0,2,2,0.030118014,-0.712998507,-0.044159056
1,0,0,0.237803452,0.153455073,-1.084628161
1,0,1,0.814585507,-0.825123856,0.165566792
1,0,2,0.362165592,-0.12731861,-0.572814142
1,1,0,-0.49709788,-0.499310228,0.082185424
1,1,1,0.326764013,0.914642226,-0.96102063
1,1,2,-0.43253699,0.280958492,0.098690723
1,2,0,-0.854096958,0.559531068,0.547050394
1,2,1,0.331055857,1.021091422,0.032419058
1,2,2,0.712204394,-2.050207709,-0.507117825
2,0,0,-0.545231265,1.112460127,0.156334193
2,0,1,-1.507139756,1.296971255,-1.247736656
2,0,2,-0.661400305,0.50431061,1.24670996
2,1,0,-0.215960487,-0.788252945,0.481096264
2,1,1,-1.233657743,0.315172941,0.862535335
2,1,2,0.371303046,0.770596557,2.017713093
2,2,0,0.282822836,-0.332043452,1.107596241
2,2,1,0.085487262,1.104947912,0.545943326
2,2,2,0.864801017,1.082528278,0.386386756

Ideally my end result would be something along the lines of MxNxD, where D = [Timestep, Value0, Value1, ValueN]. Therefore, if arr = data, then:

arr[0] = Nested list of all n_runs within Case 0

arr[0][0] = Nested list of all samples of run 0 case 0

arr[0][0][0] = Sample0 (1x4) from run 0 case 0
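The indexing relationships above can be sanity-checked with plain `len()` calls on a hypothetical miniature result (the data below is hand-built for illustration only; the case/run counts and values are made up):

```python
# Hypothetical miniature result: 2 cases with differing run counts,
# where each sample is [Timestep, Value0, Value1, ValueN].
data = [
    [  # Case 0: two runs
        [[0.0, 1.1, 2.2, 3.3], [1.0, 4.4, 5.5, 6.6]],  # Run 0: two samples
        [[0.0, 7.7, 8.8, 9.9]],                        # Run 1: one sample
    ],
    [  # Case 1: a single run with one sample
        [[0.0, 0.1, 0.2, 0.3]],
    ],
]

assert len(data) == 2            # M: number of cases
assert len(data[0]) == 2         # N: runs within Case 0 (can vary per case)
assert len(data[0][0][0]) == 4   # D: [Timestep, Value0, Value1, ValueN]
```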

Desired Output:

Note the format only: I wrote this out manually and copied and pasted the first level of the object structure, so the values repeat:

[[[[0.0, 2.626105446, -0.340988725, 1.459768689], [1.0, -0.200319404, -1.709593368, -1.011965009], [2.0, -2.433554966, -1.558093989, 0.44943261]], [[0.0, 0.02353679, 0.280785593, -0.238524231], [1.0, 0.203162019, -1.502525578, 1.40292514], [2.0, -0.037211304, -0.319558867, -1.322124136]], [[0.0, -1.45331218, -0.20914855, 0.276561536], [1.0, -0.304656195, -1.594559675, 0.264025112], [2.0, 0.030118014, -0.712998507, -0.044159056]]], [[[0.0, 2.626105446, -0.340988725, 1.459768689], [1.0, -0.200319404, -1.709593368, -1.011965009], [2.0, -2.433554966, -1.558093989, 0.44943261]], [[0.0, 0.02353679, 0.280785593, -0.238524231], [1.0, 0.203162019, -1.502525578, 1.40292514], [2.0, -0.037211304, -0.319558867, -1.322124136]], [[0.0, -1.45331218, -0.20914855, 0.276561536], [1.0, -0.304656195, -1.594559675, 0.264025112], [2.0, 0.030118014, -0.712998507, -0.044159056]]], [[[0.0, 2.626105446, -0.340988725, 1.459768689], [1.0, -0.200319404, -1.709593368, -1.011965009], [2.0, -2.433554966, -1.558093989, 0.44943261]], [[0.0, 0.02353679, 0.280785593, -0.238524231], [1.0, 0.203162019, -1.502525578, 1.40292514], [2.0, -0.037211304, -0.319558867, -1.322124136]], [[0.0, -1.45331218, -0.20914855, 0.276561536], [1.0, -0.304656195, -1.594559675, 0.264025112], [2.0, 0.030118014, -0.712998507, -0.044159056]]]]

Current Output (UPDATE):

The code below works, but it relies on nested for loops. I'm trying to find a more efficient or "Pythonic" way to do it.

Code:

import pandas as pd
import json

COL_NAMES = ['Case', 'Run', 'Timestep', 'Value0', 'Value1', 'ValueN']

case = COL_NAMES[0]
run = COL_NAMES[1]

filepath = 'sample.csv'
df = pd.read_csv(filepath, header=0)  # read_csv accepts a path directly

# Append per-group results rather than indexing by the group key:
# group keys are not guaranteed to be consecutive integers starting at 0.
nested_list = []
for _, case_group in df.groupby(case):            # one group per Case
    run_lists = []
    for _, run_group in case_group.groupby(run):  # one group per Run
        run_lists.append(run_group[COL_NAMES[2:]].to_numpy().tolist())
    nested_list.append(run_lists)

data = nested_list

with open("sample.json", "w") as output:
    json.dump(data, output)

Output:

[[[[0.0, 2.626105446, -0.340988725, 1.459768689], [1.0, -0.200319404, -1.709593368, -1.011965009], [2.0, -2.433554966, -1.558093989, 0.44943261]], [[0.0, 0.02353679, 0.280785593, -0.238524231], [1.0, 0.203162019, -1.502525578, 1.40292514], [2.0, -0.037211304, -0.319558867, -1.322124136]], [[0.0, -1.45331218, -0.20914855, 0.276561536], [1.0, -0.304656195, -1.594559675, 0.264025112], [2.0, 0.030118014, -0.712998507, -0.044159056]]], [[[0.0, 0.237803452, 0.153455073, -1.084628161], [1.0, 0.814585507, -0.825123856, 0.165566792], [2.0, 0.362165592, -0.12731861, -0.572814142]], [[0.0, -0.49709788, -0.499310228, 0.082185424], [1.0, 0.326764013, 0.914642226, -0.96102063], [2.0, -0.43253699, 0.280958492, 0.098690723]], [[0.0, -0.854096958, 0.559531068, 0.547050394], [1.0, 0.331055857, 1.021091422, 0.032419058], [2.0, 0.712204394, -2.050207709, -0.507117825]]], [[[0.0, -0.545231265, 1.112460127, 0.156334193], [1.0, -1.507139756, 1.296971255, -1.247736656], [2.0, -0.661400305, 0.50431061, 1.24670996]], [[0.0, -0.215960487, -0.788252945, 0.481096264], [1.0, -1.233657743, 0.315172941, 0.862535335], [2.0, 0.371303046, 0.770596557, 2.017713093]], [[0.0, 0.282822836, -0.332043452, 1.107596241], [1.0, 0.085487262, 1.104947912, 0.545943326], [2.0, 0.864801017, 1.082528278, 0.386386756]]]]

Solution

  • Sequential application of groupby

    It looks like you need two consecutive groupings before transforming to a list:

    columns = df.columns.drop(['Case','Run'])
    
    data = df.groupby('Case').apply(
            lambda d: d.groupby('Run')[columns].apply(pd.Series.tolist)
    ).to_numpy().tolist()
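As a quick sanity check of the two consecutive groupings on a small, regular frame (the data is invented; `fr.to_numpy().tolist()` stands in for the unbound `pd.Series.tolist` trick used above, producing the same list-of-rows per group):

```python
import pandas as pd

# Small regular frame: 2 cases x 2 runs x 2 timesteps.
df = pd.DataFrame({
    'Case':     [0, 0, 0, 0, 1, 1, 1, 1],
    'Run':      [0, 0, 1, 1, 0, 0, 1, 1],
    'Timestep': [0, 1, 0, 1, 0, 1, 0, 1],
    'Value0':   [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})
columns = df.columns.drop(['Case', 'Run'])

# Outer groupby produces one row per Case; the inner apply produces
# one list of samples per Run within that Case.
data = df.groupby('Case').apply(
    lambda d: d.groupby('Run')[columns].apply(lambda fr: fr.to_numpy().tolist())
).to_numpy().tolist()

# data[0][0] holds all samples of Case 0 / Run 0 as [Timestep, Value0] rows.
```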
    

  • Applying groupby in a recursion

    Here's how we could do it automatically. Say we want to group by more than 2 columns, one at a time, each time wrapping the result in a separate list. What if we set those columns as the index, then on each iteration group the data by the very first index level (i.e. groupby(level=0)) and apply the same logic to each group with level 0 dropped? On the last step we just apply pd.Series.tolist:

    tolist = pd.Series.tolist
    
    def curry(df): 
        return [
            curry(gr.droplevel(0)) if gr.index.nlevels > 1 else tolist(gr) 
            for _, gr in df.groupby(level=0)
        ]
    
    data = curry(df.set_index(['Case', 'Run']))
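A usage sketch on a deliberately ragged mini-frame (hand-made data; the leaf step is written as `.to_numpy().tolist()` rather than the unbound `pd.Series.tolist`, which behaves the same here):

```python
import pandas as pd

# Ragged toy data: Case 0 has two runs (with 2 and 1 samples), Case 1 has one run.
df = pd.DataFrame({
    'Case':     [0, 0, 0, 1],
    'Run':      [0, 0, 1, 0],
    'Timestep': [0, 1, 0, 0],
    'Value0':   [1.0, 2.0, 3.0, 4.0],
})

def curry(frame):
    # Recurse while the index still has more than one level; at the leaves,
    # dump the remaining columns (Timestep, Value0) to a list of row lists.
    return [
        curry(gr.droplevel(0)) if gr.index.nlevels > 1 else gr.to_numpy().tolist()
        for _, gr in frame.groupby(level=0)
    ]

data = curry(df.set_index(['Case', 'Run']))
# data == [[[[0.0, 1.0], [1.0, 2.0]], [[0.0, 3.0]]], [[[0.0, 4.0]]]]
```

The ragged shapes survive: Case 0 yields two run lists of different lengths, while Case 1 yields one.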
    

    This is how we can rewrite the code to visualize the recursion on a 3-level index:

    g = df.set_index(['Case','Run','Timestep'])
    
    data = \
       [[[tolist(ts_g) for _, ts_g in run_g.groupby(level=2)]
                       for _, run_g in case_g.groupby(level=1)]
                       for _, case_g in g.groupby(level=0)]
    

    To solve the original problem, we would stop at 2 levels:

    g = df.set_index(['Case','Run'])
    data = [[tolist(run_g) for _, run_g in case_g.groupby(level=1)]
                           for _, case_g in g.groupby(level=0)]
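A quick check of the two-level version on a toy frame (values are invented; `.to_numpy().tolist()` replaces the unbound `pd.Series.tolist` at the leaves, with the same result):

```python
import pandas as pd

df = pd.DataFrame({
    'Case':     [0, 0, 1],
    'Run':      [0, 1, 0],
    'Timestep': [0, 0, 0],
    'Value0':   [1.5, 2.5, 3.5],
})

g = df.set_index(['Case', 'Run'])
# Outer comprehension iterates cases (level 0), inner iterates runs (level 1);
# each leaf keeps the remaining columns [Timestep, Value0] as row lists.
data = [[run_g.to_numpy().tolist() for _, run_g in case_g.groupby(level=1)]
                                   for _, case_g in g.groupby(level=0)]
# data == [[[[0.0, 1.5]], [[0.0, 2.5]]], [[[0.0, 3.5]]]]
```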