Nested DataFrames/Indexes in python pandas


Aim

I am trying to manipulate data from some video tracking experiments using python pandas. I placed a number of point markers on a structure, and tracked the points' XY coordinates over time. Together these data describe the shape of the structure over the course of the test. I am having trouble arranging my data into a hierarchical/nested DataFrame object.

Importing the data

My tracking method outputs each point's X,Y coordinates (and time) for each frame of video. This data is stored in csv files with a column for each variable, and a row for each video frame:

t,x,y
0.000000000E0,-4.866015168E2,-2.116143012E0
1.000000000E-1,-4.866045511E2,-2.123012558E0
2.000000000E-1,-4.866092436E2,-2.129722560E0

Using pandas.read_csv, I can read these csv files into DataFrames with the same column/row layout:

In [1]: pd.read_csv("point_a.csv")
Out[1]: 
     t           x         y
0  0.0 -486.601517 -2.116143
1  0.1 -486.604551 -2.123013
2  0.2 -486.609244 -2.129723

No problem so far.
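Since there is one csv file per point, the import step can be sketched as a dict of DataFrames keyed by point name. This is only an illustration: the csv text is held in strings so the snippet runs without files on disk, and the point_b values are made up.

```python
import io

import pandas as pd

# Stand-ins for point_a.csv and point_b.csv (point_b values are invented).
csv_texts = {
    "point_a": (
        "t,x,y\n"
        "0.000000000E0,-4.866015168E2,-2.116143012E0\n"
        "1.000000000E-1,-4.866045511E2,-2.123012558E0\n"
    ),
    "point_b": (
        "t,x,y\n"
        "0.000000000E0,-4.500000000E2,-1.000000000E0\n"
        "1.000000000E-1,-4.500100000E2,-1.100000000E0\n"
    ),
}

# One DataFrame per point; each has columns t, x, y and one row per frame.
frames = {name: pd.read_csv(io.StringIO(text)) for name, text in csv_texts.items()}
```

With real files, `io.StringIO(text)` would be replaced by the file path, e.g. `pd.read_csv("point_a.csv")`.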

Creating a hierarchical structure

I would like to merge several of the above DataFrames (one for each point) into one large DataFrame with hierarchical columns, where all variables share one index (video frames). In the layout below, the columns point_a, point_b, etc. each have subcolumns x, y, t. The shape column holds vectors that are useful for plotting the shape of the structure.

        |   point_a     |   point_b     |   point_c     |   shape
frames  |   x   y   t   |   x   y   t   |   x   y   t   |   x               y
-----------------------------------------------------------------------------------
0       |   xa0 ya0 ta0 |   xb0 yb0 tb0 |   xc0 yc0 tc0 |   [xa0,xb0,xc0]   [ya0,yb0,yc0]
1       |   xa1 ya1 ta1 |   xb1 yb1 tb1 |   xc1 yc1 tc1 |   [xa1,xb1,xc1]   [ya1,yb1,yc1]
2       |   xa2 ya2 ta2 |   xb2 yb2 tb2 |   xc2 yc2 tc2 |   [xa2,xb2,xc2]   [ya2,yb2,yc2]
3       |   xa3 ya3 ta3 |   xb3 yb3 tb3 |   xc3 yc3 tc3 |   [xa3,xb3,xc3]   [ya3,yb3,yc3]

I would like to specify a video frame and retrieve a variable's value for that frame, e.g. something like df[1].point_b.y returning yb1.
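For reference, once a DataFrame has two-level columns, this kind of lookup is normally written with .loc and a (point, variable) tuple rather than attribute chaining. The sketch below builds a small placeholder frame by hand; the numbers are arbitrary and only the access pattern matters.

```python
import numpy as np
import pandas as pd

# Two-level columns: outer level is the point, inner level is the variable.
columns = pd.MultiIndex.from_product([["point_a", "point_b"], ["x", "y", "t"]])

# Placeholder data: 3 frames x 6 columns, filled with 0.0 .. 17.0.
values = np.arange(18, dtype=float).reshape(3, 6)
df = pd.DataFrame(values, columns=columns)
df.index.name = "frames"

# Value of point_b's y coordinate at frame 1.
yb1 = df.loc[1, ("point_b", "y")]

# Everything recorded at frame 1, as a Series indexed by (point, variable).
frame_1 = df.loc[1]

# All frames for one point, as an ordinary DataFrame with columns x, y, t.
point_b = df["point_b"]
```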

What I have tried so far

Nested dicts as input

My previous approach to handling this kind of thing is to use nested dicts:

nested_dicts = {
    "point_a": {
        "x": [xa0, xa1, xa2], 
        "y": [ya0, ya1, ya2], 
        "t": [ta0, ta1, ta2],
        },
    "point_b": {
        "x": [xb0, xb1, xb2], 
        "y": [yb0, yb1, yb2], 
        "t": [tb0, tb1, tb2],
        },
    "point_c": {
        "x": [xc0, xc1, xc2], 
        "y": [yc0, yc1, yc2], 
        "t": [tc0, tc1, tc2],
        },
    }

This does everything I need except for slicing the data by frame number. When I try to use this nested dict as an input to a DataFrame, I get the following:

In [1]: pd.DataFrame(nested_dicts)
Out[1]:
           point_a          point_b          point_c
t  [ta0, ta1, ta2]  [tb0, tb1, tb2]  [tc0, tc1, tc2]
x  [xa0, xa1, xa2]  [xb0, xb1, xb2]  [xc0, xc1, xc2]
y  [ya0, ya1, ya2]  [yb0, yb1, yb2]  [yc0, yc1, yc2]

Problem: there is no shared frames index. The DataFrame has taken t,x,y as the index.
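The behaviour above follows from how the DataFrame constructor reads a dict of dicts: outer keys become columns and inner keys become the index, so each list ends up as a single cell. A small sketch with made-up numbers, contrasted with passing just one inner dict:

```python
import pandas as pd

# Made-up coordinates for two points over three frames.
nested = {
    "point_a": {"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0], "t": [0.0, 0.1, 0.2]},
    "point_b": {"x": [7.0, 8.0, 9.0], "y": [10.0, 11.0, 12.0], "t": [0.0, 0.1, 0.2]},
}

# Dict of dicts: outer keys -> columns, inner keys -> index labels.
# Each cell holds a whole list, so there is no per-frame row.
wide = pd.DataFrame(nested)

# A single inner dict (dict of lists): keys -> columns, list positions -> rows.
# This is the per-point layout with one row per frame.
per_point = pd.DataFrame(nested["point_a"])
```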

Specifying an index for nested dict input

If I try to specify an index:

In [1]: pd.DataFrame(nested_dicts, index=range(number_of_frames)) 

Then I get a DataFrame with the correct number of rows, but no subcolumns, and full of NaNs:

Out[1]:
    point_a point_b point_c
0   NaN     NaN     NaN    
1   NaN     NaN     NaN  
2   NaN     NaN     NaN  
3   NaN     NaN     NaN  
4   NaN     NaN     NaN  
5   NaN     NaN     NaN  
6   NaN     NaN     NaN  
7   NaN     NaN     NaN  
8   NaN     NaN     NaN 
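The NaNs appear because, for a dict of dicts, the index argument selects among the inner keys rather than relabelling rows. The labels 0, 1, 2, ... do not match "x", "y" or "t", so every lookup comes back empty. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Made-up data for a single point over two frames.
nested = {"point_a": {"x": [1.0, 2.0], "y": [3.0, 4.0]}}

# Integer labels are looked up among the inner keys "x" and "y".
# None of them match, so every cell is NaN.
df_nan = pd.DataFrame(nested, index=range(2))

# Labels that do match the inner keys return the stored lists.
df_ok = pd.DataFrame(nested, index=["x", "y"])
```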

Adding each DataFrame individually

If I create a DataFrame for each point:

point_a =               point_b =
    t    x    y             t    x    y
0   ta0  xa0  ya0       0   tb0  xb0  yb0
1   ta1  xa1  ya1       1   tb1  xb1  yb1
2   ta2  xa2  ya2       2   tb2  xb2  yb2

and pass these to a DataFrame, indicating the index to be shared, as follows:

In [1]: pd.DataFrame({"point_a":point_a,"point_b":point_b},index=point_a.index)

then I get the following, which contains only the column labels t, x, y (as one-element tuples) rather than the data:

Out[1]:
    point_a point_b
0   (t,)    (t,)
1   (x,)    (x,)
2   (y,)    (y,)

Solution

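  • You can use a dict comprehension with concat, then reshape the DataFrame with stack and unstack:

    # concat stacks the per-point frames under a (point, frame) row index;
    # stack/unstack then moves point and variable into the columns.
    df = (pd.concat({key: pd.DataFrame(nested_dicts[key]) for key in nested_dicts})
            .stack()
            .unstack([0, 2]))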
    
    print (df)
      point_a           point_b           point_c          
            t    x    y       t    x    y       t    x    y
    0     ta0  xa0  ya0     tb0  xb0  yb0     tc0  xc0  yc0
    1     ta1  xa1  ya1     tb1  xb1  yb1     tc1  xc1  yc1
    2     ta2  xa2  ya2     tb2  xb2  yb2     tc2  xc2  yc2
    

    Another solution is to unstack the point level, swap the column levels with swaplevel, and then sort the first level of the column MultiIndex with sort_index:

    df = (pd.concat({key: pd.DataFrame(nested_dicts[key]) for key in nested_dicts})
            .unstack(0))
    
    df.columns = df.columns.swaplevel(0,1)
    df = df.sort_index(level=0, axis=1)
    print (df)
      point_a           point_b           point_c          
            t    x    y       t    x    y       t    x    y
    0     ta0  xa0  ya0     tb0  xb0  yb0     tc0  xc0  yc0
    1     ta1  xa1  ya1     tb1  xb1  yb1     tc1  xc1  yc1
    2     ta2  xa2  ya2     tb2  xb2  yb2     tc2  xc2  yc2
    

    Or, on older pandas versions only (Panel was deprecated in 0.20 and removed in 0.25), you can use Panel with transpose and to_frame:

    df = pd.Panel(nested_dicts).transpose(0,1,2).to_frame().unstack()
    print (df)
          point_a           point_b           point_c          
    minor       t    x    y       t    x    y       t    x    y
    major                                                      
    0         ta0  xa0  ya0     tb0  xb0  yb0     tc0  xc0  yc0
    1         ta1  xa1  ya1     tb1  xb1  yb1     tc1  xc1  yc1
    2         ta2  xa2  ya2     tb2  xb2  yb2     tc2  xc2  yc2
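As a variation on the concat approach above (not part of the original answer), passing axis=1 to concat appears to give the hierarchical columns directly, without the stack/unstack step. The sketch below uses made-up numbers and also shows one way to pull out the per-frame "shape" vectors with xs; the names frames, shape_x and shape_y are illustrative only.

```python
import pandas as pd

# Made-up coordinates for two points over three frames.
nested = {
    "point_a": {"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0], "t": [0.0, 0.1, 0.2]},
    "point_b": {"x": [7.0, 8.0, 9.0], "y": [10.0, 11.0, 12.0], "t": [0.0, 0.1, 0.2]},
}

# One DataFrame per point, each with a row per frame.
frames = {name: pd.DataFrame(cols) for name, cols in nested.items()}

# Side-by-side concat: dict keys become the outer column level.
df = pd.concat(frames, axis=1)
df.index.name = "frames"

# Value of point_b's y coordinate at frame 1.
yb1 = df.loc[1, ("point_b", "y")]

# "Shape" vectors: the x (or y) of every point, one row per frame.
shape_x = df.xs("x", axis=1, level=1)
shape_y = df.xs("y", axis=1, level=1)

# x coordinates of all points at frame 1, in point order.
xs_frame_1 = shape_x.loc[1].tolist()
```

Keeping the shape vectors as separate frames, rather than storing lists inside cells, keeps every column numeric, which is usually easier to plot and slice.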