Search code examples
pythonpandaspivot-tabledcast

Pivot_table in pandas with a rowid sequence


I have data like this:

ID        Cue       trial     time     accuracy
A         apple     copy      1450     1 
A         dog       copy      2154     1
A         apple     test1     2121     0
A         dog       test2     0        1
A         apple     final     1231     0
A         dog       final     5411     1
B         apple     copy      818      0
B         ...       ...       ..      ...

And I need to transform it so that each ID and Cue combination is all on one line:

  A       apple   apple    apple   copy  test1   final   1450  2121  1231  1   0   0
  ...

In R, I can do it easily with data.table using dcast like this:

 dcast(data, ID + Cue ~ rowid(ID, Cue), value.var=c("time", "accuracy"))

But in Pandas I have trouble. If I use pivot_table, it does not come out right. For instance, if I do this:

 data.pivot_table(index=['ID', 'Cue', 'TrialType'], values=['time', 'accuracy'])

It always comes out like this:

 A      apple     copy     1450    1
                  final    1231    0
                  test1    2121    0

The problem is that I need the "trial" column to not be stacked, but pivot_table / crosstab do not seem capable of handling this. Be aware that if I put the "columns='trial'" in the pivot_table it will not work for my purposes because the 'trial' column is used as a categorical variable in my analysis.

Any hints are appreciated.


Solution

  • Copy column Cue to new Cue1 for second position by DataFrame.insert, unpivot by DataFrame.melt, add counter by GroupBy.cumcount and reshape by DataFrame.set_index with Series.unstack:

    df.insert(1, 'Cue1', df['Cue'])
    df = df.melt(['ID','Cue'])
    df['g'] = df.groupby(['ID','Cue']).cumcount()
    df = df.set_index(['ID', 'Cue', 'g'])['value'].unstack().reset_index(level=1, drop=True)
    print (df)
    
    g      0      1      2     3      4      5     6     7     8    9    10   11
    ID                                                                          
    A   apple  apple  apple  copy  test1  final  1450  2121  1231    1    0    0
    A     dog    dog    dog  copy  test2  final  2154     0  5411    1    1    1
    B   apple   copy    818     0    NaN    NaN   NaN   NaN   NaN  NaN  NaN  NaN
    

    Another idea:

    c = df.columns.difference(['ID'], sort=False)
    df = (df.groupby(['ID','Cue'])[c]
            .apply(lambda x: pd.DataFrame([x.T.values.ravel()]))
            .reset_index(level=[1,2], drop=True))
    print (df)
            0      1      2     3      4      5       6       7       8    9   10  \
    ID                                                                              
    A   apple  apple  apple  copy  test1  final  1450.0  2121.0  1231.0  1.0  0.0   
    A     dog    dog    dog  copy  test2  final  2154.0     0.0  5411.0  1.0  1.0   
    B   apple   copy    818     0    NaN    NaN     NaN     NaN     NaN  NaN  NaN   
    
         11  
    ID       
    A   0.0  
    A   1.0  
    B   NaN