Search code examples
pythonpandasdataframedata-analysismulti-index

Pandas: Collapsing a multi-index dataframe into a single row and column separated by a comma


I have a df that looks like this:

df.set_index(['pid','vid]).sort_values(by = 'time')

Before:

                          pid    time 

    id           vid      
    
    id1         vis_id1    pid1    t_0      
                vis_id1    pid2    t_1
                vis_id1    pid1    t_2
                vis_id1    pid2    t_3
                vis_id1    pid1    t_4       
    
    id2         vis_id2    pid1    t_3      
                vis_id2    pid2    t_4      
                vis_id2    pid2    t_5      
                vis_id2    pid2    t_6
                vis_id2    pid2    t_7 

I want to collapse all of the pid such that for every id we have the following df

                                  pid       
id           vid      

id1         vis_id1    pid1, pid2, pid1, pid2, pid1               

id2         vis_id2    pid1, pid2, pid2, pid2, pid2      


  

I have tried transposing the data first and then applying [steps].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1) but if I can avoid the transposition all together that would save me a lot of computing power


Solution

  • You can .groupby + .agg() with aggregation with ', '.join, as follows:

    df.groupby(['id', 'vid'])[['pid']].agg(', '.join)
    

    Note the double square bracket [[...]] used around pid in [['pid']]

    Result:

                                          pid
    id  vid                                  
    id1 vis_id1  pid1, pid2, pid1, pid2, pid1
    id2 vis_id2  pid1, pid2, pid2, pid2, pid2