Search code examples
pythonpandasdataframegroup-bypandas-groupby

Pandas rearrange groupby objects ( many-to-many)


I have a many-to-many dataframe that look something like this, where and id could have contains multiple lands, and a land could too contains multiple ids:

t = pd.DataFrame({
    'id': ['a', 'a', 'b', 'c', 'c', 'c'],
    'land': ['A', 'B', 'A', 'A', 'B', 'C'],
    'area': [123, 234, 123, 23, 342, 12],
    'info': ['Im', 'never', 'gonna', 'give', 'you', 'up']
})

Eventually I'm joining this to GIS and I need to rearrange data to one-to-one format grouping by land.

My expected output could be like this: ( I too struggle to get this to print out groupby object. I want the full data instead of using count() or size(), but it will only return pandas.groupbyObject.)

area info
land id
A a 123 Im
b 123 gonna
c 23 give
B a 234 never
c 342 you
C c 12 up

...and to merge the data to make it one-on-one: (just an example output format. Could be your own readable way as long as that one lands can show all ids and information properly)

land id area info
A 1.a ; 2.b ; 3.c a:123 ; b:123 ; c:23 a:'Im' ; b:'gonna' ; c:'give'
B 1.a ; 2.c a:234 ; c:342 a:'never' ; c:'you'
C c 12 'up'

Much appreciated.


Solution

  • Option 1 is simple enough. Just set_index + sort_index:

    option1 = t.set_index(['land','id']).sort_index()
    

    Output:

             area   info
    land id             
    A    a    123     Im
         b    123  gonna
         c     23   give
    B    a    234  never
         c    342    you
    C    c     12     up
    

    Option 2 is a bit tricky. One way is to use groupby.apply where you apply a custom function that combines ids with area and info for each land:

    def combine(x):
        return '; '.join(f'{i}:{j}' for i,j in x) if len(x) > 1 else f'{x[0][1]}'
    
    tmp = t.groupby('land').apply(lambda x: [combine(list(enumerate(x['id'], 1))),
                                             combine(x[['id','area']].to_numpy()),
                                             combine(x[['id','info']].to_numpy())])
    
    option2 = pd.DataFrame(tmp.tolist(), index=tmp.index, columns=['id','area','info'])
    

    Output:

                     id                area                   info
    land                                                          
    A     1:a; 2:b; 3:c  a:123; b:123; c:23  a:Im; b:gonna; c:give
    B          1:a; 2:c        a:234; c:342         a:never; c:you
    C                 c                  12                     up