Search code examples
pythonpython-3.xpandasdataframegroup-by

merge group by data into single list or dataframe


Hello I am new to python and not sure how to achieve.

ID ENTITY_NAME ENTITY_NAME SECTION_GROUP DOC_ID
1 dNumber U220059090 GROUP 1 40
2 tDate 6-Dec-22 GROUP 1 40
3 dNumber U220059090 GROUP 2 40
4 tDate 6-Dec-22 GROUP 2 40
5 sCompany bp nan 40
6 dNumber U220059090 GROUP 1 45
7 tDate 6-Dec-22 GROUP 1 45
8 dNumber U220059090 GROUP 2 45
9 tDate 6-Dec-22 GROUP 2 45
10 sCompany bp nan 45

I have applied group by to section_group column on below data and got two different result set in dataframe as below.

import pandas as pd
import numpy as np

df = pd.read_csv ('dups_check_group_v1.csv',encoding= 'unicode_escape',usecols= ['ID','ENTITY_NAME','ENTITY_VALUE','SECTION_GROUP','DOC_ID'])

mask = df['SECTION_GROUP'].isna()
rest = df[mask]

for _, g in df[~mask].groupby('SECTION_GROUP'):
    g = pd.concat([g, rest])
    print(g)

Result

ID ENTITY_NAME   ENTITY_VALUE SECTION_GROUP  DOC_ID
0   1     dNumber  U220059090(C)       GROUP 1      40
1   2       tDate       6-Dec-22       GROUP 1      40
5   6     dNumber  U220059090(C)       GROUP 1      45
6   7       tDate       6-Dec-22       GROUP 1      45
4   5    sCompany             bp           NaN      40
9  10    sCompany             bp           NaN      45

ID ENTITY_NAME   ENTITY_VALUE SECTION_GROUP  DOC_ID
2   3     dNumber  U220059090(C)       GROUP 2      40
3   4       tDate       6-Dec-22       GROUP 2      40
7   8     dNumber  U220059090(C)       GROUP 2      45
8   9       tDate       6-Dec-22       GROUP 2      45
4   5    sCompany             bp           NaN      40
9  10    sCompany             bp           NaN      45

I am trying to merge this two results into single result. like below. any help will be really appreciate.

ID ENTITY_NAME   ENTITY_VALUE SECTION_GROUP  DOC_ID
    0   1     dNumber  U220059090(C)       GROUP 1      40
    1   2       tDate       6-Dec-22       GROUP 1      40
    5   6     dNumber  U220059090(C)       GROUP 1      45
    6   7       tDate       6-Dec-22       GROUP 1      45
    4   5    sCompany             bp           NaN      40
    9  10    sCompany             bp           NaN      45
    2   3     dNumber  U220059090(C)       GROUP 2      40
    3   4       tDate       6-Dec-22       GROUP 2      40
    7   8     dNumber  U220059090(C)       GROUP 2      45
    8   9       tDate       6-Dec-22       GROUP 2      45
    4   5    sCompany             bp           NaN      40
    9  10    sCompany             bp           NaN      45

Solution

  • Just use concat again:

    out = pd.concat([pd.concat([g, rest]) for _, g in
                     df[~mask].groupby('SECTION_GROUP')])
    

    Or rework your loop:

    out = pd.concat([d for _, g in df[~mask].groupby('SECTION_GROUP')
                     for d in [g, rest]])
    

    Output:

       ID ENTITY_NAME ENTITY_NAME.1 SECTION_GROUP  DOC_ID
    0   1     dNumber    U220059090       GROUP 1      40
    1   2       tDate      6-Dec-22       GROUP 1      40
    5   6     dNumber    U220059090       GROUP 1      45
    6   7       tDate      6-Dec-22       GROUP 1      45
    4   5    sCompany            bp           NaN      40
    9  10    sCompany            bp           NaN      45
    2   3     dNumber    U220059090       GROUP 2      40
    3   4       tDate      6-Dec-22       GROUP 2      40
    7   8     dNumber    U220059090       GROUP 2      45
    8   9       tDate      6-Dec-22       GROUP 2      45
    4   5    sCompany            bp           NaN      40
    9  10    sCompany            bp           NaN      45