Hello I am new to python and not sure how to achieve.
ID | ENTITY_NAME | ENTITY_NAME | SECTION_GROUP | DOC_ID |
---|---|---|---|---|
1 | dNumber | U220059090 | GROUP 1 | 40 |
2 | tDate | 6-Dec-22 | GROUP 1 | 40 |
3 | dNumber | U220059090 | GROUP 2 | 40 |
4 | tDate | 6-Dec-22 | GROUP 2 | 40 |
5 | sCompany | bp | nan | 40 |
6 | dNumber | U220059090 | GROUP 1 | 45 |
7 | tDate | 6-Dec-22 | GROUP 1 | 45 |
8 | dNumber | U220059090 | GROUP 2 | 45 |
9 | tDate | 6-Dec-22 | GROUP 2 | 45 |
10 | sCompany | bp | nan | 45 |
I have applied group by to section_group column on below data and got two different result set in dataframe as below.
import pandas as pd
import numpy as np
df = pd.read_csv ('dups_check_group_v1.csv',encoding= 'unicode_escape',usecols= ['ID','ENTITY_NAME','ENTITY_VALUE','SECTION_GROUP','DOC_ID'])
mask = df['SECTION_GROUP'].isna()
rest = df[mask]
for _, g in df[~mask].groupby('SECTION_GROUP'):
g = pd.concat([g, rest])
print(g)
Result
ID ENTITY_NAME ENTITY_VALUE SECTION_GROUP DOC_ID
0 1 dNumber U220059090(C) GROUP 1 40
1 2 tDate 6-Dec-22 GROUP 1 40
5 6 dNumber U220059090(C) GROUP 1 45
6 7 tDate 6-Dec-22 GROUP 1 45
4 5 sCompany bp NaN 40
9 10 sCompany bp NaN 45
ID ENTITY_NAME ENTITY_VALUE SECTION_GROUP DOC_ID
2 3 dNumber U220059090(C) GROUP 2 40
3 4 tDate 6-Dec-22 GROUP 2 40
7 8 dNumber U220059090(C) GROUP 2 45
8 9 tDate 6-Dec-22 GROUP 2 45
4 5 sCompany bp NaN 40
9 10 sCompany bp NaN 45
I am trying to merge this two results into single result. like below. any help will be really appreciate.
ID ENTITY_NAME ENTITY_VALUE SECTION_GROUP DOC_ID
0 1 dNumber U220059090(C) GROUP 1 40
1 2 tDate 6-Dec-22 GROUP 1 40
5 6 dNumber U220059090(C) GROUP 1 45
6 7 tDate 6-Dec-22 GROUP 1 45
4 5 sCompany bp NaN 40
9 10 sCompany bp NaN 45
2 3 dNumber U220059090(C) GROUP 2 40
3 4 tDate 6-Dec-22 GROUP 2 40
7 8 dNumber U220059090(C) GROUP 2 45
8 9 tDate 6-Dec-22 GROUP 2 45
4 5 sCompany bp NaN 40
9 10 sCompany bp NaN 45
Just use concat
again:
out = pd.concat([pd.concat([g, rest]) for _, g in
df[~mask].groupby('SECTION_GROUP')])
Or rework your loop:
out = pd.concat([d for _, g in df[~mask].groupby('SECTION_GROUP')
for d in [g, rest]])
Output:
ID ENTITY_NAME ENTITY_NAME.1 SECTION_GROUP DOC_ID
0 1 dNumber U220059090 GROUP 1 40
1 2 tDate 6-Dec-22 GROUP 1 40
5 6 dNumber U220059090 GROUP 1 45
6 7 tDate 6-Dec-22 GROUP 1 45
4 5 sCompany bp NaN 40
9 10 sCompany bp NaN 45
2 3 dNumber U220059090 GROUP 2 40
3 4 tDate 6-Dec-22 GROUP 2 40
7 8 dNumber U220059090 GROUP 2 45
8 9 tDate 6-Dec-22 GROUP 2 45
4 5 sCompany bp NaN 40
9 10 sCompany bp NaN 45