I am trying to concatenate 3 datasets with same 4 column names and the resultant set is not what I am expecting. The datasets look like **dataset A **
| Col_A | Col_B |Col_C | Col_D|
| ------ | ------|------|------|
| a10 | b10 | | |
| a11 | b11 | | |
| a12 | b12 | 10 | 11 |
| a13 | b13 | | |
**dataset B **
| Col_A | Col_B |Col_C | Col_D|
| ------| ----- |------|------|
| a10 | b10 | 12 | 13 |
| a11 | b11 | 15 | 16 |
| a12 | b12 | | |
| a13 | b13 | | |
**dataset C **
| Col_A | Col_B |Col_C | Col_D|
| ----- | ----- |------|------|
| a10 | b10 | | |
| a11 | b11 | | |
| a12 | b12 |17 | 18 |
| a13 | b13 |20 | 21 |
Resultant dataset I need to look like is after concatenating the datasets in the sequence:
**dataset final **
| Col_A | Col_B |Col_C | Col_D|
| ----- | ----- |------|------|
| a10 | b10 |12 | 13 |
| a11 | b11 | 15 | 16 |
| a12 | b12 | 17 | 18 |
| a13 | b13 | 20 | 21 |
I tried the normal concat()
method but I am not able to solve the part where the values are being replaced when concatenation done in the sequence.
concat_all = pd.concat([df_A,df_B,df_C], axis =0)
Recreating the datasets:
import pandas as pd
dfA = pd.DataFrame({
'Col_A': ['a10', 'a11', 'a12', 'a13'],
'Col_B': ['b10', 'b11', 'b12', 'b13'],
'Col_C': ['', '', '10', ''],
'Col_D': ['', '', '11', '']
})
dfB = pd.DataFrame({
'Col_A': ['a10', 'a11', 'a12', 'a13'],
'Col_B': ['b10', 'b11', 'b12', 'b13'],
'Col_C': ['12', '15', '', ''],
'Col_D': ['13', '16', '', '']
})
dfC = pd.DataFrame({
'Col_A': ['a10', 'a11', 'a12', 'a13'],
'Col_B': ['b10', 'b11', 'b12', 'b13'],
'Col_C': ['', '', '17', '20'],
'Col_D': ['', '', '18', '21']
})
Any leads would be appreciated
You can do something like this:
df = (pd.concat([dfA, dfB, dfC], axis=0)
.query('Col_C != ""')
.groupby(['Col_A', 'Col_B'])
.agg({'Col_C': 'last', 'Col_D': 'last'}))
Col_C Col_D
Col_A Col_B
a10 b10 12 13
a11 b11 15 16
a12 b12 17 18
a13 b13 20 21