Search code examples
pythonpandasdataframeconcatenation

Problem in concatenating 3 datasets with same column names and values being replaced for the same key column


I am trying to concatenate 3 datasets with same 4 column names and the resultant set is not what I am expecting. The datasets look like **dataset A **

| Col_A  | Col_B |Col_C | Col_D|
| ------ | ------|------|------|
| a10    | b10   |      |      |
| a11    | b11   |      |      |      
| a12    | b12   | 10   | 11   |
| a13    | b13   |      |      |      

**dataset B **

| Col_A | Col_B |Col_C | Col_D|
| ------| ----- |------|------|
| a10    | b10  | 12   |  13  |
| a11    | b11  | 15   | 16   |
| a12    | b12  |      |      |
| a13    | b13  |      |      |

**dataset C **

| Col_A | Col_B |Col_C | Col_D|
| ----- | ----- |------|------|
| a10   | b10   |      |      |
| a11   | b11   |      |      |
| a12   | b12   |17    | 18   |
| a13   | b13   |20    |  21  |

Resultant dataset I need to look like is after concatenating the datasets in the sequence:

**dataset final **

| Col_A | Col_B |Col_C | Col_D|
| ----- | ----- |------|------|
| a10   | b10   |12    |   13 |
| a11   | b11   | 15   |  16  |
| a12   | b12   | 17   | 18   |
| a13   | b13   | 20   |  21  |

I tried the normal concat() method but I am not able to solve the part where the values are being replaced when concatenation done in the sequence.

concat_all = pd.concat([df_A,df_B,df_C], axis =0)

Recreating the datasets:

import pandas as pd

dfA = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['', '', '10', ''],
    'Col_D': ['', '', '11', '']
})

dfB = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['12', '15', '', ''],
    'Col_D': ['13', '16', '', '']
})

dfC = pd.DataFrame({
    'Col_A': ['a10', 'a11', 'a12', 'a13'],
    'Col_B': ['b10', 'b11', 'b12', 'b13'],
    'Col_C': ['', '', '17', '20'],
    'Col_D': ['', '', '18', '21']
})

Any leads would be appreciated


Solution

  • You can do something like this:

    df = (pd.concat([dfA, dfB, dfC], axis=0)
            .query('Col_C != ""')
            .groupby(['Col_A', 'Col_B'])
            .agg({'Col_C': 'last', 'Col_D': 'last'}))
    
                Col_C Col_D
    Col_A Col_B            
    a10   b10      12    13
    a11   b11      15    16
    a12   b12      17    18
    a13   b13      20    21