Tags: python, pandas, merge, data-cleaning

How do I merge data without duplicating the columns?


Let's say I have multiple data frames df1, df2, df3 (in reality, I have about 11 data frames). They were converted to data frames from Excel worksheets.
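They are loaded roughly like this (the workbook name below is a placeholder, not my real file):

    import pandas as pd

    # Placeholder workbook name -- the real file has about 11 worksheets,
    # each with a DateTime column plus Col1, Col2, Col3
    xls = pd.ExcelFile('workbook.xlsx')
    frames = [pd.read_excel(xls, sheet_name=name) for name in xls.sheet_names]
    df1, df2, df3 = frames[:3]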

For example:

    DateTime | Col1 | Col2 | Col3 |
    -------------------------------
    jan      | 20   | 30   | 40   |
    Feb      | NaN  | NaN  | NaN  |
    Mar      | NaN  | NaN  | NaN  |
    Apr      | NaN  | NaN  | NaN  |


    DateTime | Col1 | Col2 | Col3 |
    -------------------------------
    jan      | NaN  | NaN  | NaN  |
    Feb      | 20   | 30   | 40   |
    Mar      | NaN  | NaN  | NaN  |
    Apr      | NaN  | NaN  | NaN  |

    DateTime | Col1 | Col2 | Col3 |
    -------------------------------
    jan      | NaN  | NaN  | NaN  |
    Feb      | NaN  | NaN  | NaN  |
    Mar      | 10   | 20   | 80   |
    Apr      | NaN  | NaN  | NaN  |

I would like the output to be:

    DateTime | Col1 | Col2 | Col3 |
    -------------------------------
    jan      | 20   | 30   | 40   |
    Feb      | 20   | 30   | 40   |
    Mar      | 10   | 20   | 80   |
    Apr      | NaN  | NaN  | NaN  |

All the spreadsheets have the same column names, and the date column has the same rows (same count and same labels) in each one. I tried join, merge and concatenate, but they either overwrite values or add extra rows or columns. In the end, the output should have the same number of rows and columns as each input.

I am new to Python, so I am trying to figure this out!


Solution

  • Use concat with an aggregate sum. If the dates are stored in a regular column (DateTime in the example), group by that column:

    df = pd.concat([df1, df2, df3]).groupby('DateTime', sort=False).sum(min_count=1)
    

    If the dates are in the index (a DatetimeIndex) instead, group by the index level:

    df = pd.concat([df1, df2, df3]).groupby(level=0, sort=False).sum(min_count=1)
    
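    In both variants min_count=1 is what keeps the Apr row as NaN: by default sum() turns a group that contains only NaN values into 0, while min_count=1 requires at least one non-NaN value before producing a number.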

    print (df)
         Col1  Col2  Col3
    jan  20.0  30.0  40.0
    Feb  20.0  30.0  40.0
    Mar  10.0  20.0  80.0
    Apr   NaN   NaN   NaN
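
    If you want to sanity-check the pattern end to end, here is a minimal self-contained sketch that rebuilds the three example frames from the question (with the months as the index, as in the second variant) and applies the same concat/groupby/sum step:

    import numpy as np
    import pandas as pd

    idx = ['jan', 'Feb', 'Mar', 'Apr']
    cols = ['Col1', 'Col2', 'Col3']

    # Each "worksheet" has the same index and columns but fills a different month
    df1 = pd.DataFrame(np.nan, index=idx, columns=cols)
    df1.loc['jan'] = [20, 30, 40]
    df2 = pd.DataFrame(np.nan, index=idx, columns=cols)
    df2.loc['Feb'] = [20, 30, 40]
    df3 = pd.DataFrame(np.nan, index=idx, columns=cols)
    df3.loc['Mar'] = [10, 20, 80]

    # Stack the frames and collapse duplicate index labels into a single row each
    out = pd.concat([df1, df2, df3]).groupby(level=0, sort=False).sum(min_count=1)
    print(out)   # matches the expected output above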