Search code examples
pythonpandasdata-cleaning

Join Dataframes in Pandas and Sum Columns


Lets say I have 2 dataframes in pandas. I want to perform a left join on these dataframes in a very specific way as follows, and the easiest way to explain is probably via an example.

First dataframe:

Date   Col1  Col2
1/1
2/1
...

Second Dataframe:

Date   A   B   C
1/1     90  0   0
1/1     0   75  0
1/1     73  0   0
2/1     0    0   85
2/1     0    0   75

The dates in the first dataframe are unique and will be used to join to the second dataframe. Col1 and Col2 are not blank, but they are irrelevant for what I want to do. In second data frame, a date can appear multiple times, and exactly one of the columns A, B, and C contain a number greater than 0, with the other two containing 0. There could also be some extra columns which are irrelevant for what I want to do.

I want to keep everything in the first dataframe and add columns A, B and C in such a way that the values in those columns would be the sum of all values on the given data in the second dataframe. So in the example above, I would want the output to look like this:

Date   Col1  Col2   A   B   C
1/1                       163 75   0
2/1                          0   0   160
...

I have looked at the pandas join function but it doesn't seem to give the option to sum columns in the way that I want. Is there any other way to achieve such a result? I realise I could probably write a for loop to do this, but I was hoping there might be a more efficient way.


Solution

  • Do a groupby sum on df1, and merge the result with df0:

    In [127]: df1.groupby("Date", as_index=False).sum()
    Out[127]: 
      Date    A   B    C
    0  1/1  163  75    0
    1  2/1    0   0  160
    
    In [128]: df0.merge(df1.groupby("Date", as_index=False).sum())
    Out[128]: 
      Date  Col1  Col2    A   B    C
    0  1/1   NaN   NaN  163  75    0
    1  2/1   NaN   NaN    0   0  160
    

    (I'm assuming here that Date is a column, and not an index, but it's easy to either reset_index() or use level=0 instead.)