Search code examples
pythonpandassumrowaggregate-functions

Getting row sums for specific columns - Python


I created a simple dataset df with three columns, Area, Year_2010, Year_2020. The related code:

# Import pandas library

import pandas as pd
  
# initialize list of lists
data = [['Netherlands', 100, 200], ['Belgium', 15, 80], ['Germany', 125, 300]]
  
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Area', 'Year_2010', 'Year_2020'])
  
# print dataframe.
df

#prints
          Area  Year_2010  Year_2020
0  Netherlands        100        200
1      Belgium         15         80
2      Germany        125        300

For two of the three columns (being Year_2010 and Year_2020), I would like to obtain the sum of all related rows. The output of the sum should appear in a new row for both columns.

Ideally, the output reads:

                        Area  Year_2010  Year_2020
0                Netherlands        100        200
1                    Belgium         15         80
2                    Germany        125        300
3                                   240        580

However, if I apply

total = df.sum(axis=0)
df = df.append(total, ignore_index=True)

The Area column will also be aggregated, leading to concatenation of country names in the Area column:

                        Area  Year_2010  Year_2020
0                Netherlands        100        200
1                    Belgium         15         80
2                    Germany        125        300
3  NetherlandsBelgiumGermany        240        580

How can I get the row sums for specific columns, while leaving out other columns?

As an alternative to leaving the value NetherlandsBelgiumGermany blank, the value can be replaced by Total.


Solution

  • One solution could be to filter the columns by type and use indexing:

    cols = df.select_dtypes('number').columns
    
    df.loc[len(df), cols] = df[cols].sum()
    

    updated DataFrame:

              Area  Year_2010  Year_2020
    0  Netherlands      100.0      200.0
    1      Belgium       15.0       80.0
    2      Germany      125.0      300.0
    3          NaN      240.0      580.0
    

    Alternative if you want the empty string:

    cols = df.select_dtypes('number').columns
    
    df.loc[len(df)] = df[cols].sum().reindex(df.columns, fill_value='')
    

    output:

              Area  Year_2010  Year_2020
    0  Netherlands        100        200
    1      Belgium         15         80
    2      Germany        125        300
    3                     240        580
    

    NB. this assumes a range index, if this is not the case, please update the example for alternative ways (using pandas.concat for example).