I have the following DataFrame:
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000],
'Duration':['30days','50days','55days','40days','60days','35days','55days','50days'],
'Discount':[1000,2300,1000,1200,2500,1300,1400,1600]
}
df = pd.DataFrame(technologies, columns=['Courses','Fee','Duration','Discount'])
print(df)
How can I group by 'Courses' and sum only the Fee column while keeping all the other columns? When doing:
df2 = df.groupby('Courses').sum()
Both Fee and Discount are summed, and Duration is not in the resulting DataFrame.
When doing:
df.groupby('Courses')['Fee'].sum()
I end up with only the Fee per Course.
Expected output:
Fee Duration Discount
Courses
Hadoop 48000 90days 2300
Pandas 26000 60days 2500
PySpark 25000 50days 2300
Python 46000 90days 2800
Spark 47000 85days 2400
Edit: My original DataFrame has more columns.
Is there a good approach for this?
You can convert the Duration column to timedeltas with to_timedelta and then aggregate with sum, passing numeric_only=False:
df['Duration'] = pd.to_timedelta(df['Duration'])
df2 = df.groupby('Courses').sum(numeric_only=False)
print (df2)
Fee Duration Discount
Courses
Hadoop 48000 90 days 2300
Pandas 26000 60 days 2500
PySpark 25000 50 days 2300
Python 46000 90 days 2800
Spark 47000 85 days 2400
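If you want Duration back as the 'Ndays' strings shown in the expected output, a minimal follow-up (assuming the summed Duration column is still a timedelta) is to rebuild the strings from the days attribute:
# optional: turn the summed timedeltas back into the "Ndays" string format
df2['Duration'] = df2['Duration'].dt.days.astype(str) + 'days'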
EDIT: You can specify the columns to aggregate with sum in a dictionary and aggregate all the other columns with first:
import numpy as np

technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python"],
'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000],
'Duration':['30days','50days','55days','40days','60days','35days','55days','50days'],
'Discount':[1000,2300,1000,1200,2500,1300,1400,1600],
'col': 'a' + pd.Series(np.arange(8).astype(str))
}
df = pd.DataFrame(technologies, columns=['Courses','Fee','Duration','Discount','col'])
print(df)
Courses Fee Duration Discount col
0 Spark 22000 30days 1000 a0
1 PySpark 25000 50days 2300 a1
2 Hadoop 23000 55days 1000 a2
3 Python 24000 40days 1200 a3
4 Pandas 26000 60days 2500 a4
5 Hadoop 25000 35days 1300 a5
6 Spark 25000 55days 1400 a6
7 Python 22000 50days 1600 a7
cols = ['Fee','Duration','Discount']
df['Duration'] = pd.to_timedelta(df['Duration'])
d = dict.fromkeys(cols, 'sum')
# aggregate the remaining columns (here only 'col') with 'first'
another = dict.fromkeys(df.columns.drop(['Courses'] + cols), 'first')
df2 = df.groupby('Courses').agg({**d, **another})
print (df2)
Fee Duration Discount col
Courses
Hadoop 48000 90 days 2300 a2
Pandas 26000 60 days 2500 a4
PySpark 25000 50 days 2300 a1
Python 46000 90 days 2800 a3
Spark 47000 85 days 2400 a0
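Equivalently, the same mapping could be built in one dict comprehension; a minimal sketch (agg_map is just an illustrative name):
# 'sum' for the listed columns, 'first' for everything else
agg_map = {c: ('sum' if c in cols else 'first') for c in df.columns.drop('Courses')}
df2 = df.groupby('Courses').agg(agg_map)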
Another approach is to use GroupBy.transform for the columns in the list:
cols = ['Fee','Duration','Discount']
df['Duration'] = pd.to_timedelta(df['Duration'])
df[cols] = df.groupby('Courses')[cols].transform('sum')
print (df)
Courses Fee Duration Discount col
0 Spark 47000 85 days 2400 a0
1 PySpark 25000 50 days 2300 a1
2 Hadoop 48000 90 days 2300 a2
3 Python 46000 90 days 2800 a3
4 Pandas 26000 60 days 2500 a4
5 Hadoop 48000 90 days 2300 a5
6 Spark 47000 85 days 2400 a6
7 Python 46000 90 days 2800 a7
Then keep the first row per Courses with DataFrame.drop_duplicates:
df2 = df.drop_duplicates('Courses')
print (df2)
Courses Fee Duration Discount col
0 Spark 47000 85 days 2400 a0
1 PySpark 25000 50 days 2300 a1
2 Hadoop 48000 90 days 2300 a2
3 Python 46000 90 days 2800 a3
4 Pandas 26000 60 days 2500 a4
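If you also want Courses as the index like in your expected output, one possible follow-up is to chain set_index (and sort_index for the same alphabetical ordering groupby produces):
# move Courses into the index and sort it, mirroring the groupby output
df2 = df.drop_duplicates('Courses').set_index('Courses').sort_index()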