I'm trying to generate reports using Pandas, grouping by a set of fields:
This is what I'm doing:
#!/usr/bin/env python3
import pandas as pd
data = [
{
'id': 1,
'name': 'name1',
'pretty_name': 'Pretty Name 1',
'server_name': 'exampleserver.local',
'provider': 'provider1',
'type': 'A',
'status': 'KO'
},
{
'id': 2,
'name': 'name2',
'pretty_name': 'Pretty Name 2',
'server_name': 'exampleserver1.local',
'provider': 'provider2',
'type': 'B',
'status': 'OK'
},
{
'id': 1,
'name': 'name1',
'pretty_name': 'Pretty Name 1',
'server_name': 'exampleserver.local',
'provider': 'provider1',
'type': 'A',
'status': 'KO'
},
{
'id': 1,
'name': 'name1',
'pretty_name': 'Pretty Name 1',
'server_name': 'exampleserver.local',
'provider': 'provider1',
'type': 'A',
'status': 'OK'
},
{
'id': 2,
'name': 'name2',
'pretty_name': 'Pretty Name 2',
'server_name': 'exampleserver.local',
'provider': 'provider2',
'type': 'A',
'status': 'OK'
}
]
df = pd.DataFrame(data)
grouped = df.groupby(['server_name', 'provider', 'type', 'status'])['id'].count()
print(grouped.to_string())
Which returns:
server_name provider type status
exampleserver.local provider1 A KO 2
OK 1
provider2 A OK 1
exampleserver1.local provider2 B OK 1
This is alright, but I would like to add to the result a row containing the total for each provider. I.e.
server_name provider tot type status
exampleserver.local provider1 3 A KO 2
OK 1
provider2 1 A OK 1
exampleserver1.local provider2 1 B OK 1
I'm pretty sure this can be done quite easily with Pandas, but I've spent hours reading documentation with no luck.
Any pointers?
Thanks.
EDIT: I've corrected and extended the example as it didn't really made sense.
You can create helper column for compare if match provider1
with DataFrame.assign
and Series.eq
, convert to integers, so you can use sum
for count matched values:
grouped = (df.assign(new=df['provider'].str.contains('provider1').astype(int))
.groupby(['server_name', 'provider', 'type', 'status'])['new']
.agg([('count','size'), ('provider1_count','sum')])
.reset_index())
print (grouped)
server_name provider type status count provider1_count
0 exampleserver.local provider1 A KO 1 1
1 exampleserver.local provider2 A OK 1 0
2 exampleserver.local provider2 B OK 1 0
EDIT:
You can add as_index=False
for DataFrame
and rename
column:
df1 = (df.groupby(['server_name', 'provider', 'type', 'status'], as_index=False)['id']
.count()
.rename(columns={'id':'counts'}))
Then if want new column in position 2
use DataFrame.insert
with GroupBy.transform
:
df1.insert(2, 'tot', df1.groupby(['server_name','provider'])['counts'].transform('sum'))
print(df1)
server_name provider tot type status counts
0 exampleserver.local provider1 3 A KO 2
1 exampleserver.local provider1 3 A OK 1
2 exampleserver.local provider2 1 A OK 1
3 exampleserver1.local provider2 1 B OK 1
And last if need Multiindex
use DataFrame.set_index
:
grouped = df1.set_index(['server_name', 'provider', 'tot','type', 'status'])['counts']
print (grouped)
server_name provider tot type status
exampleserver.local provider1 3 A KO 2
OK 1
provider2 1 A OK 1
exampleserver1.local provider2 1 B OK 1
Name: counts, dtype: int64