Group by region and date using Pandas

I have a csv file , please check the image for the sample input csv and I need to get a dataframe that has the sum of the "Amazon Elastic compute cloud" services running on a particular "Availability zone" group it based on the date.

Something like this

|UsageStartDate| AvaliabilityZone  | Sum of products used |  Total cost for each
 
[6/1/16,  ap-northeast-1a, Amazon Elastic compute cloud = 6, 15$]
[6/2/16,  ap-southeast-2 , Amazon Elastic compute cloud = 3,   12$]

This is how i tried using pandas :

funk = pd.read_csv('/tmp/temp.csv')
funk.sort_values('UsageStartDate') 
k = funk['AvailabilityZone'][funk['ProductName'] == 'Amazon Elastic Compute Cloud'].sum()
print  k

Any help on this?

Here is the data:

    ProductName               AvailabilityZone  UsageStartDate  BlendedCost
0   Amazon Simple Queue Service                   6/1/16 0:00       0
1   Alexa Web Information Service                 6/1/16 0:00       0.00347032
2   Amazon DynamoDB        ap-southeast-2          6/1/16 0:00      0
3   Amazon DynamoDB        ap-southeast-2          6/1/16 0:00      0
4   Amazon Elastic Compute Cloud    ap-northeast-1a 6/1/16 0:00     0.1
5   Amazon Elastic Compute Cloud    ap-northeast-1a 6/1/16 0:00     0.02
6   Amazon Elastic Compute Cloud                     6/1/16 0:00    0
7   Amazon Elastic Compute Cloud                     6/1/16 0:00    0
8   Amazon Elastic Compute Cloud                     6/1/16 0:00    4.70E-06
9   Amazon Elastic Compute Cloud                     6/1/16 0:00    8.00E-08
10  Amazon Elastic Compute Cloud                     6/1/16 0:00    0.00133333
11  Amazon Elastic Compute Cloud                     6/1/16 0:00    0.005
12  Amazon Elastic Compute Cloud    ap-southeast-1a 6/1/16 0:00     0.02
13  Amazon Elastic Compute Cloud    ap-southeast-1a 6/1/16 0:00     0.02
14  Amazon Elastic Compute Cloud    ap-southeast-1b 6/1/16 0:00     0.02
15  Amazon Elastic Compute Cloud                    6/1/16 0:00     0

Solution

I think you need groupby with aggregate - size by len of column AvailabilityZone and sum od column BlendedCost:

print (df.groupby(['UsageStartDate', 'AvailabilityZone', 'ProductName'])
         .agg({'AvailabilityZone':len,
               'BlendedCost':sum}))

Sample:

import pandas as pd

raw_data = {
    'ProductName': ['ASQS', 'AWIS', 'AWIS', 'AECC', 'AECC'], 
    'UsageStartDate': ['6/1/16','6/1/16','6/1/16','6/1/16','6/1/16'],
    'AvailabilityZone':['ap-northeast-1a','ap-northeast-1a','ap-northeast-1a','ap-southeast-2','ap-southeast-2'],
    'BlendedCost':[1,2,3,4,5]}
df = pd.DataFrame(raw_data)
print (df)
  AvailabilityZone  BlendedCost ProductName UsageStartDate
0  ap-northeast-1a            1        ASQS         6/1/16
1  ap-northeast-1a            2        AWIS         6/1/16
2  ap-northeast-1a            3        AWIS         6/1/16
3   ap-southeast-2            4        AECC         6/1/16
4   ap-southeast-2            5        AECC         6/1/16

print (df.groupby(['UsageStartDate', 'AvailabilityZone', 'ProductName'])
         .agg({'AvailabilityZone':len,'BlendedCost':sum})
         .rename(columns={'AvailabilityZone':'Sum of products used', 'BlendedCost':'Total'})
         .reset_index())

  UsageStartDate AvailabilityZone ProductName  Sum of products used  Total
0         6/1/16  ap-northeast-1a        ASQS                     1      1
1         6/1/16  ap-northeast-1a        AWIS                     2      5
2         6/1/16   ap-southeast-2        AECC                     2      9

Solution with sample data:

import pandas as pd
import io

temp=u"""ProductName;AvailabilityZone;UsageStartDate;BlendedCost
Amazon Simple Queue Service;;6/1/16 0:00;0
Alexa Web Information Service;;6/1/16 0:00;0.00347032
Amazon DynamoDB;ap-southeast-2;6/1/16 0:00;0
Amazon DynamoDB;ap-southeast-2;6/1/16 0:00;0
Amazon Elastic Compute Cloud;ap-northeast-1a;6/1/16 0:00;0.1
Amazon Elastic Compute Cloud;ap-northeast-1a;6/1/16 0:00;0.02
Amazon Elastic Compute Cloud;;6/1/16 0:00;0
Amazon Elastic Compute Cloud;;6/1/16 0:00;0
Amazon Elastic Compute Cloud;;6/1/16 0:00;4.70E-06
Amazon Elastic Compute Cloud;;6/1/16 0:00;8.00E-08
Amazon Elastic Compute Cloud;;6/1/16 0:00;0.00133333
Amazon Elastic Compute Cloud;;6/1/16 0:00;0.005
Amazon Elastic Compute Cloud;ap-southeast-1a;6/1/16 0:00;0.02
Amazon Elastic Compute Cloud;ap-southeast-1a;6/1/16 0:00;0.02
Amazon Elastic Compute Cloud;ap-southeast-1b;6/1/16 0:00;0.02
Amazon Elastic Compute Cloud;;6/1/16 0:00;0"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), sep=";", index_col=None

#print (df)

print (df.groupby(['UsageStartDate', 'AvailabilityZone', 'ProductName'])
         .agg({'AvailabilityZone':len,'BlendedCost':sum})
         .rename(columns={'AvailabilityZone':'Sum of products used', 'BlendedCost':'Total'})
         .reset_index())

  UsageStartDate AvailabilityZone                   ProductName  \
0    6/1/16 0:00  ap-northeast-1a  Amazon Elastic Compute Cloud   
1    6/1/16 0:00  ap-southeast-1a  Amazon Elastic Compute Cloud   
2    6/1/16 0:00  ap-southeast-1b  Amazon Elastic Compute Cloud   
3    6/1/16 0:00   ap-southeast-2               Amazon DynamoDB   

   Sum of products used  Total  
0                     2   0.12  
1                     2   0.04  
2                     1   0.02  
3                     2   0.00