Search code examples
pythoncsvpandasdataframegrouping

Group by region and date using Pandas


temp.csv

I have a csv file , please check the image for the sample input csv and I need to get a dataframe that has the sum of the "Amazon Elastic compute cloud" services running on a particular "Availability zone" group it based on the date.

Something like this

|UsageStartDate| AvaliabilityZone  | Sum of products used |  Total cost for each
 
[6/1/16,  ap-northeast-1a, Amazon Elastic compute cloud = 6, 15$]
[6/2/16,  ap-southeast-2 , Amazon Elastic compute cloud = 3,   12$]

This is how i tried using pandas :

funk = pd.read_csv('/tmp/temp.csv')
funk.sort_values('UsageStartDate') 
k = funk['AvailabilityZone'][funk['ProductName'] == 'Amazon Elastic Compute Cloud'].sum()
print  k 

Any help on this?

Here is the data:

    ProductName               AvailabilityZone  UsageStartDate  BlendedCost
0   Amazon Simple Queue Service                   6/1/16 0:00       0
1   Alexa Web Information Service                 6/1/16 0:00       0.00347032
2   Amazon DynamoDB        ap-southeast-2          6/1/16 0:00      0
3   Amazon DynamoDB        ap-southeast-2          6/1/16 0:00      0
4   Amazon Elastic Compute Cloud    ap-northeast-1a 6/1/16 0:00     0.1
5   Amazon Elastic Compute Cloud    ap-northeast-1a 6/1/16 0:00     0.02
6   Amazon Elastic Compute Cloud                     6/1/16 0:00    0
7   Amazon Elastic Compute Cloud                     6/1/16 0:00    0
8   Amazon Elastic Compute Cloud                     6/1/16 0:00    4.70E-06
9   Amazon Elastic Compute Cloud                     6/1/16 0:00    8.00E-08
10  Amazon Elastic Compute Cloud                     6/1/16 0:00    0.00133333
11  Amazon Elastic Compute Cloud                     6/1/16 0:00    0.005
12  Amazon Elastic Compute Cloud    ap-southeast-1a 6/1/16 0:00     0.02
13  Amazon Elastic Compute Cloud    ap-southeast-1a 6/1/16 0:00     0.02
14  Amazon Elastic Compute Cloud    ap-southeast-1b 6/1/16 0:00     0.02
15  Amazon Elastic Compute Cloud                    6/1/16 0:00     0

Solution

  • I think you need groupby with aggregate - size by len of column AvailabilityZone and sum od column BlendedCost:

    print (df.groupby(['UsageStartDate', 'AvailabilityZone', 'ProductName'])
             .agg({'AvailabilityZone':len,
                   'BlendedCost':sum}))
    

    Sample:

    import pandas as pd
    
    raw_data = {
        'ProductName': ['ASQS', 'AWIS', 'AWIS', 'AECC', 'AECC'], 
        'UsageStartDate': ['6/1/16','6/1/16','6/1/16','6/1/16','6/1/16'],
        'AvailabilityZone':['ap-northeast-1a','ap-northeast-1a','ap-northeast-1a','ap-southeast-2','ap-southeast-2'],
        'BlendedCost':[1,2,3,4,5]}
    df = pd.DataFrame(raw_data)
    print (df)
      AvailabilityZone  BlendedCost ProductName UsageStartDate
    0  ap-northeast-1a            1        ASQS         6/1/16
    1  ap-northeast-1a            2        AWIS         6/1/16
    2  ap-northeast-1a            3        AWIS         6/1/16
    3   ap-southeast-2            4        AECC         6/1/16
    4   ap-southeast-2            5        AECC         6/1/16
    
    print (df.groupby(['UsageStartDate', 'AvailabilityZone', 'ProductName'])
             .agg({'AvailabilityZone':len,'BlendedCost':sum})
             .rename(columns={'AvailabilityZone':'Sum of products used', 'BlendedCost':'Total'})
             .reset_index())
    
      UsageStartDate AvailabilityZone ProductName  Sum of products used  Total
    0         6/1/16  ap-northeast-1a        ASQS                     1      1
    1         6/1/16  ap-northeast-1a        AWIS                     2      5
    2         6/1/16   ap-southeast-2        AECC                     2      9
    

    Solution with sample data:

    import pandas as pd
    import io
    
    temp=u"""ProductName;AvailabilityZone;UsageStartDate;BlendedCost
    Amazon Simple Queue Service;;6/1/16 0:00;0
    Alexa Web Information Service;;6/1/16 0:00;0.00347032
    Amazon DynamoDB;ap-southeast-2;6/1/16 0:00;0
    Amazon DynamoDB;ap-southeast-2;6/1/16 0:00;0
    Amazon Elastic Compute Cloud;ap-northeast-1a;6/1/16 0:00;0.1
    Amazon Elastic Compute Cloud;ap-northeast-1a;6/1/16 0:00;0.02
    Amazon Elastic Compute Cloud;;6/1/16 0:00;0
    Amazon Elastic Compute Cloud;;6/1/16 0:00;0
    Amazon Elastic Compute Cloud;;6/1/16 0:00;4.70E-06
    Amazon Elastic Compute Cloud;;6/1/16 0:00;8.00E-08
    Amazon Elastic Compute Cloud;;6/1/16 0:00;0.00133333
    Amazon Elastic Compute Cloud;;6/1/16 0:00;0.005
    Amazon Elastic Compute Cloud;ap-southeast-1a;6/1/16 0:00;0.02
    Amazon Elastic Compute Cloud;ap-southeast-1a;6/1/16 0:00;0.02
    Amazon Elastic Compute Cloud;ap-southeast-1b;6/1/16 0:00;0.02
    Amazon Elastic Compute Cloud;;6/1/16 0:00;0"""
    #after testing replace io.StringIO(temp) to filename
    df = pd.read_csv(io.StringIO(temp), sep=";", index_col=None
    
    #print (df)
    
    print (df.groupby(['UsageStartDate', 'AvailabilityZone', 'ProductName'])
             .agg({'AvailabilityZone':len,'BlendedCost':sum})
             .rename(columns={'AvailabilityZone':'Sum of products used', 'BlendedCost':'Total'})
             .reset_index())
    
      UsageStartDate AvailabilityZone                   ProductName  \
    0    6/1/16 0:00  ap-northeast-1a  Amazon Elastic Compute Cloud   
    1    6/1/16 0:00  ap-southeast-1a  Amazon Elastic Compute Cloud   
    2    6/1/16 0:00  ap-southeast-1b  Amazon Elastic Compute Cloud   
    3    6/1/16 0:00   ap-southeast-2               Amazon DynamoDB   
    
       Sum of products used  Total  
    0                     2   0.12  
    1                     2   0.04  
    2                     1   0.02  
    3                     2   0.00