Search code examples
pythoncsvfor-loopdata-analysis

Python - sumif from csv


There is a column in my csv called cost, which I want to sum based on another column, called factory, to basically create a breakdown of cost by factory. I have rows such as the following, where there are multiple costs for each factory:

Factory,Cost,Cost_Type
Bali,23,0
Sydney,21,1
Sydney,4,2
Denver,8,1
Bali,9,1

I'd like to be able to quickly sum the cost per factory, and save these values to a variable. I think one way to do this is by looping through a list of factories, which then loops through the csv. Here is where I've got to:

factories= ['Bali', 'Sydney', 'Denver']
totalcost = 0
balicost = 0
sydneycost = 0
denvercost = 0

for factory in factories:
    for row in csv.reader(costcsv):
        if row[0] == factory:

Where I'm stuck is that I don't know how to change the variable which is being added to for the different factories, balicost, sydneycost and denvercost. The simplified version, where I'm just getting the total of the cost column was as follows:

for row in csv.reader(costcsv):
        totalcost += float(row[1])

I'm more than open to different approaches than this (I believe dictionaries could come into it), and appreciate any points in the right direction.


Solution

  • [Community wiki, because it's a little tangential.]

    When you're processing tabular data in Python, you should consider the pandas library. The operation you want to perform is a groupby sum, and that's easily done in two lines:

    df = pd.read_csv("factories.csv")
    by_factory = df.groupby("Factory")["Cost"].sum()
    

    which produces a Series object you can index into like a dictionary:

    >>> by_factory
    Factory
    Bali       32
    Denver      8
    Sydney     25
    Name: Cost, dtype: int64
    >>> by_factory["Bali"]
    32
    

    Update, using the updated data-- if you also want to handle Cost_Type, you have several options. One is to select only the rows with Cost_Type == 1:

    >>> df[df.Cost_Type == 1]
      Factory  Cost  Cost_Type
    1  Sydney    21          1
    3  Denver     8          1
    4    Bali     9          1
    
    [3 rows x 3 columns]
    >>> df[df.Cost_Type == 1].groupby("Factory")["Cost"].sum()
    Factory
    Bali        9
    Denver      8
    Sydney     21
    Name: Cost, dtype: int64
    

    or you can expand the groupby and group on both Factory and Cost_Type simultaneously:

    >>> df.groupby(["Cost_Type", "Factory"])["Cost"].sum()
    Cost_Type  Factory
    0          Bali       23
    1          Bali        9
               Denver      8
               Sydney     21
    2          Sydney      4
    Name: Cost, dtype: int64