There is a column in my csv called cost, which I want to sum based on another column, called factory, to basically create a breakdown of cost by factory. I have rows such as the following, where there are multiple costs for each factory:
Factory,Cost,Cost_Type
Bali,23,0
Sydney,21,1
Sydney,4,2
Denver,8,1
Bali,9,1
I'd like to be able to quickly sum the cost per factory, and save these values to a variable. I think one way to do this is by looping through a list of factories, which then loops through the csv. Here is where I've got to:
factories= ['Bali', 'Sydney', 'Denver']
totalcost = 0
balicost = 0
sydneycost = 0
denvercost = 0
for factory in factories:
for row in csv.reader(costcsv):
if row[0] == factory:
Where I'm stuck is that I don't know how to change the variable which is being added to for the different factories, balicost, sydneycost and denvercost. The simplified version, where I'm just getting the total of the cost column was as follows:
for row in csv.reader(costcsv):
totalcost += float(row[1])
I'm more than open to different approaches than this (I believe dictionaries could come into it), and appreciate any points in the right direction.
[Community wiki, because it's a little tangential.]
When you're processing tabular data in Python, you should consider the pandas
library. The operation you want to perform is a groupby sum, and that's easily done in two lines:
df = pd.read_csv("factories.csv")
by_factory = df.groupby("Factory")["Cost"].sum()
which produces a Series
object you can index into like a dictionary:
>>> by_factory
Factory
Bali 32
Denver 8
Sydney 25
Name: Cost, dtype: int64
>>> by_factory["Bali"]
32
Update, using the updated data-- if you also want to handle Cost_Type
, you have several options. One is to select only the rows with Cost_Type == 1:
>>> df[df.Cost_Type == 1]
Factory Cost Cost_Type
1 Sydney 21 1
3 Denver 8 1
4 Bali 9 1
[3 rows x 3 columns]
>>> df[df.Cost_Type == 1].groupby("Factory")["Cost"].sum()
Factory
Bali 9
Denver 8
Sydney 21
Name: Cost, dtype: int64
or you can expand the groupby and group on both Factory
and Cost_Type
simultaneously:
>>> df.groupby(["Cost_Type", "Factory"])["Cost"].sum()
Cost_Type Factory
0 Bali 23
1 Bali 9
Denver 8
Sydney 21
2 Sydney 4
Name: Cost, dtype: int64