Search code examples
pythoncsvcolumnsorting

Sum an unsorted list of columns in csv file?


I'm very new to Python (and very rusty on scripting, my last trials were bash and Perl in 2001 or so), have tried searching SO already -- but honestly didn't even know what to look for. I'm fairly sure it's trivial - and I'm a bit ashamed.

I have a fairly large CSV file (approx 26k rows), in a tab-separated format:

name, current_value, current_pct, change_pct
ItemA 2452434324     7,70%        -1,19
ItemB 342331         2,40%        -0,45
ItemC 32412123       3,90%        3,87
ItemD 0              0            -4,52
ItemE 12318231       14,80%       0
ItemA 542312134      1,60%        0,11
ItemC 2423423425     11,21%       -0,01
ItemE 3141888103     30,00%       0
ItemB 78826          1,01%        12,01
ItemA 89937          0,04%        0
...

There's total of approx 300 "Items" (repeating, but in various order, and sometimes only appearing once-or-twice), each of them has a "current value" (integer, from 0 to approx 1000 million (or 1 billion/milliard)), current percentage value (not interesting to me at the moment), and percentage change from last reading (different file, not interesting to me at the moment). What I want to achieve is a sum of 'change_pct' column for each Item, so for aforementioned example result would be:

name    total_pct_change
ItemA   -1,08
ItemB   11,56
ItemC   3,86
ItemD   -4,52
ItemE   0

I was planning to create a list of items then just sum the values of row[3] for these, but I fail miserably. What I have until now:

import csv, sys, string
xlsfile = sys.argv[1]
with open(xlsfile, 'rb') as f:
    reader = csv.reader(f, delimiter='\t')
    item = row[0]
    pct_change = row[3]
    # this is where I draw a blank
    # was thinking of something akin to
    #   foreach item do sum(pct_change)
    # but that's obviously wrong
    print item, sum_pct_change
f.close()

Solution

  • A pandas solution that works:

    import pandas as pd
    
    with open(xlsfile) as fobj:
        header = [entry.strip() for entry in next(fobj).split(',')]
    
    data = pd.read_csv(xlsfile, delim_whitespace=True, decimal=',', names=header, skiprows=1)
    summed = data.groupby(by=['name'])['change_pct'].sum()
    
    print(summed)
    

    Output:

    name
    ItemA    -1.08
    ItemB    11.56
    ItemC     3.86
    ItemD    -4.52
    ItemE     0.00
    Name: change_pct, dtype: float64
    

    EDIT

    If your file is ;-separated. This should work:

    data = pd.read_csv('pct2.csv', sep=';', decimal=',')
    summed = data.groupby(by=['name'])['change_pct'].sum()
    print(summed)