Search code examples
pythonpandascsvsplitsum

How do I read a csv of national data, where first column is "state", second "town", and sum columns by state? (Python/Pandas)


Every day I create a csv that looks something like this.

Sector Code Col1 Col2 Col3 Col4
Sect1
CodeA data data data data
CodeB data data data data
CodeC data data data data
CodeD data data data data
Sect2
CodeE data data data data
CodeF data data data data
CodeG data data data data
CodeH data data data data
Sect3
CodeI data data data data
CodeJ data data data data
CodeK data data data data
CodeL data data data data
SectN
CodeX data data data data
CodeY data data data data
CodeZ data data data data

Note the blank cells: the whole row after the sector name, and the 'sector' column before each code.

Id like to read the csv and EITHER:

  1. Sum the columns by sector, creating a dataframe that should look something like this:
0 1 2 3 4
Sector1 col1total col2total col3total col4total
Sector2 col1total col2total col3total col4total
Sector3 col1total col2total col3total col4total
SectorN col1total col2total col3total col4total

or

  1. Split the csv into "sector" dataframes and work with each one individually.

Im a noob and might have bitten off more than I can chew! Currently Im separating the original csv into sector csvs and summing that way using the following code, but as you can imagine its time consuming. Its not really relevant to my question, but heres the code I use:

import pandas

with open('D:\Libraries\Downloads\igcxtest.csv', 'r') as fileObject: reader_object = pandas.read_csv(fileObject, delimiter=';', decimal=',', header=0, names=('Noticia','Ativo','Variacao','Maximo', 'Ultimo','Minimo','Negocios','Semana', 'Mes','3 meses','6 meses','12 meses', 'Volume','Fourpercent','eightpercent', 'twelvepercent','plus25m','plus25q', 'plus50m','plus50q','minusfourpercent', 'minuseightpercent','minustwelvepercent', 'minus25m','minus25q','minus50m','minus50q', 'weekhigh','Twoweekhigh','Threeweekhigh', 'Threemthhigh','Yearhigh','weeklow', 'Twoweeklow','Threeweeklow','Threemthlow', 'Yearlow','Over200','Over50','Over21','Grafico')) # print(reader_object) reader_object.drop(['Noticia', 'Maximo', 'Ultimo','Minimo','Negocios','Semana','Mes','3 meses','6 meses','12 meses','Grafico'], axis=1, inplace=True) #print(reader_object) print(reader_object['Threemthhigh'])

reader_object['up_or_down'] = reader_object['Variacao'].apply(lambda x: 1 if x > 0 else (-1 if x < 0 else 0))
#Calculate ups and downs
ups = 0
downs = 0
up_down = []
for val in reader_object['Variacao']:
    if val >0:
        ups+=1
    elif val < 0:
        downs+=1
up_down = [ups, downs]
print(up_down)

#Calculate cumulative volume
reader_object['Volume'] = reader_object['Volume'] * reader_object['up_or_down']
#print(reader_object)
vol = 0
for v in reader_object['Volume']:
    vol += v

reader_object.drop(['Variacao', 'up_or_down'], axis=1, inplace=True)
print(reader_object)
#print(ups, downs, vol)

#Sum columns
sums = reader_object.sum(numeric_only=True)
print(sums)
sums_list = []
x=0
for x in sums:
    sums_list.append(x)
sums_list = up_down + sums_list
print(sums_list)
sum_df = pandas.DataFrame(sums_list).T
#print(sum_df)
sum_df.to_csv('D:\Libraries\Downloads\sums.csv')

I cant find info on ways to read the original csv, and split it by using (I imagine) keywords in rows ("sectorN") or by numbers of lines (how many "codes" in each "sector") first.

Any suggestions? Thanks...


Solution

  • You could try this:

    1. Read the csv into a dataframe
    2. Forward-fill the State column df['State'].fillna(method = 'ffill', inplace = True)
    3. Filter out rows where Town is null df2 = df.query('Town is not null')
    4. Aggregate / group by State and Town, and sum each of the measure columns df3 = df2.groupby(['State', 'Town']).sum().reset_index()