How do I read a csv of national data, where first column is "state", second "town", and sum columns by state? (Python/Pandas)

Every day I create a csv that looks something like this.

Sector	Code	Col1	Col2	Col3	Col4
Sect1
	CodeA	data	data	data	data
	CodeB	data	data	data	data
	CodeC	data	data	data	data
	CodeD	data	data	data	data
Sect2
	CodeE	data	data	data	data
	CodeF	data	data	data	data
	CodeG	data	data	data	data
	CodeH	data	data	data	data
Sect3
	CodeI	data	data	data	data
	CodeJ	data	data	data	data
	CodeK	data	data	data	data
	CodeL	data	data	data	data
SectN
	CodeX	data	data	data	data
	CodeY	data	data	data	data
	CodeZ	data	data	data	data

Note the blank cells: the whole row after the sector name, and the 'sector' column before each code.

Id like to read the csv and EITHER:

Sum the columns by sector, creating a dataframe that should look something like this:

0	1	2	3	4
Sector1	col1total	col2total	col3total	col4total
Sector2	col1total	col2total	col3total	col4total
Sector3	col1total	col2total	col3total	col4total
SectorN	col1total	col2total	col3total	col4total

Split the csv into "sector" dataframes and work with each one individually.

Im a noob and might have bitten off more than I can chew! Currently Im separating the original csv into sector csvs and summing that way using the following code, but as you can imagine its time consuming. Its not really relevant to my question, but heres the code I use:

import pandas

with open('D:\Libraries\Downloads\igcxtest.csv', 'r') as fileObject: reader_object = pandas.read_csv(fileObject, delimiter=';', decimal=',', header=0, names=('Noticia','Ativo','Variacao','Maximo', 'Ultimo','Minimo','Negocios','Semana', 'Mes','3 meses','6 meses','12 meses', 'Volume','Fourpercent','eightpercent', 'twelvepercent','plus25m','plus25q', 'plus50m','plus50q','minusfourpercent', 'minuseightpercent','minustwelvepercent', 'minus25m','minus25q','minus50m','minus50q', 'weekhigh','Twoweekhigh','Threeweekhigh', 'Threemthhigh','Yearhigh','weeklow', 'Twoweeklow','Threeweeklow','Threemthlow', 'Yearlow','Over200','Over50','Over21','Grafico')) # print(reader_object) reader_object.drop(['Noticia', 'Maximo', 'Ultimo','Minimo','Negocios','Semana','Mes','3 meses','6 meses','12 meses','Grafico'], axis=1, inplace=True) #print(reader_object) print(reader_object['Threemthhigh'])

reader_object['up_or_down'] = reader_object['Variacao'].apply(lambda x: 1 if x > 0 else (-1 if x < 0 else 0))
#Calculate ups and downs
ups = 0
downs = 0
up_down = []
for val in reader_object['Variacao']:
    if val >0:
        ups+=1
    elif val < 0:
        downs+=1
up_down = [ups, downs]
print(up_down)

#Calculate cumulative volume
reader_object['Volume'] = reader_object['Volume'] * reader_object['up_or_down']
#print(reader_object)
vol = 0
for v in reader_object['Volume']:
    vol += v

reader_object.drop(['Variacao', 'up_or_down'], axis=1, inplace=True)
print(reader_object)
#print(ups, downs, vol)

#Sum columns
sums = reader_object.sum(numeric_only=True)
print(sums)
sums_list = []
x=0
for x in sums:
    sums_list.append(x)
sums_list = up_down + sums_list
print(sums_list)
sum_df = pandas.DataFrame(sums_list).T
#print(sum_df)
sum_df.to_csv('D:\Libraries\Downloads\sums.csv')

I cant find info on ways to read the original csv, and split it by using (I imagine) keywords in rows ("sectorN") or by numbers of lines (how many "codes" in each "sector") first.

Any suggestions? Thanks...

Solution

You could try this:

Read the csv into a dataframe
Forward-fill the State column df['State'].fillna(method = 'ffill', inplace = True)
Filter out rows where Town is null df2 = df.query('Town is not null')
Aggregate / group by State and Town, and sum each of the measure columns df3 = df2.groupby(['State', 'Town']).sum().reset_index()