I have a txt file with the following content:
ID1;ID2;TIME;VALUE
1000;100;012021;12
1000;100;022021;4129
1000;100;032021;128
1000;100;042021;412
1000;100;052021;12818
1000;120;022021;4129
1000;100;062021;546
1000;100;072021;86
1000;120;052021;12818
1000;100;082021;754
1000;100;092021;2633
1000;100;102021;571
1000;120;092021;2633
1000;100;112021;2233
1000;100;122021;571
1000;120;012021;12
1000;120;032021;128
1000;120;042021;412
1000;120;062021;546
1000;120;072021;86
1000;120;082021;754
1000;120;102021;571
1000;120;112021;2233
1000;120;122021;571
1000;100;012022;12
1000;100;022022;4129
1000;120;022022;4129
1000;120;032022;128
1000;120;042022;412
1000;100;032022;128
1000;100;042022;412
1000;100;052022;12818
1000;100;062022;546
1000;100;072022;86
1000;100;082022;754
1000;120;072022;86
1000;120;082022;754
1000;120;092022;2633
1000;120;102022;571
1000;100;092022;2633
1000;100;102022;571
1000;100;112022;2233
1000;100;122022;571
1000;120;012022;12
1000;120;052022;12818
1000;120;062022;546
1000;120;112022;2233
1000;120;122022;571
I need to make aggregates of time (half year, total year), using the items from column time, which have the same ID1, ID2 and sum up the values.
The output should look like this:
I would appreciate your help! This is what I have so far for half year: #already sorted by time
data=open("file.txt").readlines()
count = 0
for line in data:
count += 1
for n in range(count - 1, len(data), 6):
subList = [data[n:n + 6]]
break
This is a task that is very well suited for the pandas library, which is designed to work with tabular data. A way to do this in pandas would be something like this:
import pandas as pd
# Read the data from the textfile (here I saved it as data.csv).
# Make sure the TIME column is not read as an integer by declaring
# dtype={'TIME': object}, because this would omit the leading zeroes, which
# we need for conversion to datetime object
df = pd.read_csv('data.csv', delimiter=';', dtype={'TIME': object})
# Convert TIME column to datetime
df['TIME'] = pd.to_datetime(df['TIME'], format='%m%Y')
# Create new column with year
df['Y'] = df['TIME'].dt.year
# Create new column with halfyear (1 = first halfyear, 2 = second halfyear)
df['HY'] = df['TIME'].dt.month.floordiv(7).add(1)
After this, your table looks like this:
df.head() # Just show the first couple of rows
ID1 ID2 TIME VALUE Y HY
0 1000 100 2021-01-20 12 2021 1
1 1000 100 2021-02-20 4129 2021 1
2 1000 100 2021-03-20 128 2021 1
3 1000 100 2021-04-20 412 2021 1
4 1000 100 2021-05-20 12818 2021 1
Getting the table into the desired format takes a bit of work, but grouping and aggregating then becomes really easy. You can then also perform other grouping and aggregating operations as you please without having to code it all from hand.
To group by year and calculate the sum:
df.groupby(['ID1', 'ID2', 'Y']).sum()
ID1 ID2 Y VALUE
1000 100 2021 24893
1000 100 2022 24893
1000 120 2021 24893
1000 120 2022 24893
To group by halfyear and calculate the sum:
df.groupby(['ID1', 'ID2', 'Y', 'HY']).sum()
ID1 ID2 Y HY VALUE
1000 100 2021 1 18045
1000 100 2021 2 6848
1000 100 2022 1 18045
1000 100 2022 2 6848
1000 120 2021 1 18045
1000 120 2021 2 6848
1000 120 2022 1 18045
1000 120 2022 2 6848
Edit: Added a datetime format specifier to correctly read the date as MMYYYY instead of MMDDYY. Thanks to shunty for mentioning it! The results shown in this post will of course be different now than the actual results.