Search code examples
pythondatabaseaggregategrouping

Aggregates in python


I have a txt file with the following content:

ID1;ID2;TIME;VALUE  
1000;100;012021;12
1000;100;022021;4129
1000;100;032021;128
1000;100;042021;412
1000;100;052021;12818
1000;120;022021;4129
1000;100;062021;546
1000;100;072021;86
1000;120;052021;12818
1000;100;082021;754
1000;100;092021;2633
1000;100;102021;571
1000;120;092021;2633
1000;100;112021;2233
1000;100;122021;571
1000;120;012021;12
1000;120;032021;128
1000;120;042021;412
1000;120;062021;546
1000;120;072021;86
1000;120;082021;754
1000;120;102021;571
1000;120;112021;2233
1000;120;122021;571
1000;100;012022;12
1000;100;022022;4129
1000;120;022022;4129
1000;120;032022;128
1000;120;042022;412
1000;100;032022;128
1000;100;042022;412
1000;100;052022;12818
1000;100;062022;546
1000;100;072022;86
1000;100;082022;754
1000;120;072022;86
1000;120;082022;754
1000;120;092022;2633
1000;120;102022;571
1000;100;092022;2633
1000;100;102022;571
1000;100;112022;2233
1000;100;122022;571
1000;120;012022;12
1000;120;052022;12818
1000;120;062022;546
1000;120;112022;2233
1000;120;122022;571

I need to make aggregates of time (half year, total year), using the items from column time, which have the same ID1, ID2 and sum up the values.

The output should look like this: enter image description here

I would appreciate your help! This is what I have so far for half year: #already sorted by time

data=open("file.txt").readlines()
count = 0
for line in data:
    count += 1
    for n in range(count - 1, len(data), 6):
       subList = [data[n:n + 6]]
       break

Solution

  • This is a task that is very well suited for the pandas library, which is designed to work with tabular data. A way to do this in pandas would be something like this:

    import pandas as pd
    # Read the data from the textfile (here I saved it as data.csv).
    # Make sure the TIME column is not read as an integer by declaring 
    # dtype={'TIME': object}, because this would omit the leading zeroes, which 
    # we need for conversion to datetime object
    df = pd.read_csv('data.csv', delimiter=';',  dtype={'TIME': object})
    # Convert TIME column to datetime
    df['TIME'] = pd.to_datetime(df['TIME'], format='%m%Y')
    # Create new column with year
    df['Y'] = df['TIME'].dt.year
    # Create new column with halfyear (1 = first halfyear, 2 = second halfyear)
    df['HY'] = df['TIME'].dt.month.floordiv(7).add(1)
    

    After this, your table looks like this:

    df.head() # Just show the first couple of rows
    
        ID1     ID2 TIME        VALUE   Y       HY
    0   1000    100 2021-01-20  12      2021    1
    1   1000    100 2021-02-20  4129    2021    1
    2   1000    100 2021-03-20  128     2021    1
    3   1000    100 2021-04-20  412     2021    1
    4   1000    100 2021-05-20  12818   2021    1
    

    Getting the table into the desired format takes a bit of work, but grouping and aggregating then becomes really easy. You can then also perform other grouping and aggregating operations as you please without having to code it all from hand.

    To group by year and calculate the sum:

    df.groupby(['ID1', 'ID2', 'Y']).sum()
    
    ID1     ID2 Y       VALUE  
    1000    100 2021    24893
    1000    100 2022    24893
    1000    120 2021    24893
    1000    120 2022    24893
    

    To group by halfyear and calculate the sum:

    df.groupby(['ID1', 'ID2', 'Y', 'HY']).sum()
    
    ID1     ID2 Y       HY  VALUE  
    1000    100 2021    1   18045
    1000    100 2021    2   6848
    1000    100 2022    1   18045
    1000    100 2022    2   6848
    1000    120 2021    1   18045
    1000    120 2021    2   6848
    1000    120 2022    1   18045
    1000    120 2022    2   6848
    

    Edit: Added a datetime format specifier to correctly read the date as MMYYYY instead of MMDDYY. Thanks to shunty for mentioning it! The results shown in this post will of course be different now than the actual results.