Tags: python, tuples, aggregation, python-itertools, itertools-groupby

Given a large list of tuples, how do I group by the first element of each tuple and sum the last element of each tuple, without a Pandas DataFrame?


I have a large list of tuples where each tuple contains 9 string elements:

pdf_results = [
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/18/22', 'RC', '8', '0', '16', '8'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/18/22', 'SMI', '5', '0', '10', '5'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/19/22', 'RC', '8', '0', '16', '8'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/19/22', 'SMI', '5', '0', '10', '5'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/20/22', 'RC', '8', '0', '16', '8'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/20/22', 'SMI', '5', '0', '10', '5'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/21/22', 'RC', '8', '0', '16', '8'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/21/22', 'SMI', '5', '0', '10', '5'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/23/22', 'SMI', '5', '0', '10', '5'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/24/22', 'RC', '8', '0', '16', '8'),
("Kohl's - Dallas", '-', "Kohl's Cafe", '03/24/22', 'SMI', '5', '0', '10', '5'),
('Bronx-Lebanon Hospital Center', '-', 'Patient Trayline ', '03/18/22', 'RC', '8', '0', '16', '8'),
('Bronx-Lebanon Hospital Center', '-', 'Patient Trayline ', '03/18/22', 'SMI', '5', '0', '10', '5'),
('Bronx-Lebanon Hospital Center', '-', 'Patient Trayline ', '03/19/22', 'RC', '8', '0', '16', '8'),
('Bronx-Lebanon Hospital Center', '-', 'Patient Trayline ', '03/19/22', 'SMI', '5', '0', '10', '5')
]

Without using a Pandas DataFrame, what is the best way to group by the first element of each tuple and sum the last element of each tuple? The output should look like this:

desired_output = [
("Kohl's - Dallas", 70),
("Bronx-Lebanon Hospital Center", 26)
]

I've tried itertools.groupby, which seems to be the most appropriate tool; however, I'm getting stuck on iterating over each group, indexing, and summing the last element of each tuple without running into one of the following obstacles:

  1. The last element of each tuple is a string. If I convert it to int and pass the result to sum(), iteration fails with TypeError: 'int' object is not iterable
  2. A ValueError is raised: invalid literal for int() with base 10: 'b'

Attempt:

from itertools import groupby

def getSiteName(siteChunk):
    return siteChunk[0]

siteNameGroup = groupby(pdf_results, getSiteName)

for key, group in siteNameGroup:
    print(key) # 1st element of tuple as desired
    for pdf_results in group:
        # Raises TypeError: unsupported operand type(s) for +: 'int' and 'str'
        print(sum(pdf_results[8]))
    print()

Solution

  • Assuming tuples with the same first element are adjacent in the list (for example, because the list is sorted by the first element), you can do:

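    from itertools import groupby

    # groupby only merges consecutive items that share the same key
    for k, v in groupby(pdf_results, key=lambda t: t[0]):
        # convert the last element of each tuple to int before summing
        print(k, sum(int(x[-1]) for x in v))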
    

    Prints:

    Kohl's - Dallas 70
    Bronx-Lebanon Hospital Center 26
    

    If the list is not sorted, just use a dict to total the last elements, keyed by the first entry of each tuple:

    res = {}

    for t in pdf_results:
        # add the last element (as int) to the running total for this key
        res[t[0]] = res.get(t[0], 0) + int(t[-1])
    
    >>> res
    {"Kohl's - Dallas": 70, 'Bronx-Lebanon Hospital Center': 26}
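
Since the question asks for a list of (name, total) tuples rather than printed lines or a dict, here is a minimal sketch of both approaches returning that shape. The function names `totals_with_groupby` and `totals_with_dict` are made up for illustration, and `rows` is a deliberately interleaved four-row subset of `pdf_results`, so the totals differ from the full data. The groupby version sorts first so that equal keys become adjacent, which also means its result comes back in sorted key order; the dict version keeps keys in first-seen order on Python 3.7+.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical sample: four rows taken from pdf_results, interleaved on purpose
rows = [
    ("Kohl's - Dallas", '-', "Kohl's Cafe", '03/18/22', 'RC', '8', '0', '16', '8'),
    ('Bronx-Lebanon Hospital Center', '-', 'Patient Trayline ', '03/18/22', 'RC', '8', '0', '16', '8'),
    ("Kohl's - Dallas", '-', "Kohl's Cafe", '03/18/22', 'SMI', '5', '0', '10', '5'),
    ('Bronx-Lebanon Hospital Center', '-', 'Patient Trayline ', '03/18/22', 'SMI', '5', '0', '10', '5'),
]


def totals_with_groupby(rows):
    """Sort by the first element so equal keys are adjacent, then group and sum."""
    first = itemgetter(0)
    return [
        (key, sum(int(row[-1]) for row in group))
        for key, group in groupby(sorted(rows, key=first), key=first)
    ]


def totals_with_dict(rows):
    """Accumulate totals in a single pass; no sorting required."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[0]] += int(row[-1])
    return list(totals.items())


print(totals_with_groupby(rows))
# [('Bronx-Lebanon Hospital Center', 13), ("Kohl's - Dallas", 13)]
print(totals_with_dict(rows))
# [("Kohl's - Dallas", 13), ('Bronx-Lebanon Hospital Center', 13)]
```

For a large list the dict version is probably the better default: it is a single O(n) pass, whereas sorting first costs O(n log n). Both still raise ValueError if a last element is not a valid integer string.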