Search code examples
pythonloopsaggregation

Python Aggregation without PANDAS


I have a sorted and nested list. Each element in the list has 3 sub-elements; 'Drugname', Doctor_id, Amount. For a given drugname (which repeats) the doctor ids are different and so are the amounts. See sample list below..

I need an output where, for each drugname, I need to count the total UNIQUE doctor ids and the sum of the dollar amount for that drug. For ex, for the list snippet below..

[
   ['CIPROFLOXACIN HCL', 1801093968, 61.49],
   ['CIPROFLOXACIN HCL', 1588763981, 445.23],
   ['HYDROCODONE-ACETAMINOPHEN', 1801093968, 251.52],
   ['HYDROCODONE-ACETAMINOPHEN', 1588763981, 263.16],
   ['HYDROXYZINE HCL', 1952310666, 945.5],
   ['IBUPROFEN', 1801093968, 67.06],
   ['INVEGA SUSTENNA', 1952310666, 75345.68]
]

The desired output is as below.

[
   ['CIPROFLOXACIN HCL', 2, 516.72],
   ['HYDROCODONE-ACETAMINOPHEN', 2, 514.68]
   ['HYDROXYZINE HCL', 1, 945.5]
   ['IBUPROFEN', 1, 67.06]
   ['INVEGA SUSTENNA', 1, 75345.68]
]

In a database world this is the easiest thing with a simple GROUP BY on drugname. In Python, I am not allowed to use PANDAS, NumPy etc. Just the basic building blocks of Python. I tried the below code but I am unable to reset the count variable to count doctor ids and amounts. This commented code is one of several attempts. Not sure if I need to use a nested for loop or a for loop-while loop combo.

All help is appreciated!

aggr_list = []
temp_drug_name = ''
doc_count = 0
amount = 0
for list_element in sorted_new_list:
    temp_drug_name = list_element[0]
    if temp_drug_name == list_element[0]:
        amount += float(amount)
        doc_count += 1

    aggr_list.append([temp_drug_name, doc_count, amount])

print(aggr_list)

Solution

  • Since the list is already sorted you can simply iterate through the list (named l in the example below) and keep track of the name of the last iteration, and if the name of the current iteration is different from the last, insert a new entry to the output. Use a set to keep track of the doctor IDs already seen for the current drug, and only increment the the second item of the last entry of the output by 1 if the doctor ID is not seen. And increment the third item of the last entry of the output by the amount of the current iteration:

    output = []
    last = None
    for name, id, amount in l:
        if name != last:
            output.append([name, 0, 0])
            last = name
            ids = set()
        if id not in ids:
            output[-1][1] += 1
            ids.add(id)
        output[-1][2] += amount
    

    output becomes:

    [['CIPROFLOXACIN HCL', 2, 506.72],
     ['HYDROCODONE-ACETAMINOPHEN', 2, 514.6800000000001],
     ['HYDROXYZINE HCL', 1, 945.5],
     ['IBUPROFEN', 1, 67.06],
     ['INVEGA SUSTENNA', 1, 75345.68]]
    

    Note that decimal floating points are approximated in the binary system that the computer uses (please read Is floating point math broken?), so some minor errors are inevitable as seen in the sum of the second entry above.