python list for-loop tuples urllib3

Looking for a more efficient/pythonic way to sum tuples in a list, and compute an average


I am trying to do some basic computations with data from the web. To that end, I found some code that extracts the begin and end years of Rembrandt works and saves them in a list:

date_list = [(work['datebegin'], work['dateend']) for work in rembrandt2_parsed['records']]

date_list is a list of tuples holding the begin and end years of some Rembrandt works in the Harvard Art Museum. For completeness, it looks like this:

[(0, 0), (1648, 1648), (1637, 1647), (1626, 1636), (0, 0), (1638, 1638), (1635, 1635), (1634, 1634), (0, 0), (0, 0)]

Now I want to sum over this list of tuples and compute the average of the begin years and of the end years, skipping the entries that are zero (i.e. missing). I came up with this solution:

datebegin = 0
date_end = 0
count_begin = 0
count_end = 0

for x, y in date_list:
    if x != 0:
        datebegin += x
        count_begin += 1
    if y != 0:
        date_end += y
        count_end += 1

final_date_begin = datebegin / count_begin  # value = year 1636
final_date_end = date_end / count_end  # value = year 1639

But I think this can be done in a much more efficient/pythonic way. First, I seem to need a lot of code for such a simple task, and second, I need to initialize 4(!) global variables if I do it this way. Could someone enlighten me and show me a more efficient way to solve this?


Solution

  • You can use numpy to solve this:

    import numpy as np
    
    result = list(np.ma.masked_equal(date_list, 0).mean(axis=0))
    

    Here we first convert date_list to an array, then mask out the zero values, and finally calculate the average along the first axis (that is, one average per column).

    For your sample data, we obtain:

    >>> list(np.ma.masked_equal(date_list, 0).mean(axis=0))
    [1636.3333333333333, 1639.6666666666667]
    

    Performance: for a list containing 100'000 2-tuples, generated with:

    from random import randint
    
    date_list = [(randint(0, 10), randint(0, 10)) for _ in range(100000)]
    

    we ran the computation (wrapped in a function f) 1'000 times and obtained:

    >>> timeit(f, number=1000)
    51.31010195999988
    

    so locally, this takes about 51.3 ms per run for a 100'000×2 "matrix".
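
The steps described above can be spelled out as a small sketch. The function names masked_mean and loop_mean are made up for illustration and do not appear in the answer. The answer also does not show what its f looks like, so wrapping the expression in a function and timing it this way is an assumption. loop_mean is a plain-Python cross-check of the same result, not a replacement for the numpy approach.

```python
from timeit import timeit

import numpy as np

# Sample data from the question.
date_list = [(0, 0), (1648, 1648), (1637, 1647), (1626, 1636), (0, 0),
             (1638, 1638), (1635, 1635), (1634, 1634), (0, 0), (0, 0)]


def masked_mean(pairs):
    """Column-wise mean of (begin, end) pairs, ignoring zeros (hypothetical helper)."""
    arr = np.asarray(pairs)              # shape (n, 2)
    masked = np.ma.masked_equal(arr, 0)  # hide every entry equal to 0
    return masked.mean(axis=0)           # average down each column


def loop_mean(pairs):
    """Plain-Python cross-check; raises ZeroDivisionError if a column is all zeros."""
    result = []
    for column in zip(*pairs):           # transpose: begin years, then end years
        nonzero = [value for value in column if value != 0]
        result.append(sum(nonzero) / len(nonzero))
    return result


begin_avg, end_avg = masked_mean(date_list)
print(begin_avg, end_avg)

# Both approaches should agree on the sample data.
assert np.allclose([begin_avg, end_avg], loop_mean(date_list))

# Assumed timing setup: 1000 repetitions on the small sample list.
elapsed = timeit(lambda: masked_mean(date_list), number=1000)
print(f"{elapsed:.4f} s for 1000 runs")
```

Timings from this sketch will differ from the figure quoted above, since it runs on the ten sample tuples rather than on 100'000 random ones.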