I am trying to do some basic computations with data from the web. For this purpose, I have found some code that extracts the begin and end years for Rembrandt works and saves them in a list:
date_list = [(work['datebegin'], work['dateend']) for work in rembrandt2_parsed['records']]
date_list is a list of tuples containing the begin and end years for some Rembrandt works in the Harvard Art Museums. For completeness, it looks like this:
[(0, 0), (1648, 1648), (1637, 1647), (1626, 1636), (0, 0), (1638, 1638), (1635, 1635), (1634, 1634), (0, 0), (0, 0)]
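For context, each record is presumably a dict with (at least) a 'datebegin' and a 'dateend' key; a minimal hypothetical example of one entry in rembrandt2_parsed['records']:
record = {'datebegin': 1648, 'dateend': 1648}  # hypothetical; 0 in either field means the year is unknown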
Now I want to do some basic computations: I want to sum over this list of tuples and compute the average of the begin and end years where they are not zero. I came up with a solution:
date_begin = 0
date_end = 0
count_begin = 0
count_end = 0
for x, y in date_list:
    if x != 0:  # skip the 0 placeholders for unknown begin years
        date_begin += x
        count_begin += 1
    if y != 0:  # skip the 0 placeholders for unknown end years
        date_end += y
        count_end += 1
final_date_begin = date_begin / count_begin  # value = year 1636
final_date_end = date_end / count_end  # value = year 1639
But I think this can be done much more efficiently and Pythonically: first, I seem to need a lot of code for such a simple task, and second, I have to initialize four(!) variables to do it this way. Could someone enlighten me and show me a more efficient way to solve this?
You can use numpy to solve this:
import numpy as np
result = list(np.ma.masked_equal(date_list, 0).mean(axis=0))
Here we thus first store date_list in an array, mask out the zero values, and then calculate the average over the first axis (i.e. per column).
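Broken into separate steps, the same computation reads as the following sketch:
import numpy as np

arr = np.asarray(date_list)          # shape (10, 2): begin and end year columns
masked = np.ma.masked_equal(arr, 0)  # mask the 0 placeholders for unknown years
result = list(masked.mean(axis=0))   # column-wise mean; masked entries are ignored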
For your sample data, we obtain:
>>> list(np.ma.masked_equal(date_list, 0).mean(axis=0))
[1636.3333333333333, 1639.6666666666667]
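If you would rather avoid the numpy dependency, a minimal pure-Python sketch along the same lines, using the statistics module:
from statistics import mean

begins, ends = zip(*date_list)  # unzip into a tuple of begin years and a tuple of end years
final_date_begin = mean(y for y in begins if y != 0)  # 1636.3333...
final_date_end = mean(y for y in ends if y != 0)      # 1639.6666...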
Performance: for a list containing 100,000 2-tuples, generated with:
from random import randint
date_list = [(randint(0, 10), randint(0, 10)) for _ in range(100000)]
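Here f presumably wraps the masked-mean expression from above, e.g.:
from timeit import timeit

def f():
    return list(np.ma.masked_equal(date_list, 0).mean(axis=0))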
Repeating this function 1,000 times, we obtain:
>>> timeit(f, number=1000)
51.31010195999988
so locally, this processes a 100,000×2 "matrix" in 51.3 ms per run.