python arrays performance numpy graph-tool

Average of numpy array ignoring specified value

I have a number of 1-dimensional numpy ndarrays, each containing the path lengths between a given node and all other nodes in a network, and I would like to calculate the average path length. The complication is that when no path exists between two nodes, the algorithm returns a value of 2147483647 for that connection. Left untreated, this value would grossly inflate the average, since a typical path length in my network is somewhere between 1 and 3.

One option would be to loop through all elements of all arrays, replace 2147483647 with NaN, and then use numpy.nanmean to find the average, though that is probably not the most efficient way of going about it. Is there a way to calculate the average with numpy while simply ignoring all values of 2147483647?

I should add that I could have up to several million arrays with several million values to average over, so any performance gain in how the average is found will make a real difference.

Solution

Why not use the usual numpy boolean filtering for this?

m = my_array[my_array != 2147483647].mean()
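As a fuller illustration of the one-liner above, here is a minimal sketch. The names NO_PATH, mean_ignoring and dist are made up for this example, and the NaN returned for an array with no reachable nodes is an assumption about what you would want in that case, not something stated in the question.

```python
import numpy as np

# Sentinel the question says is returned when no path exists
# (it is the largest signed 32-bit integer).
NO_PATH = 2147483647

def mean_ignoring(arr, sentinel=NO_PATH):
    """Mean of the entries of arr that differ from sentinel.

    Returns NaN if every entry equals the sentinel (assumed behaviour).
    """
    valid = arr[arr != sentinel]  # boolean mask keeps only real path lengths
    if valid.size == 0:
        return float("nan")
    return valid.mean()

# Hypothetical distances from one node to six others; two are unreachable.
dist = np.array([1, 2, NO_PATH, 3, NO_PATH, 2], dtype=np.int32)
print(mean_ignoring(dist))  # 2.0
```

The comparison and the indexing both run in compiled code, so there is no Python-level loop over elements. The indexing does create a temporary copy of the kept values.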

By the way, if you really want speed, the overall algorithm you describe seems naive and could probably be improved a lot.
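One possible direction, if what you need in the end is a single average over all arrays rather than one per array (that is an assumption, the question does not say), is to keep a running sum and count instead of storing intermediate results. The sketch below uses hypothetical names (pooled_mean, arrays) and is only meant to show the idea.

```python
import numpy as np

# Sentinel for "no path", as given in the question.
NO_PATH = 2147483647

def pooled_mean(arrays, sentinel=NO_PATH):
    """Average of all non-sentinel values across an iterable of 1-D arrays."""
    total = 0  # Python int, so the running sum cannot overflow
    count = 0
    for arr in arrays:
        mask = arr != sentinel
        # Sum in int64 so int32 inputs do not overflow within one array.
        total += int(arr[mask].sum(dtype=np.int64))
        count += int(np.count_nonzero(mask))
    return total / count if count else float("nan")

# Two small made-up arrays standing in for the real data.
arrays = [
    np.array([1, 2, NO_PATH], dtype=np.int32),
    np.array([3, NO_PATH, 2], dtype=np.int32),
]
print(pooled_mean(arrays))  # 2.0
```

Because arrays can be any iterable, a generator that produces one array at a time would work too, so only one array needs to be in memory at once.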

Oh, and I guess you are calculating the mean because you have rigorously checked that the underlying distribution is normal, so that the mean actually tells you something, haven't you?