Tags: python, list, numpy, performance, overhead

When do numpy arrays become more efficient than Python lists for access operations? (considering the number of elements as a variable)


I recently had to implement a simple brute-force program in Python, and I was getting terrible execution times (even for O(n^2) time complexity), exceeding 10 minutes of runtime for a total of 3700 * 11125 * 2 = 82,325,000 access operations on numpy arrays (Intel i5-4300U).

I'm talking about access operations because I initialized all the arrays beforehand (outside the brute-force loops) and then just recycle them, overwriting values without reallocating anything.

I get that brute-force algorithms are supposed to be very slow, but 82 million accesses on contiguous-memory float arrays should not take 10 minutes even on the oldest of CPUs...

After some research I found this post: Why is numpy.array so slow?, which led me to think that maybe numpy arrays of length 3700 are too small to overcome some sort of overhead (which I don't understand, since, as I said, I recycle my arrays rather than reallocating them). So I tried substituting lists for the arrays and... voilà, runtime was down to under a minute (55 seconds), a roughly 10x decrease!

But now I'm kind of baffled as to why a list can be faster than an array, and, more importantly, where is the threshold? The cited post says that numpy arrays are very efficient with large amounts of data, but when does "a lot of data" become "enough data" to actually see an advantage? Is there a reliable way to determine whether to use arrays or lists? Should I worry about writing my functions twice, once for each case?

For anyone wondering, the original script was something like (list version, mockup data):

# mockup data: each element of X and y is a one-element row, read as X[i][0]
X = [[0.0]] * 3700  # list of 3700 elements
y = [[1.0]] * 3700

combinations = [(0, 0)] * 11125

# build each result row separately: [[0, 0, 0]] * n would alias one shared inner list
grg_results = [[0, 0, 0] for _ in range(len(combinations))]
rgr_results = [[0, 0, 0] for _ in range(len(combinations))]

grg_temp = [100] * (3700 + 1)
rgr_temp = [100] * (3700 + 1)

metrics = [sum]  # placeholder; the real script used its own metric functions

for comb in range(len(combinations)):
    pivot_a = combinations[comb][0]
    pivot_b = combinations[comb][1]
    for i in range(len(X)):
        _x = X[i][0]
        _y = y[i][0]
        if _x < pivot_a:
            grg_temp[i + 1] = _y * grg_temp[i]
            rgr_temp[i + 1] = (2 - _y) * rgr_temp[i]
        elif pivot_a <= _x <= pivot_b:
            grg_temp[i + 1] = (2 - _y) * grg_temp[i]
            rgr_temp[i + 1] = _y * rgr_temp[i]
        else:
            grg_temp[i + 1] = _y * grg_temp[i]
            rgr_temp[i + 1] = (2 - _y) * rgr_temp[i]
    grg_results[comb][0] = pivot_a
    grg_results[comb][1] = pivot_b
    rgr_results[comb][0] = pivot_a
    rgr_results[comb][1] = pivot_b
    grg_results[comb][2] = metrics[0](grg_temp)
    rgr_results[comb][2] = metrics[0](rgr_temp)
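
For reference, the per-element access comparison described above can be reproduced with a small harness along these lines (a sketch only: the sizes, the data, and the body of indexed_loop are placeholders standing in for the real computation, so absolute numbers will differ):

import timeit

import numpy as np

n = 3700  # same length as in the script above; the value itself is arbitrary

def indexed_loop(src, dst):
    # per-element reads from src and writes into dst, mirroring the inner loop above
    for i in range(len(src)):
        dst[i + 1] = src[i] * 0.5 + dst[i]

src_list, dst_list = [1.0] * n, [0.0] * (n + 1)
src_arr, dst_arr = np.ones(n), np.zeros(n + 1)

print("lists :", timeit.timeit(lambda: indexed_loop(src_list, dst_list), number=1000))
print("arrays:", timeit.timeit(lambda: indexed_loop(src_arr, dst_arr), number=1000))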

Solution

  • Let's do some simple list and array comparisons.

    Make a list of 0s (as you do):

    In [108]: timeit [0]*1000
    2.83 µs ± 0.399 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
    

    Make an array from that list - a lot more time:

    In [109]: timeit np.array([0]*1000)
    84.9 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    

    Make the array of 0s in the correct numpy way:

    In [110]: timeit np.zeros(1000)
    735 ns ± 0.727 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
    

    Now sum all the elements of the list:

    In [111]: %%timeit alist = [0]*1000
         ...: sum(alist)
    5.74 µs ± 215 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
    

    Sum of the array:

    In [112]: %%timeit arr = np.zeros(1000)
         ...: arr.sum() 
    6.41 µs ± 26.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
    

    In this case the list is a bit faster; but make a much larger list/array:

    In [113]: %%timeit alist = [0]*100000
         ...: sum(alist) 
    545 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    In [114]: %%timeit arr = np.zeros(100000)
         ...: arr.sum()   
    56.7 µs ± 37 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    

    The list sum scales O(n); the array sum also scales O(n), but with a much lower per-element cost (and a higher fixed "setup" cost), because the iteration is done in C code.
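
    A related point, and the one that matches the access pattern in the question, is per-element indexing: every arr[i] has to be boxed into a numpy scalar object, so a plain Python loop over an array pays more per access than the same loop over a list. A rough way to see both the type difference and the cost, using only the standard timeit module:

    import timeit
    import numpy as np

    alist = [0.0] * 100000
    arr = np.zeros(100000)

    # indexing the array yields a numpy scalar object, not a plain Python float
    print(type(alist[0]))   # <class 'float'>
    print(type(arr[0]))     # <class 'numpy.float64'>

    # time a Python-level loop that reads every element by index
    print(timeit.timeit(lambda: [alist[i] for i in range(len(alist))], number=100))
    print(timeit.timeit(lambda: [arr[i] for i in range(len(arr))], number=100))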

    Add 1 to all elements of the list:

    In [115]: %%timeit alist = [0]*100000
         ...: [i+1 for i in alist] 
    4.64 ms ± 168 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    A similar list comprehension on the array is slower:

    In [116]: %%timeit arr = np.zeros(100000)
         ...: np.array([i+1 for i in arr]) 
    24.2 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    But doing a whole-array operation (where the iteration and the addition are done in C code) is much faster:

    In [117]: %%timeit arr = np.zeros(100000)
         ...: arr+1
    64.1 µs ± 85.3 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    

    For smaller sizes the [115] list comprehension will be faster; there's no clear threshold.
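
    If you want to locate the crossover on your own machine, a small sweep like the following compares the [115]-style list comprehension against the [117]-style whole-array add over a range of lengths (a sketch; the sizes and repeat counts are arbitrary):

    import timeit
    import numpy as np

    for n in (10, 100, 1_000, 10_000, 100_000):
        alist = [0.0] * n
        arr = np.zeros(n)
        t_list = timeit.timeit(lambda: [i + 1 for i in alist], number=1000)
        t_arr = timeit.timeit(lambda: arr + 1, number=1000)
        print(f"n={n:>7}  list comprehension: {t_list:.4f}s  arr + 1: {t_arr:.4f}s")

    The exact crossover depends on the operation, the dtype, the machine, and the Python/numpy versions, which is why there is no single safe threshold to hard-code.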