Search code examples
pythonoptimizationcprofile

Optimizing for loop in Python to work faster


I am working to optimize Python code. The goal is to take a list of integers and calculate and output how many pairs there are in the list. A pair is considered to be 2 numbers with the difference of K ( 2 in this case) For example:

k = 2
list = [1, 5, 3, 4, 2]

The pairs here will be (1,3), (5,3), (2,4)
The answer is: 3

I want to increase efficiency of the code, current version takes 8 second or more.

cProfile tells me that for number in sorted_array: is the only line that takes all the time. But I cannot seem to figure out how to optimize for loop.

Does anyone have any experience or suggestions? Thank you so much.

The code:

#generate random numbers
import bisect
import random
n_integers = random.sample(xrange(1, 29999), 29998)
####cProfile
import cProfile
pr = cProfile.Profile()
pr.enable()

#the difference between numbers we are looking for
k = 2
sorted_array = []
pairs_counter = 0

#insert N integers in array in sorted fashion and typecast
for number in n_integers:
    bisect.insort_left(sorted_array, number)

#iterate over the array and calculate (number + K)
for number in sorted_array:
    the_pair = number + k
    #check if the number+K is in the array
    if the_pair in sorted_array:
        pairs_counter += 1

print pairs_counter

#Close cProfile
pr.disable()
pr.print_stats(sort = 'time')

cProfile:

30075 function calls in 7.995 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    7.834    7.834    7.834    7.834 <ipython-input-5-19d578e3c582>:19(<module>)
    29998    0.143    0.000    0.143    0.000 {_bisect.insort_left}
        1    0.016    0.016    0.159    0.159 <ipython-input-5-19d578e3c582>:15(<module>)

Solution

  • If [1,3,3,3,3,3,6] results in five pairs (k=2), you could use numpy's broadcasting feature to eliminate the Python for loops.

    import numpy as np
    import random
    
    a = random.sample(xrange(1, 29999), 29998)
    a = np.array(a)
    # or just a = np.random.randint(1, 29999, 29998)
    
    k = 2
    

    Create a new array that contains all the integers that would make pairs

    b = a + k
    

    Create a boolean array by broadcasting b across a: this results in a 2-d array with True's everywhere there is a pair.

    c = a[:, np.newaxis] == b
    

    Sum all the True's

    np.sum(c)
    

    Or just:

    np.sum(a[:, np.newaxis] == b)
    

    If, as the example input suggests, the list only contains unique values the numpy solution would be:

    a = random.sample(xrange(1, 29999), 29998)
    k = 2
    a = np.array(a)
    b = a + k
    result = np.sum(np.in1d(b, a, assume_unique=True))
    

    Which is much faster.

    In fact if the values are NOT unique, numpy.in1d is much faster than the broadcasting solution, above. By switching the order of the arguments, you count five pairs for [1,3,3,3,3,3,6].

    result = np.sum(np.in1d(a, b))
    

    Now for a bit of crow eating: turning the list to a set (assuming unique values), a pure Python solution is faster than the numpy solutions.

    q = 10000
    a = random.sample(xrange(1, q), q-1)
    a = set(a)
    result = sum(n+k in a for n in a)
    

    Using sum to consume a generator expression doesn't require making any intermediate objects - probably one reason for its speed/efficiency.