Search code examples
pythonnumpyfilterlist-comprehensiongenerator

Efficient filter, map, and reduce operation on a long list in Python


There is a list (price_list) of 10 million numbers as a price in range (0-100). I would like to do the following operations consequently:

  1. Filter the list with this condition (list_item < 30)
  2. Multiply each list_item with a fixed number (mapping)
  3. compute the sum of all elements (reduce)

Here is the sample code:

import time
import numpy as np
from functools import reduce

with open('price_list.txt') as f:
    price_list = f.read().split('\n')
    
price_list = np.array(price_list).astype(int)  # convert string to int

I tried following options:

  • Using NumPy module (0.1069948673248291 seconds): total = np.sum(price_list[price_list < 30] * 1.05)

  • Using brute force loop (9.485718965530396 seconds)

  • Using filter, map, and reduce functions (10.078220844268799 seconds): total = reduce(lambda x,y: x+y,map(lambda x: x * 1.05, filter(lambda x: x < 30, price_list)) )

  • Using list comprehension (8.609008073806763 seconds): total= sum([x * 1.05 for x in price_list if x <30])

  • Using generator (8.780538320541382 seconds): total= sum((x * 1.05 for x in price_list if x <30))

It is obvious that the NumPy module is significantly fast in these operations. Is there any alternative solution for running this code faster using Python built-in functions and capabilities?


Solution

  • Answer

    The Numpy is fast in the calculation but it is inefficient in converting the string list to an int list

    Tips:

    1. Use the Numpay array with NumPy functions only, using the Numpay data structure with pythonic functins is inefficient and vice-versa
    2. Counter in some cases (when doing the conversion inside the counter) is the fastest among the Pythonic methods
    3. To get the benefit of Numpy efficiency do the list conversion with python.

    To better compare the results, the running time can be divided into three categories for this problem:

    1. Time for reading from file (same for all tries)
    2. Time to convert the string list into an int list (different for python list and numpy array)
    3. Time to calculation (varies for different methods)

    This separation is required to clearly show the efficiency of python list vs NumPy array. Having a look to the code:

    import time
    import numpy as np
    from functools import reduce
    from collections import Counter
    
    
    def print_result(total_price, duration, method):
        print('{method:<35s} {0:>20.3f} (sec), total_price={1:<10.3f}'
              .format(duration, total_price, method=method))
    
    
    start = time.time()
    with open('price_list.txt') as f:
        price_list = f.read().split('\n')
    print('Duration for reading from file: {0:<50.3f}'.format(time.time() - start))
    
    start = time.time()
    np_price_array = np.array(price_list).astype(int)  # uing numpy array
    print('Duration for converting to numpy int array: {0:<50.3f}'.format(
        time.time() - start))
    
    start = time.time()
    price_list = list(map(int, price_list))
    print('Duration for converting to python int list: {0:<50.3f}'.format(
        time.time() - start))
    
    start = time.time()    
    total_price = np.sum(np_price_array[np_price_array < 30] * 1.05)
    print_result(total_price, time.time() - start, "NumPy function (numpy array)")
    
    start = time.time()
    total_price = sum([x for x in np_price_array if x < 30])*1.05
    print_result(total_price, time.time() - start,
                 "List comprehension (numpy array)")
    
    
    start = time.time()
    total_price = sum((x for x in price_list if x < 30))*1.05
    print_result(total_price, time.time() - start, "List generator (Python list)")
    
    start = time.time()
    total_price = sum([x for x in price_list if x < 30])*1.05
    print_result(total_price, time.time() - start,
                 "List comprehension (Python list)")
    
    start = time.time()
    total_price = sum([float(x[0]) * float(x[1])
                      for x in Counter(price_list).items() if x[0] < 30]) * 1.05
    print_result(total_price, time.time() - start, "Counter (Python list)")
    
    start = time.time()
    total_price = reduce(
        lambda x, y: x+y, filter(lambda x: x < 30, price_list)) * 1.05
    print_result(total_price, time.time() - start, "functions (Python list)")
    

    Following chart is as the result of the code output:

    enter image description here