Search code examples
pythonpython-3.xlistnumpyperformance

How do I get a list faster with an index relationship between two large size lists?


The question is given the following two lists.

import numpy as np
import random as rnd

num = 100000

a = list(range(num))
b = [rnd.randint(0, num) for r in range(num)]

Between two lists with a huge size(assuming that the reference list is a), a list indicating where the same element is located in the relative array(b) was created using the list(atob) comprehension method.

atob = [np.abs(np.subtract(b, i)).argmin() for i in a]
print(f"index a to b: {atob}")

It doesn't take long when the list size is small. However, I realized that the process of obtaining the list atob is quite time consuming.

Is there a way to get list atob faster? Or is there currently no way?


(Edited after answering. The purpose of this revision is for future readers.) Many thanks for the replies! Code analysis was conducted based on the answers.

  • check the output

The comparison of results was performed with num = 20.

import numpy as np
import random as rnd
import time

# set lists
num = 20
a = list(range(num))
# b = [rnd.randint(0, num) for r in range(num)] # Duplicate numbers occur among the elements in the list
b = rnd.sample(range(0, num), num)
print(f"list a: {a}")
print(f"list b: {b}\n")

# set array as same as lists
arr_a = np.array(range(num))
arr_b = np.array(rnd.sample(range(0, num), num))

# --------------------------------------------------------- #
# existing method
ck_time = time.time()
atob = [np.abs(np.subtract(b, i)).argmin() for i in a]
print(f"index a to b (existed): {atob}, type: {type(atob)}")
print(f"running time (existed): {time.time() - ck_time}\n")
ck_time = time.time()

# dankal444 method
dk = {val: idx for idx, val in enumerate(b)}
atob_dk = [dk.get(n) for n in a] # same as atob_dk = [d.get(n) for n in range(num)]
print(f"index a to b (dankal): {atob_dk}, type: {type(atob_dk)}")
print(f"running time (dankal): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_dk)}\n")
ck_time = time.time()

# smp55 method
comb = np.array([a, b]).transpose()
atob_smp = comb[comb[:, 1].argsort()][:, 0]
print(f"index a to b (smp55): {atob_smp}, type: {type(atob_smp)}")
print(f"running time (smp55): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_smp)}\n")
ck_time = time.time()

# Roman method
from numba import jit
@jit(nopython=True)
def find_argmin(_a, _b):
    out = np.empty_like(_a)  # allocating result array
    for i in range(_a.shape[0]):
        out[i] = np.abs(np.subtract(_b, _a[i])).argmin()
    return out

atob_rom = find_argmin(arr_a, arr_b)
print(f"index a to b (Roman): {atob_rom}, type: {type(atob_rom)}")
print(f"running time (Roman): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_rom)}\n")
ck_time = time.time()

# Alain method
from bisect import bisect_left
ub   = {n:-i for i,n in enumerate(reversed(b),1-len(b))}  # unique first pos
sb   = sorted(ub.items())                                 # sorted to bisect
ib   = (bisect_left(sb,(n,0)) for n in a)                 # index of >= val
rb   = ((sb[i-1],sb[min(i,len(sb)-1)]) for i in ib)       # low and high pairs
atob_ala = [ i if (abs(lo-n),i)<(abs(hi-n),j) else j      # closest index
               for ((lo,i),(hi,j)),n in zip(rb,a) ]
print(f"index a to b (Alain): {atob_ala}, type: {type(atob_ala)}")
print(f"running time (Alain): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_ala)}\n")
ck_time = time.time()

# ken method
b_sorted, b_sort_indices = np.unique(b, return_index=True)
def find_nearest(value):
    """Finds the nearest value from b."""
    right_index = np.searchsorted(b_sorted[:-1], value)
    left_index = max(0, right_index - 1)
    right_delta = b_sorted[right_index] - value
    left_delta = value - b_sorted[left_index]
    if right_delta == left_delta:
        # This is only necessary to replicate the behavior of your original code.
        return min(b_sort_indices[left_index], b_sort_indices[right_index])
    elif left_delta < right_delta:
        return b_sort_indices[left_index]
    else:
        return b_sort_indices[right_index]

atob_ken = [find_nearest(ai) for ai in a]
print(f"index a to b (ken): {atob_ken}, type: {type(atob_ken)}")
print(f"running time (ken): {time.time() - ck_time}")
print(f"equal to exist method: {np.array_equal(atob, atob_ken)}\n")
ck_time = time.time()

Above code result is

list a: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
list b: [9, 12, 0, 2, 3, 15, 4, 16, 13, 6, 7, 18, 14, 10, 1, 8, 5, 17, 11, 19]

index a to b (existed): [2, 14, 3, 4, 6, 16, 9, 10, 15, 0, 13, 18, 1, 8, 12, 5, 7, 17, 11, 19], type: <class 'list'>
running time (existed): 0.00024008750915527344

index a to b (dankal): [2, 14, 3, 4, 6, 16, 9, 10, 15, 0, 13, 18, 1, 8, 12, 5, 7, 17, 11, 19], type: <class 'list'>
running time (dankal): 1.5497207641601562e-05
equal to exist method: True

index a to b (smp55): [ 2 14  3  4  6 16  9 10 15  0 13 18  1  8 12  5  7 17 11 19], type: <class 'numpy.ndarray'>
running time (smp55): 0.00020551681518554688
equal to exist method: True

index a to b (Roman): [17 11  1  6 16 14  9  4  8  3  5 12  7  2 19 15 18 13  0 10], type: <class 'numpy.ndarray'>
running time (Roman): 0.5710980892181396
equal to exist method: False

index a to b (Alain): [2, 14, 3, 4, 6, 16, 9, 10, 15, 0, 13, 18, 1, 8, 12, 5, 7, 17, 11, 19], type: <class 'list'>
running time (Alain): 3.552436828613281e-05
equal to exist method: True

index a to b (ken): [2, 14, 3, 4, 6, 16, 9, 10, 15, 0, 13, 18, 1, 8, 12, 5, 7, 17, 11, 19], type: <class 'list'>
running time (ken): 0.00011754035949707031
equal to exist method: True
  • Running time increasing list size

If I run the code with num = 1000000

running time (dankal): 0.45094847679138184

running time (smp55): 0.36011743545532227

running time (Alain): 2.178112030029297

running time (ken): 2.663684368133545

(With Roman's method, it was difficult to check the time when the size was increased.)

The memory point of view also needs to be checked, but first of all, the @smp55 method is the fastest way to obtain a list based on the required time in replies.(I'm sure there are other good ways.)

Once again, thank you all for your attention and replies!!!

(Subsequent replies and comments are also welcome. If anyone has a good idea, it would be nice to share!)


Solution

  • I can give a specific answer that is very fast based on the fact that your first list is simply an index. If you combine them in a 2D array, you can then sort by the second list, which will put the first list (indices of the second list) in the order of the result you want:

    import numpy as np
    import random as rnd
    
    num = 100000
    
    a = list(range(num))
    b = [rnd.randint(0, num) for r in range(num)]
    
    comb = np.array([a, b]).transpose()
    atob = comb[comb[:, 1].argsort()][:,0]
    

    Takes ~0.08 seconds. Now, the first item in atob is the index in b where the first item in a appears. Second item in atob is the index in b of the second item of a, etc.