Tags: python, list, numpy, vectorization, nested-loops

How to extract values from a python list or numpy array using conditional checks by the application of numpy vectorization?


I have the following code, and I want to extract certain values from other lists depending on a given condition. But my data sets are huge (~1 million values in each list), so this nested-loop approach takes too long. Is there a vectorized or faster NumPy approach I can use to speed up my code and reduce memory usage?

import random
import numpy as np

x=[random.randrange(0,10) for _ in range(0,100)]
y=[random.randrange(0,10) for _ in range(0,100)]
z=[random.randrange(0,10) for _ in range(0,100)]

x_unique=np.unique(x)

xx_list=[]
y_list=[]
z_list=[]

for i in range(len(x_unique)):
    xx_list.append([])
    y_list.append([])
    z_list.append([])

for ii, i in enumerate(x_unique):
    for j, k in enumerate(x):
        if i == k:
            xx_list[ii].append(x[j])
            y_list[ii].append(y[j])
            z_list[ii].append(z[j])

[EDIT: added an example of what to expect]

In y_list and z_list, I want to store the values of y and z that sit at the same indices as the values collected into xx_list.

For example consider the following example lists:

x = [0.1,0.1,1,0.1,2,1,0.1]
y = [1.1,2.1,3,4,5,6,7]
z = [10,11,12,13.1,14,15,16]

Therefore, x_unique is the following:

x_unique = [0.1,1,2]

xx_list, y_list and z_list should contain the following:

xx_list = [[0.1,0.1,0.1,0.1],[1,1],[2]]
y_list = [[1.1,2.1,4,7],[3,6],[5]]
z_list = [[10,11,13.1,16],[12,15],[14]]
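For reference, this is what the nested-loop approach above produces on these example lists (a self-contained check, using sorted(set(x)) in place of np.unique so it runs without NumPy; both give the same unique values here):

```python
x = [0.1, 0.1, 1, 0.1, 2, 1, 0.1]
y = [1.1, 2.1, 3, 4, 5, 6, 7]
z = [10, 11, 12, 13.1, 14, 15, 16]

# sorted(set(x)) returns the same unique values as np.unique(x) here:
x_unique = sorted(set(x))

xx_list = [[] for _ in x_unique]
y_list = [[] for _ in x_unique]
z_list = [[] for _ in x_unique]

# For each unique value, collect the matching entries of x, y and z:
for ii, i in enumerate(x_unique):
    for j, k in enumerate(x):
        if i == k:
            xx_list[ii].append(x[j])
            y_list[ii].append(y[j])
            z_list[ii].append(z[j])

print(xx_list)  # [[0.1, 0.1, 0.1, 0.1], [1, 1], [2]]
print(y_list)   # [[1.1, 2.1, 4, 7], [3, 6], [5]]
print(z_list)   # [[10, 11, 13.1, 16], [12, 15], [14]]
```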

Solution

  • I found a solution that takes roughly 400 ms for 1M items when working on Python lists, and one that takes about 100 ms when working on NumPy arrays.

    Python

    The strategy I use is to build one dictionary per input list (x, y, z). Each acts as a set of labeled bins: for each input list, bin i collects the items whose corresponding value in x is the i-th smallest unique value of x. Corresponding means that they are at the same position in their respective lists.

    from collections import defaultdict

    def compute_bins(x, y, z):
        # Map each unique value of x to its rank (acts as an ordered set):
        x_bin_indexes = {a: i for i, a in enumerate(sorted(set(x)))}

        # Each input list has its own set of labeled bins:
        x_bins = defaultdict(list)
        y_bins = defaultdict(list)
        z_bins = defaultdict(list)

        for item_x, item_y, item_z in zip(x, y, z):
            index = x_bin_indexes[item_x]
            # Drop the items in the corresponding bin:
            x_bins[index].append(item_x)
            y_bins[index].append(item_y)
            z_bins[index].append(item_z)

        # Transform the result back to lists of lists, in ascending order of
        # the unique values of x (the dicts' insertion order follows first
        # occurrence in x, which is not necessarily sorted):
        indexes = range(len(x_bin_indexes))
        x_bins = [x_bins[i] for i in indexes]
        y_bins = [y_bins[i] for i in indexes]
        z_bins = [z_bins[i] for i in indexes]
        return x_bins, y_bins, z_bins
    

    The key factor here is that every operation we make in the loop runs in constant time. Using the example lists from the question, the function can be called this way:

    >>> xx_list, y_list, z_list = compute_bins(x, y, z)
    >>> xx_list
    [[0.1, 0.1, 0.1, 0.1], [1, 1], [2]]
    >>> y_list
    [[1.1, 2.1, 4, 7], [3, 6], [5]]
    >>> z_list
    [[10, 11, 13.1, 16], [12, 15], [14]]
    

    Numpy

    Using NumPy, I thought of a different strategy: sort all arrays according to the items in x, then split them at the boundaries between runs of consecutive identical values in x. Here is the code (assuming that x, y and z are NumPy arrays):

    import numpy as np
    
    def compute_bins(x, *others):
        # Count how many times each unique value occurs in x:
        _, bin_sizes = np.unique(x, return_counts=True)
        # A stable sort preserves the original order within each bin:
        sort_order = np.argsort(x, kind="stable")
        split_rule = np.cumsum(bin_sizes)[:-1]
        return tuple(np.split(o[sort_order], split_rule) for o in (x, *others))
    

    Note that np.cumsum(bin_sizes)[:-1] is only there because np.split takes a list of indices at which to cut, not a list of cut sizes. If we want to split [0, 0, 0, 1, 1, 2] into [[0, 0, 0], [1, 1], [2]], we don't pass [3, 2, 1] to np.split, but [3, 5] instead.
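    A minimal illustration of that cut-point convention:

```python
import numpy as np

a = np.array([0, 0, 0, 1, 1, 2])
# Cut before index 3 and before index 5:
parts = np.split(a, [3, 5])
print([p.tolist() for p in parts])  # [[0, 0, 0], [1, 1], [2]]
```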

    Performance

    Regarding performance, here is how it goes on my machine:

    from random import randint
    
    test_size = int(1e6)
    x = [randint(0, 100) for _ in range(test_size)]
    y = [i+1 for i in range(test_size)]
    z = [i+test_size+1 for i in range(test_size)]
    
    %timeit xx_list, y_list, z_list = compute_bins(x, y, z)
    

    Output for the pure python version:

    396 ms ± 5.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Output for the numpy version (x, y and z are np.array):

    105 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    For comparison, the solution you first proposed gives:

    19.7 s ± 282 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
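As a sanity check, here is a self-contained sketch (repeating the NumPy version, with a stable sort so the original order within each bin is preserved) showing that it reproduces the expected output from the question; note that NumPy arrays promote everything to float64, so the results print as floats:

```python
import numpy as np

def compute_bins(x, *others):
    # Count how many times each unique value occurs in x:
    _, bin_sizes = np.unique(x, return_counts=True)
    # A stable sort preserves the original order within each bin:
    sort_order = np.argsort(x, kind="stable")
    split_rule = np.cumsum(bin_sizes)[:-1]
    return tuple(np.split(o[sort_order], split_rule) for o in (x, *others))

x = np.array([0.1, 0.1, 1, 0.1, 2, 1, 0.1])
y = np.array([1.1, 2.1, 3, 4, 5, 6, 7])
z = np.array([10, 11, 12, 13.1, 14, 15, 16])

xx_list, y_list, z_list = compute_bins(x, y, z)
print([b.tolist() for b in xx_list])  # [[0.1, 0.1, 0.1, 0.1], [1.0, 1.0], [2.0]]
print([b.tolist() for b in y_list])   # [[1.1, 2.1, 4.0, 7.0], [3.0, 6.0], [5.0]]
print([b.tolist() for b in z_list])   # [[10.0, 11.0, 13.1, 16.0], [12.0, 15.0], [14.0]]
```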