Search code examples
pythoniterationcounting

Fastest way to count frequencies of ordered list entries


I'm counting the occurrences of non-overlapping grouped subsequences of length i in a binary list, so for example if I have a list:
[0, 1, 0, 1, 1, 0, 0, 0, 1, 1], I want to count occurrences of [0,0] (one), [0,1] (two), [1,0] (one), [1,1] (one).

I have created a function that accomplishes this (see below). However, I would like to see if there is anything that can be done to speed up the execution time of the function. I've already got it to be pretty quick (over previous versions of the same function), and it currently takes about ~0.03 seconds for a list of length=100,000 and i=2, and about 30 seconds for a list of length=100,000,000 and i=2. (This is a seemingly linear increase in time in relation to sequence length). However, my end goal is to do this with functions for multiple values of i, with sequences of lengths near 15 billion. Which, assuming linearity holds, would take about 4.2 hours for just i=2 (a higher value of i take longer as it has to count more unique subsequences).

I unsure if there is much more speed that can be gained here(at least, while still working in python), but I am open to suggestions on how to accomplish this faster (with any method or language)?

def subseq_counter(i,l):
    """counts the frequency of unique, non-overlapping, grouped subsequences of length i in a binary list l"""
    grouped = [str(l[k:k + i]) for k in range(0, len(l), i)] 
    #groups terms into i length subsequences
    if len(grouped[len(grouped) - 1]) != len(grouped[0]):
        grouped.pop(len(grouped) - 1)
    #removes any subsequences at the end that are not of length i
    grouped_sort = sorted(grouped) 
    #necesary so as to make sure the output frequencies correlate to the ascending binary order of the subsequences
    grouped_sort_values = Counter(grouped_sort).values() 
    # counts the elements' frequency
    freq_list = list(grouped_sort_values)
    return freq_list

I know that a marginally faster execution time can be obtained by removing the grouped_sorted line, however, I need to be able to access the frequencies in correlation to the ascening binary order of the subsequences (so for i=2 that would be [0,0],[0,1],[1,0],[1,1]) and have not figured about a better way around this.


Solution

  • I don't know if is faster, but try

    
    import numpy as np
    
    # create data
    bits = np.random.randint(0, 2, 10000)
    
    
    def subseq_counter(i: int, l: np.array):
        """
        Counts the number of subsequences of length l in the array i
        """
        # the list l is reshaped as a matrix of i columns, and
        # matrix-multiplied by the binary weigts "power of 2"
        #           |  [[2**2],
        #           |   [2**1],
        #           |   [2**0]]
        #           |____________________
        # [[1,0,1], | 1*4 + 0*2 + 1*1 = 5
        #  [0,1,0], | 0*4 + 1*2 + 0*1 = 2
        #  ...,     | ....
        #  [1,1,1]] | 1*4 + 1*2 + 1*1 = 7
        iBits = l[:i*(l.size//i)].reshape(-1, i)@(2**np.arange(i-1,-1,-1).T)
    
        unique, counts = np.unique(iBits, return_counts=True)
    
        print(f"Counts for {i} bits:")
        for u, c in zip(unique, counts):
            print(f"{u:0{i}b}:{c}")
            
        return unique, counts
    
    subseq_counter(2,bits)
    subseq_counter(3,bits)
    
    
    >>> Counts for 2 bits:
    >>> 00:1264
    >>> 01:1279
    >>> 10:1237
    >>> 11:1220
    >>> Counts for 3 bits:
    >>> 000:425
    >>> 001:429
    >>> 010:411
    >>> 011:395
    >>> 100:437
    >>> 101:412
    >>> 110:407
    >>> 111:417
    

    what it does is to reshape the list into an array of n rows by i columns, and converting to integer by multiplying by 2**n, converting 00 to 0, 01 to 1, 10 to 2 and 11 to 3, then doing the counting with np.unique()