Search code examples
pythonnumpyscipysignal-processing

Python find min and max in realtime stream without full array


I have a continuous stream of values coming in, many millions of records. I need to find the minimum and maximum in that has arrived so far in real-time as numbers keep coming in. The whole data array is not available. Data arrived is not stored. Min max range is also unknown.

I tried something like this, but it isn't working perfectly. Is there a better way to solve these using libraries, numpy, scipy?

import numpy as np
rng = np.random.default_rng()

test = rng.choice(np.arange(-100,100, dtype=int), 10, replace=False)
testmax = 0
testmin = 0
for i in test: #simulates a stream
    if i < testmax:
        testmin = i
    if i > testmax:
        testmax = i
    if i < testmin:
        testmin = i


print (test, 'min: ',testmin, 'max: ', testmax)


>>> print (test, 'min: ',testmin, 'max: ', testmax)
[ 39 -32  61 -18 -53 -57 -69  98 -88 -47] min:  -47 max:  98 #should be -88 and 98
>>>
>>> print (test, 'min: ',testmin, 'max: ', testmax)
[-65 -53   1   2  26 -62  82  70  39 -44] min:  -44 max:  82 #should be -65 and 82
>>> 

Solution

  • The mistake (typo) was pointed out in the comments, but you only need two comparisons--this can be done using the ternary operator. You should also initialize the max to be negative infinity and the min to be positive infinity. This helps avoid cases where, for example, you set min to 0 but the smallest number actually seen in the stream is greater than 0.

    import numpy as np
    
    rng = np.random.default_rng(42)
    
    stream_min = -100
    stream_max = 100
    test = rng.choice(np.arange(stream_min, stream_max+1, dtype=int),
                      10,
                      replace=False)
    
    testmax = -float("inf")
    testmin = float("inf")
    
    # simulates a stream
    for i in test:
        testmax = i if i > testmax else testmax
        testmin = i if i < testmin else testmin
    
    
    print (test, "min: ", testmin, "max: ", testmax)
    # [ 97  49 -83  26 -15 -16  38 -82 -60  69] min:  -83 max:  97
    

    Why the ternary operator over using min/max? Well, it's faster.

    stream_min = -1000
    stream_max = 1000
    test = rng.choice(np.arange(stream_min, stream_max+1, dtype=int),
                      500,
                      replace=False)
    
    def ternary():
        testmax = -float("inf")
        testmin = float("inf")
    
        for i in test:
            testmax = i if i > testmax else testmax
            testmin = i if i < testmin else testmin
    
        return testmin, testmax
    
    def plainif():
        testmax = -float("inf")
        testmin = float("inf")
    
        for i in test:
            if i > testmax:
                testmax = i
            if i < testmin:
                testmin = i
    
        return testmin, testmax
    
    def minmax():
        testmax = -float("inf")
        testmin = float("inf")
    
        for i in test:
            testmax = max(i, testmax)
            testmin = min(i, testmax)
    
        return testmin, testmax
    
    %timeit ternary() 
    55.4 µs ± 3.26 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    %timeit plainif()
    50.6 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    %timeit minmax()
    170 µs ± 6.01 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    

    Using an if statement vs the ternary operator is nearly equivalent (the if is a hair faster).