python, python-3.x, moving-average, sliding-window, file-traversal

Implement sliding window on file lines in Python


I'm trying to implement a sliding (moving) window over the lines of a CSV file using Python. Each line has a column holding a binary value, yes or no. Basically, I want to remove rare yes noise. That is, if a window of 5 lines contains at least 3 yes lines (up to all 5), keep them. But if it contains only 1 or 2, change them to no. How can I do that?

For instance, both yes values below should become no.

...
1,a1,b1,no,0.75
2,a2,b2,no,0.45
3,a3,b3,yes,0.98
4,a4,b4,yes,0.22
5,a5,b5,no,0.46
6,a6,b6,no,0.20
...

But in the following, we keep them as is (there is a window of 5 in which 3 lines are yes):

...
1,a1,b1,no,0.75
2,a2,b2,no,0.45
3,a3,b3,yes,0.98
4,a4,b4,yes,0.22
5,a5,b5,no,0.46
6,a6,b6,yes,0.20
...

I attempted writing something, having a window of 5, but got stuck (it is not complete):

        window_size = 5 
        filename='C:\\Users\\username\\v3\\And-'+v3file.split("\\")[5]
        with open(filename) as fin:
            with open('C:\\Users\\username\\v4\\And2-'+v3file.split("\\")[5],'w') as finalout:
                line= fin.readline()
                index = 0
                sequence= []
                accs=[]
                while line:
                    print(line)
                    for i in range(window_size):
                        line = fin.readline()
                        sequence.append(line)
                    index = index + 1
                    fin.seek(index)

Solution

  • You can use collections.deque with the maxlen argument set to the desired window size to implement a sliding window that tracks the yes/no flags of the most recent 5 rows. To be more efficient, keep a running count of yeses instead of summing the yeses in the window on every iteration. Whenever the window is full and the count of yeses is greater than 2, add the line indices of those yeses to a set of lines whose yeses should be kept as-is. Then, in a second pass after resetting the file pointer of the input, change yes to no on every line whose index is not in the set:

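    from collections import deque
    
    window_size = 5
    # filename and output_filename are assumed to be defined beforehand
    with open(filename) as fin, open(output_filename, 'w') as finalout:
        yeses = 0  # running count of yes flags in the current window
        window = deque(maxlen=window_size)
        preserved = set()  # indices of lines whose yes is kept
        # first pass: find the yes lines that sit in a dense enough window
        for index, line in enumerate(fin):
            # note: this matches 'yes' anywhere in the line, not only in the flag column
            window.append('yes' in line)
            if window[-1]:
                yeses += 1
            if len(window) == window_size:
                if yeses > 2:
                    preserved.update(i for i, f in enumerate(window, index - window_size + 1) if f)
                # the oldest flag is evicted by the next append, so uncount it now
                if window[0]:
                    yeses -= 1
        # second pass: rewrite the file, turning unsupported yeses into no
        fin.seek(0)
        for index, line in enumerate(fin):
            if index not in preserved:
                line = line.replace('yes', 'no')
            finalout.write(line)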
    

    Demo: https://repl.it/@blhsing/StripedCleanCopyrightinfringement
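
    As a rough sketch of the same idea on parsed rows rather than raw lines, the version below compares only the flag column, so a stray "yes" in another field is neither counted nor rewritten. The function name suppress_rare_yes and the parameters flag_col and min_yes are made up for illustration, and flag_col=3 assumes the flag is the fourth column, as in the sample data from the question:

```python
import csv
import io
from collections import deque


def suppress_rare_yes(rows, flag_col=3, window_size=5, min_yes=3):
    """Return a copy of rows where poorly supported 'yes' flags become 'no'.

    A 'yes' is kept only if some run of window_size consecutive rows
    that contains it holds at least min_yes 'yes' flags.
    """
    rows = [list(r) for r in rows]      # copy so the input is not mutated
    window = deque(maxlen=window_size)  # flags of the most recent rows
    yeses = 0                           # running count of True flags in window
    preserved = set()                   # row indices whose 'yes' is kept

    for index, row in enumerate(rows):
        # The oldest flag is evicted by the append below, so uncount it first.
        if len(window) == window_size and window[0]:
            yeses -= 1
        window.append(row[flag_col] == 'yes')
        if window[-1]:
            yeses += 1
        if len(window) == window_size and yeses >= min_yes:
            start = index - window_size + 1
            preserved.update(i for i, f in enumerate(window, start) if f)

    for index, row in enumerate(rows):
        if row[flag_col] == 'yes' and index not in preserved:
            row[flag_col] = 'no'
    return rows


# First sample from the question: only two yeses fit in any window of 5.
sample = """1,a1,b1,no,0.75
2,a2,b2,no,0.45
3,a3,b3,yes,0.98
4,a4,b4,yes,0.22
5,a5,b5,no,0.46
6,a6,b6,no,0.20
"""
cleaned = suppress_rare_yes(csv.reader(io.StringIO(sample)))
for row in cleaned:
    print(','.join(row))
```

    With the first sample from the question, both yes values come out as no. If row 6 is changed to yes, as in the second sample, rows 2 to 6 form a window with three yeses and all three are kept. Like the two-pass file version, this never sees a full window when the input has fewer than window_size rows, so every yes in such a short input is turned into no.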