
Finding fixed-length contiguous regions of a NaN-filled array (no overlap)


I've found similar questions posted here, but none that apply to row-defined time-series data. I'm anticipating the solution might be found via NumPy or SciPy. Because I have so much data, I'd prefer not to use pandas DataFrames.

I have many runs of 19-channel EEG data stored in 2d numpy arrays. I've gone through and marked noisy data as nan, so a given run might look something like:

C1  C2  C3  C4  C5  C6  C7  C8  C9  C10  C11  C12  C13  C14  C15  C16  C17  C18  C19
nan 7   5   4   nan nan 7   9   0   -3   nan  2    nan  nan  5    7    6    nan  8
0   6   7   3   5   9   2   2   4   6    8    7    5    6    4    -1   nan  -8   -9
6   8   7   7   0   3   2   4   5   1    3    7    3    8    4    6    9    0    0
...
nan nan nan 3   5   -1  0   nan nan nan  1    2    0    -1   -2   nan  nan  nan  nan

(without channel labels)

Each run is between 80,000 and 120,000 rows (cycles) long.

For each of these runs, I want to create a new stack of contiguous non-overlapping epochs where no values were artifacted to nan. Something like:

def generate_contigs(run, length):
    contigs = []  # will hold arbitrarily many (length x 19) blocks
    count = 0
    for i, row in enumerate(run):
        # `nan in row` doesn't work for NaN, so test with np.isnan
        if not np.any(np.isnan(row)):
            count += 1
            if count == length:
                # stack the last `length` rows onto the output
                contigs.append(run[i - length + 1:i + 1, :])
                count = 0
        else:
            count = 0
    return np.array(contigs)

Say, for example, that I specified length 4 (arbitrarily small), and that my function found 9 non-overlapping contigs where no value for 4 straight rows was nan.

My output should look something like:

contigs = [
[4x19 array],
[4x19 array],
[4x19 array],
[4x19 array],
[4x19 array],
[4x19 array],
[4x19 array],
[4x19 array],
[4x19 array]
]

Where each element in the output stack resembles the following:

[4 6 5 8 3 5 4 1 8 8 7 5 6 4 3 5 6 6 5]  
[5 5 7 2 2 9 8 7 7 8 3 0 7 4 4 6 3 7 3]  
[4 4 6 7 9 0 9 9 8 8 7 7 6 6 5 5 4 4 3]  
[1 2 3 4 5 4 3 6 5 4 3 7 6 5 8 7 6 9 8]

Where the 4 rows contained in that element appeared contiguously in the original run's data array.

I feel like I'm pretty close here, but I'm struggling with the row operations and minimizing iteration. Bonus points if you can find a way to attach the start/stop row indices as a tuple for later analysis.


Solution

  • You could use NumPy slicing to roll a window over the array and check whether each length x 19 selection contains any NaN value, using numpy.isnan and numpy.any.
    If the selection is free of NaNs, append it to the contigs list and jump ahead by length rows (which guarantees no overlap); if it does contain a NaN, advance the index by 1 and check the next selection.
    Along the way it is easy to record the index of the first row of each stacked selection.

    def generate_contigs(run, length):
        i = 0
        contigs = []
        startindexes = []
        # use <= so the final window ending at the last row is also checked
        while i <= run.shape[0] - length:
            stk = run[i:(i + length), :]
            if not np.any(np.isnan(stk)):
                # clean window: keep it and jump ahead to avoid overlap
                contigs.append(stk)
                startindexes.append(i)
                i += length
            else:
                # window contains a NaN: slide forward by one row
                i += 1
        return contigs, startindexes
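
    As a quick sanity check (not part of the original answer), here is the function applied to a small synthetic run, using 3 channels instead of 19 and NaNs planted in two rows:

    ```python
    import numpy as np

    def generate_contigs(run, length):
        i = 0
        contigs = []
        startindexes = []
        while i <= run.shape[0] - length:
            stk = run[i:(i + length), :]
            if not np.any(np.isnan(stk)):
                contigs.append(stk)
                startindexes.append(i)
                i += length
            else:
                i += 1
        return contigs, startindexes

    # Synthetic run: 8 cycles x 3 channels, with NaNs in rows 2 and 5
    run = np.arange(24, dtype=float).reshape(8, 3)
    run[2, 1] = np.nan
    run[5, 0] = np.nan

    contigs, starts = generate_contigs(run, length=2)
    print(starts)                    # → [0, 3, 6]
    print([c.shape for c in contigs])  # each contig is (length, channels)
    ```

    Rows 2 and 5 break the run into three clean 2-row windows starting at rows 0, 3, and 6; if you need start/stop pairs for later analysis, `(i, i + length)` tuples can be stored in place of the bare start index.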