
How to count the number of occurrences of a consecutive pattern spanning multiple lines in Bash?


For example, I have a file like this. How can I count the runs of consecutive N's, where a single run may continue across line breaks?

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
CACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNGgtgtgtatatatcatgtgtgatgtgtggtgtgtg
gggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNAGaggcatattgatctgttgttttattttcttacag
ttgtggtgtgtggtgNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

The expected result is 4 because there are 4 groups of N.
I tried grep -Eozc 'N+', but the result is 1.
If possible, I would also like to see the starting line number and the length of each run of N.


Solution

  • awk '$1=$1' FS='' OFS='\n' file | uniq -c | grep -c N
    

    or

    tr -d '\r\n' < file | grep -o 'N*' | grep -c .
    

    Output (either command):

    4
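
The question also asks for the line number and length of each run, which the one-liners above do not report. A minimal sketch of one way to get that with awk follows. It is not part of the original answer: the function name `n_runs` and the tab-separated output layout are assumptions made for illustration. Like the `tr` version, it treats line breaks as invisible, so a run that continues onto the next line is counted once.

```shell
# Hypothetical helper (name and output format are assumptions):
# prints "start_line<TAB>start_col<TAB>length" for every run of N,
# followed by a final "total<TAB>count" line.
n_runs() {
  awk '
    {
      sub(/\r$/, "")                      # tolerate CRLF line endings
      len = length($0)
      for (i = 1; i <= len; i++) {
        if (substr($0, i, 1) == "N") {
          if (run == 0) {                 # a new run begins here
            start_line = NR
            start_col = i
          }
          run++
        } else if (run > 0) {             # a non-N character closes the run
          printf "%d\t%d\t%d\n", start_line, start_col, run
          run = 0
          count++
        }
      }
    }
    END {
      if (run > 0) {                      # run still open at end of input
        printf "%d\t%d\t%d\n", start_line, start_col, run
        count++
      }
      printf "total\t%d\n", count
    }
  ' "$@"
}
```

Calling `n_runs file` prints one row per run and then the total, which should agree with the count from the commands above. The match is case-sensitive, so lowercase `n` is not counted; change the comparison if soft-masked bases should be included.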