I have a file with the following structure (see below). I need to match every ">Cluster" line and, for each one, count the number of lines until the next ">Cluster" line, and so on until the end of the file.
>Cluster 0
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
0 3798nt, >CL3196.Contig1_All... at +/97.63%
1 9084nt, >CL3196.Contig3_All... *
>Cluster 2
0 8710nt, >Unigene21841_All... *
>Cluster 3
0 8457nt, >Unigene10299_All... *
The desired output should look like this:
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1
I tried awk as shown below, but it only gives me the line numbers.
awk '{print FNR "\t" $0}' All-Unigene_Clustered.fa.clstr | head - 20
==> standard input <==
1 >Cluster 0
2 0 10565nt, >CL9602.Contig1_All... *
3 1 1331nt, >CL9602.Contig2_All... at -/98.05%
4 >Cluster 1
5 0 3798nt, >CL3196.Contig1_All... at +/97.63%
6 1 9084nt, >CL3196.Contig3_All... *
7 >Cluster 2
8 0 8710nt, >Unigene21841_All... *
9 >Cluster 3
10 0 8457nt, >Unigene10299_All... *
I also tried sed, but it only prints the lines, and it even omits some of them.
sed -n -e '/>Cluster/,/>Cluster/ p' All-Unigene_Clustered.fa.clstr | head
>Cluster 0
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
>Cluster 2
0 8710nt, >Unigene21841_All... *
>Cluster 3
>Cluster 4
0 1518nt, >CL2313.Contig1_All... at -/95.13%
1 8323nt, >CL2313.Contig8_All... *
In addition, I tried awk and sed in combination with 'wc', but that only gives me the total number of occurrences of the matched string.
I thought about extracting the lines that do not match the string '>Cluster' using the -v option of grep, then extracting every line that matches '>Cluster', and combining both in a new file, e.g.:
grep -vw '>Cluster' All-Unigene_Clustered.fa.clstr | head
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
0 3798nt, >CL3196.Contig1_All... at +/97.63%
1 9084nt, >CL3196.Contig3_All... *
0 8710nt, >Unigene21841_All... *
0 8457nt, >Unigene10299_All... *
0 1518nt, >CL2313.Contig1_All... at -/95.13%
grep -w '>Cluster' All-Unigene_Clustered.fa.clstr | head
>Cluster 0
>Cluster 1
>Cluster 2
>Cluster 3
>Cluster 4
The problem is that the number of lines following each '>Cluster' isn't constant: each '>Cluster' line is followed by 1, 2, 3 or more lines before the next one occurs.
I decided to post my question after searching extensively through previously answered questions, but I couldn't find a helpful answer.
Thanks
Could you please try the following.
awk '
/^>Cluster/{
if(count){
print prev,count
}
sub(/^>/,"")
prev=$0
count=""
next
}
{
count++
}
END{
if(count && prev){
print prev,count
}
}
' Input_file
Explanation: a commented version of the code above.
awk ' ##Start the awk program.
/^>Cluster/{ ##If the line starts with >Cluster, do the following.
if(count){ ##If count is non-zero (the previous cluster had member lines), do the following.
print prev,count ##Print the previous cluster name and its line count.
} ##End of the if block.
sub(/^>/,"") ##Remove the leading > from the current line.
prev=$0 ##Store the current line (the cluster name) in prev.
count="" ##Reset count for the new cluster.
next ##Skip the remaining rules and move on to the next input line.
} ##End of the >Cluster block.
{ ##This block runs for every other line.
count++ ##Increment count for each line that belongs to the current cluster.
}
END{ ##The END block runs after the last line has been read.
if(count && prev){ ##If both count and prev are set, do the following.
print prev,count ##Print the last cluster name and its line count.
} ##End of the if block.
} ##End of the END block.
' Input_file ##Name of the input file.
Output will be as follows.
Cluster 0 2
Cluster 1 2
Cluster 2 1
Cluster 3 1
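Note that the code above prints a cluster only when count is non-zero, so a ">Cluster" header with no member lines would be dropped silently. If such empty clusters can occur and should be reported with a count of 0, a slightly different version may help. This is only a sketch: the file name sample.clstr and the variable names name, n and result are made up for illustration, and the sample data is copied from the question.

```shell
# Hypothetical sample file mirroring the structure shown in the question.
cat > sample.clstr <<'EOF'
>Cluster 0
0 10565nt, >CL9602.Contig1_All... *
1 1331nt, >CL9602.Contig2_All... at -/98.05%
>Cluster 1
0 3798nt, >CL3196.Contig1_All... at +/97.63%
1 9084nt, >CL3196.Contig3_All... *
>Cluster 2
0 8710nt, >Unigene21841_All... *
>Cluster 3
0 8457nt, >Unigene10299_All... *
EOF

result=$(awk '
  /^>Cluster/ {
    if (name != "") print name, n   # flush the previous cluster, even if n is 0
    name = substr($0, 2)            # drop the leading ">"
    n = 0                           # start counting for the new cluster
    next
  }
  { n++ }                           # every other line counts toward the current cluster
  END { if (name != "") print name, n }   # flush the last cluster
' sample.clstr)

printf '%s\n' "$result"
```

The test on name (rather than on the count) is what lets a cluster with zero lines still be printed. Blank lines, if the file contains any, are counted like any other line in both versions.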