I'm looking for a way to identify FASTA files with at least 3 sequences. Each sequence is identified by a header line starting with >.
Here is an example of 5 files:
file1
>sp1
ATTTT
>sp3
ATTGG
>sp3
ATTGAGGAGA
>sp4
AGGGGAGGACC
>sp5
AGGGGGG
>sp5
AGGGGGG
file2
>sp1
ATTTT
file3
>sp1
ATTTT
>sp3
ATTGG
>sp3
ATTGAGGAGA
>sp4
AGGGGAGGACC
>sp5
AGGGGGG
file4
>sp1
ATTTT
>sp3
ATTGG
file5
>sp1
ATTTT
>sp3
ATTGG
>sp3
ATTGAGGAGA
>sp4
AGGGGAGGACC
>sp5
AGGGGGG
I want the output:
file1
file3
file5
since those are the files with at least three sequences. Can I do this with ls?
This should do the job:
grep -Hc '^>' * 2>/dev/null | awk -F':' '$2 >= 3 {print $1}'
How it works:
grep -Hc '^>' * counts the lines starting with '>' in everything in the current directory ('*').
2>/dev/null suppresses error messages, because grep-ing on * also matches directories, which causes an error.
grep outputs fileName:n, n being the number of matches found in that file.
If n is at least 3 (the $2 >= 3 part), awk displays the file name, which is the first field of the line (i.e. $1).
The -F':' part tells awk what the field separator is.
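To check the approach against the example from the question, here is a minimal sketch that rebuilds the five files in a temporary directory and runs the pipeline with a threshold of at least 3. The temporary directory and the variable names (tmp, pipeline_result, loop_result) are only illustrative. The second variant is not part of the answer above: it is an alternative that loops over regular files, which should avoid both the directory errors and any confusion from file names that themselves contain a ':'.

```shell
# Rebuild the example files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '>sp1\nATTTT\n>sp3\nATTGG\n>sp3\nATTGAGGAGA\n>sp4\nAGGGGAGGACC\n>sp5\nAGGGGGG\n>sp5\nAGGGGGG\n' > "$tmp/file1"
printf '>sp1\nATTTT\n' > "$tmp/file2"
printf '>sp1\nATTTT\n>sp3\nATTGG\n>sp3\nATTGAGGAGA\n>sp4\nAGGGGAGGACC\n>sp5\nAGGGGGG\n' > "$tmp/file3"
printf '>sp1\nATTTT\n>sp3\nATTGG\n' > "$tmp/file4"
cp "$tmp/file3" "$tmp/file5"   # file5 has the same content as file3

# Variant 1: the grep | awk pipeline, keeping files with a count of 3 or more.
pipeline_result=$(cd "$tmp" && grep -Hc '^>' * 2>/dev/null | awk -F':' '$2 >= 3 {print $1}')

# Variant 2 (assumed alternative): test each regular file on its own,
# so the file name never has to be parsed out of grep's output.
loop_result=$(
    cd "$tmp" || exit 1
    for f in *; do
        [ -f "$f" ] || continue                 # skip directories and other non-files
        n=$(grep -c '^>' < "$f" || true)        # grep -c still prints 0 when nothing matches
        if [ "$n" -ge 3 ]; then
            printf '%s\n' "$f"
        fi
    done
)

echo "$pipeline_result"
echo "$loop_result"
rm -rf "$tmp"
```

Both variants should list file1, file3 and file5 for this example. As for ls: it only lists names and has no way to look inside the files, so something like grep is needed to count the header lines.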