regex shell text-processing ls fasta

Determine which files have at least a particular number of lines matching a pattern


I'm looking for a way to identify FASTA files with at least 3 sequences. Each sequence is introduced by a line starting with >.

Here is an example of 5 files:

file1

>sp1
ATTTT
>sp3
ATTGG
>sp3
ATTGAGGAGA
>sp4
AGGGGAGGACC
>sp5
AGGGGGG
>sp5
AGGGGGG

file2

>sp1
ATTTT

file3

>sp1
ATTTT
>sp3
ATTGG
>sp3
ATTGAGGAGA
>sp4
AGGGGAGGACC
>sp5
AGGGGGG

file4

>sp1
ATTTT
>sp3
ATTGG

file5

>sp1
ATTTT
>sp3
ATTGG
>sp3
ATTGAGGAGA
>sp4
AGGGGAGGACC
>sp5
AGGGGGG

I want the output:

file1
file3
file5

since those are the files with at least three sequences. Can I do this with ls?


Solution

  • This should do the job:

    grep -Hc '^>' * 2>/dev/null | awk -F':' '$2 >= 3 {print $1}'
    

    How it works:

    • grep -Hc '^>' * counts, for every entry in the current directory (the *), the lines that start with >
    • 2>/dev/null suppresses error messages: * also matches directories, and grep reports an error for each of them
    • for every file, including those with no matching line, grep prints fileName:n, where n is the number of matching lines
    • awk then reads the second field of every line ($2) and, if it is at least 3 (the $2 >= 3 part), prints the file name, which is the first field ($1)
    • -F':' tells awk to use : as the field separator
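
    One caveat: because awk splits on :, a file name that itself contains a colon would put the wrong value in $2. If that matters, the same idea can be written as a plain loop. This is only a sketch, not a drop-in replacement: the function name list_fasta_min and its argument order are made up for illustration.

    ```shell
    #!/bin/sh
    # Hypothetical helper (name and interface are assumptions):
    # print every regular file that has at least MIN lines starting with '>'.
    # Usage: list_fasta_min MIN FILE...
    list_fasta_min() {
        min=$1
        shift
        for f in "$@"; do
            # Skip directories and anything else that is not a regular file.
            [ -f "$f" ] || continue
            # With a single file argument, grep -c prints only the count.
            # grep exits non-zero when the count is 0 or on error, so fall
            # back to 0 if nothing usable was printed.
            n=$(grep -c '^>' "$f" 2>/dev/null) || n=${n:-0}
            if [ "$n" -ge "$min" ]; then
                printf '%s\n' "$f"
            fi
        done
    }
    ```

    Called as list_fasta_min 3 * in the directory holding the five example files, this should list file1, file3 and file5, since they contain 6, 5 and 5 header lines respectively.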