Search code examples
command-lineformattingbioinformaticsfasta

Deduplicate FASTA, keep a seq id


I need to format files for a miRNA-identifying tool (miREAP).

I have a fasta file in the following format:

>seqID_1
CCCGGCCGTCGAGGC
>seqID_2
AGGGCACGCCTGCCTGGGCGTCACGC
>seqID_3
CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA
>seqID_4
CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA
>seqID_5
CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA
>seqID_6
AGGGCACGCCTGCCTGGGCGTCACGC

I want to count the number of times each sequence occurs and append that number to the seqID line. The count for each sequence and an original ID referring to the sequence need only appear once in the file like this:

>seqID_1 1
CCCGGCCGTCGAGGC
>seqID_2 2
AGGGCACGCCTGCCTGGGCGTCACGC
>seqID_3 3
CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA

Fastx_collapser does the trick nearly as I'd like (http://hannonlab.cshl.edu/fastx_toolkit/index.html). However, rather than maintain seqIDs, it returns:

>1 1
CCCGGCCGTCGAGGC
>2 2
AGGGCACGCCTGCCTGGGCGTCACGC
>3 3
CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA

This means that the link between my sequence, seqID, and genome mapping location is lost. (Each seqID corresponds to a sequence in my fasta file and a genome mapping spot in a separate Bowtie2-generated .sam file)

Is there a simple way to do the desired deduplication at the command line?

Thanks!


Solution

  • linearize and sort/uniq -c

     awk '/^>/ {if(N>0) printf("\n"); ++N; printf("%s ",$0);next;} {printf("%s",$0);} END { printf("\n");}'  input.fa  | \
      sort -t ' ' -k2,2 | uniq -f 1 -c |\
      awk '{printf("%s_%s\n%s\n",$2,$1,$3);}'
    
    >seqID_2_2
    AGGGCACGCCTGCCTGGGCGTCACGC
    >seqID_1_1
    CCCGGCCGTCGAGGC
    >seqID_3_3
    CCGCATCAGGTCTCCAAGGTGAACAGCCTCTGGTCGA