Search code examples
bioinformaticsbiopythonbiological-neural-network

How can i eliminate duplicated sequences in fasta file


I'm trying to build database bacteria genre using all the sequences published to calculate the coverage of my reads against this database using bowtie2 for mapping, for that, I merge all the genomes sequences I downloaded from ncbi in one fasta_library ( i merge 74 files in on fasta file ), the problem is that in this fasta file (the library I created ) I have a lot of duplicated sequences, and that affected the coverage in a big way, so I'm asking if there's any way to eliminate duplication I have in my Library_File, or if there's any way to merge the sequences without having the duplication, or also if there's any other way to calculate the coverage of my reads against reference sequences

I hope I'm clear enough, please tell me if there's anything not clear.


Solution

  • If you have control over your setup, then you could install seqkit and run the following on your FASTA file:

    $ seqkit rmdup -s < in.fa > out.fa
    

    If you have multiple files, you can concatenate them and feed them in as standard input:

    $ seqkit rmdup -s < <(cat inA.fa ... inN.fa) > out.fa
    

    The rmdup option removes duplicates, and the -s option calls duplicates on the basis of sequence, ignoring differences in headers. I'm not sure which header is kept in the output, but that may be something to think about.

    To avoid third-party dependencies and understand how dups are being removed, one can use awk.

    The idea is to read all FASTA records one by one into an associative array (or hash table, also called a "dictionary" in Python), only if the sequence is not already in the array.

    For example, starting with a single-line FASTA file in.fa that looks like this:

    >test1
    ATAT
    >test2
    CGCG
    >test3
    ATAT
    >test4
    GCCT
    

    We can remove duplicates, preserving the first header, like so:

    $ awk 'BEGIN {i = 1;} { if ($1 ~ /^>/) { tmp = h[i]; h[i] = $1; } else if (!a[$1]) { s[i] = $1; a[$1] = "1"; i++; } else { h[i] = tmp; } } END { for (j = 1; j < i; j++) { print h[j]; print s[j]; } }' < in.fa > out.fa
    $ cat out.fa
    >test1
    ATAT
    >test2
    CGCG
    >test4
    GCCT
    

    It requires a little knowledge about awk if you need modifications. This approach also depends on how your FASTA files are structured (records with sequences on one line or multiple lines, etc.), though it is usually pretty easy to modify FASTA files into the above structure (one line each for header and sequence).

    Any hash table approach also uses a fair bit of memory (I imagine that seqkit probably makes the same compromise for this particular task, but I haven't looked at the source). This could be an issue for very large FASTA files.

    It's probably better to use seqkit if you have a local environment on which you can install software. If you have an IT-locked-down setup, then awk would work for this task, as well, as it comes with most Unixes out of the box.