Tags: bash, parsing, sed, fasta

Slow bash script to execute sed expression on each line of an input file


I have a simple bash script as follows

#!/bin/bash
#This script reads a file of row identifiers separated by new lines 
# and outputs all query FASTA sequences whose headers contain that identifier.
# usage: filter_fasta_on_ids.sh fasta_to_filter.fa < seq_ids.txt > filtered.fa


while read -r SEQID; do
    sed -n -e "/$SEQID/,/>/ p" "$1" | head -n -1
done

A fasta file has the following format:

>HeadER23217;count=1342
ACTGTGCCCCGTGTAA
CGTTTGTCCACATACC
>ANotherName;count=3221
GGGTACAGACCTACAC
CAACTAGGGGACCAAT

Edit: changed header names to better show their actual structure in the files.

The script above does filter the file correctly, but it is very slow. My input file has ~20,000,000 lines containing ~4,000,000 sequences, and I have a list of 80,000 headers that I want to filter on. Is there a faster way to do this using bash/sed or other tools (like Python or Perl)? Any ideas why the script above is taking hours to complete?


Solution

  • You're scanning the large file 80k times. I'll suggest a different approach with a different tool: awk. Load the selection list into a hashmap (awk array), then scan the large file once and print every record whose header is in the map.

    For example

    $ awk -F"\n" -v RS=">" 'NR==FNR{for(i=1;i<=NF;i++) a["Sequence ID " $i]; next}
                            $1 in a' headers fasta         
    

    The -F"\n" option sets the field separator to a newline, so each line of a record becomes its own field; -v RS=">" sets the record separator to ">", so each FASTA entry is read as one record whose first field is the header.

    Sequence ID 1
    ACTGTGCCCCGTGTAA
    CGTTTGTCCACATACC
    
    Sequence ID 4
    GGGTACAGACCTACAT
    CAACTAGGGGACCAAT
    

    The headers file contains:

    $ cat headers
    1
    4
    

    and the fasta file includes some more records in the same format.

    If your headers already include the "Sequence ID" prefix, adjust the code accordingly. I didn't test this on large files, but it should be dramatically faster than your script as long as you don't have memory restrictions preventing an 80K-entry array. In that case, splitting the headers into multiple chunks and combining the results would be trivial.
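    If memory did become a constraint, the chunked run could look something like the sketch below. The file names, chunk size, and sample data are all illustrative, and the awk body re-attaches the ">" that the record separator strips:

```shell
# Sample inputs (illustrative; the real `headers` file would have ~80k lines)
printf '>H1\nAAA\n>H2\nCCC\n>H3\nGGG\n' > fasta
printf 'H1\nH3\n' > headers

# Split the header list into fixed-size chunks (1 line per chunk here,
# purely for demonstration; tens of thousands per chunk would be realistic)
split -l 1 headers chunk_

# Run the single-pass awk filter once per chunk, concatenating the results;
# the if($i!="") guard skips the empty field left by the trailing newline
for c in chunk_*; do
    awk -F"\n" -v RS=">" 'NR==FNR{for(i=1;i<=NF;i++) if($i!="") a[$i]; next}
                          $1 in a {printf ">%s", $0}' "$c" fasta
done > filtered.fa
rm chunk_*
```

    Note that the output is grouped by chunk, so record order follows the order of the header chunks rather than the order of the original fasta file.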

    To allow any format of header and to have the resulting file be a valid FASTA file, you can use the following command:

    awk -F"\n" -v RS=">" 'NR==FNR{for(i=1;i<=NF;i++) if($i!="") a[$i]; next}
                          $1 in a {printf ">%s", $0}' headers fasta > out
    

    Here printf restores the leading ">" that the record separator consumed, so each matching record is written out as a valid FASTA entry. The if($i!="") guard skips the empty field created by the trailing newline in headers; without it, the empty record at the very start of fasta would also match and emit a stray ">".
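    As a quick sanity check, this one-pass idea can be run on the sample records from the question (file names here are illustrative; printf re-attaches the ">" that the record separator strips):

```shell
# Recreate the sample FASTA and a one-line header list from the question
cat > fasta <<'EOF'
>HeadER23217;count=1342
ACTGTGCCCCGTGTAA
CGTTTGTCCACATACC
>ANotherName;count=3221
GGGTACAGACCTACAC
CAACTAGGGGACCAAT
EOF
printf 'HeadER23217;count=1342\n' > headers

# One pass over fasta: load headers into an array, print matching records
awk -F"\n" -v RS=">" 'NR==FNR{for(i=1;i<=NF;i++) if($i!="") a[$i]; next}
                      $1 in a {printf ">%s", $0}' headers fasta
# prints only the first record:
# >HeadER23217;count=1342
# ACTGTGCCCCGTGTAA
# CGTTTGTCCACATACC
```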