Search code examples
stringbashsedawkfasta

Extract rows and substrings from one file conditional on information of another file


I have a file 1.blast with coordinate information like this

1       gnl|BL_ORD_ID|0 100.00  33      0       0       1        3
27620   gnl|BL_ORD_ID|0 95.65   46      2       0       1       46
35296   gnl|BL_ORD_ID|0 90.91   44      4       0       3       46
35973   gnl|BL_ORD_ID|0 100.00  45      0       0       1       45
41219   gnl|BL_ORD_ID|0 100.00  27      0       0       1       27
46914   gnl|BL_ORD_ID|0 100.00  45      0       0       1       45 

and a file 1.fasta with sequence information like this

>1
TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>2
GCATCTGGGCTACGGGATCAGCTAGGCGATGCGAC
...
>100000
TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTG

I am searching now a script that takes from 1.blast the first column and extracts those sequence IDs (=first column $1) plus sequence and then from the sequence itself all but those positions between $7 and $8 from the 1.fasta file, meaning from the first two matches the output would be

>1
ACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>27620
GTAGATAGAGATAGAGAGAGAGAGGGGGGAGA
...

(please notice that the first three entries from >1 are not in this sequence)

The IDs are consecutive, meaning I can extract the required information like this:

awk '{print 2*$1-1, 2*$1, $7, $8}' 1.blast

This gives me then a matrix that contains in the first column the right sequence identifier row, in the second column the right sequence row (= one after the ID row) and then the two coordinates that should be excluded. So basically a matrix that contains all required information which elements from 1.fasta shall be extracted

Unfortunately I do not have too much experience with scripting, hence I am now a bit lost, how to I feed the values e.g. in the suitable sed command? I can get specific rows like this:

sed -n 3,4p 1.fasta

and the string that I want to remove e.g. via

sed -n 5p 1.fasta | awk '{print substr($0,2,5)}'

But my problem is now, how can I pipe the information from the first awk call into the other commands so that they extract the right rows and remove from the sequence rows then the given coordinates. So, substr isn't the right command, I would need a command remstr(string,start,stop) that removes everything between these two positions from a given string, but I think that I could do in an own script. Especially the correct piping is a problem here for me.


Solution

  • As either thunk and msw have pointed out, more suitable tools are available for this kind of task but here you have a script that can teach you something about how to handle it with awk:

    Content of script.awk:

    ## Process first file from arguments.
    FNR == NR {
            ## Save ID and the range of characters to remove from sequence.
            blast[ $1 ] = $(NF-1) " " $NF
            next
    }
    
    ## Process second file. For each FASTA id...
    $1 ~ /^>/ {
            ## Get number.
            id = substr( $1, 2 )
    
            ## Read next line (the sequence).
            getline sequence
    
            ## if the ID is one found in the other file, get ranges and
            ## extract those characters from sequence.
            if ( id in blast ) {
                    split( blast[id], ranges )
                    sequence = substr( sequence, 1, ranges[1] - 1 ) substr( sequence, ranges[2] + 1 )
                    ## Print both lines with the shortened sequence.
                    printf "%s\n%s\n", $0, sequence
            }
    
    }
    

    Assuming your 1.blasta of the question and a customized 1.fasta to test it:

    >1
    TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
    >2
    GCATCTGGGCTACGGGATCAGCTAGGCGATGCGAC
    >27620
    TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTGTTTGCGA 
    

    Run the script like:

    awk -f script.awk 1.blast 1.fasta
    

    That yields:

    >1
    ACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
    >27620
    TTTGCGA
    

    Of course I'm assumming some things, the most important that fasta sequences are not longer than one line.