shell unix grep fasta

Print transcript ID and gene symbol from rna fasta to new text file


I would like to extract the transcript ID and gene symbol from the headers of an RNA FASTA file and write them to a text file, with the transcript ID in the first column and the gene symbol in the second.

An example of the header:

>NM_001001258.1 Sus scrofa ATPase H+/K+ transporting beta subunit (ATP4B)
>XM_001924668.4 PREDICTED: Sus scrofa XK related 9 (XKR9), transcript variant X1, mRNA

I have been able to print the transcript ID to a text file:

grep "^>" GCF_000003025.6_Sscrofa11.1_rna.fna | tr -d '>' | awk '{print $1}' > test.txt

I have also been able to print the gene symbol to a text file:

grep "^>" GCF_000003025.6_Sscrofa11.1_rna.fna | awk -F'[()]' '{print $2}' > test.txt

I was wondering whether anybody could help me combine these into one step that produces a single file. I know I could just paste the two files together, but I want to be sure that each ID and gene symbol come from the same header line.


Solution

  • Using sed:

    sed -rn '/^>/ s/^>([^ ]+).*\(([^)]+).*/\1 \2/p' GCF_000003025.6_Sscrofa11.1_rna.fna
    XM_001924668.4 XKR9
    NM_001001258.1 ATP4B
    

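    Here, the leading /^>/ address tells sed to act only on header lines. The substitution then captures the two strings of interest, the transcript ID (everything after > up to the first space) and the text inside the last pair of parentheses, and refers to them as \1 and \2, which is called back-referencing. The -n option together with the p flag prints only the lines where the substitution succeeded. The two output lines shown are for the two example headers.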
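The approach above can be tried on a small test file before running it on the full transcriptome. The following is a minimal sketch, assuming a temporary file built from the two example headers; the sequence lines and the variable names `tmp`, `sed_out`, and `awk_out` are made up for illustration. It uses `-E`, which both GNU and BSD sed accept in place of `-r`, and also shows an awk variant based on the field separator from the question.

```shell
# Hypothetical sample file: the two headers from the question plus
# placeholder sequence lines (not real sequence data).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>NM_001001258.1 Sus scrofa ATPase H+/K+ transporting beta subunit (ATP4B)
ATGGCAGCC
>XM_001924668.4 PREDICTED: Sus scrofa XK related 9 (XKR9), transcript variant X1, mRNA
ATGAAATAC
EOF

# sed: act on header lines only, capture the ID and the text inside
# the last "(", and print "ID SYMBOL" when the substitution succeeds.
sed_out=$(sed -En '/^>/ s/^>([^ ]+).*\(([^)]+).*/\1 \2/p' "$tmp")

# awk: split each header on "(" or ")"; $2 is then the text inside the
# FIRST pair of parentheses, and the ID is the first word of $1
# with the leading ">" removed.
awk_out=$(awk -F'[()]' '/^>/ { split($1, a, " "); print substr(a[1], 2), $2 }' "$tmp")

printf '%s\n' "$sed_out"
rm -f "$tmp"
```

Both commands read the ID and the symbol from the same header line, so the columns cannot drift out of step. They differ when a header contains more than one parenthesised group: the sed pattern keeps the last group, while the awk version keeps the first, so it is worth checking a few such headers in the real file before choosing one.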