I would like to extract the transcript ID and gene symbol from the headers of an RNA FASTA file and write them to a text file, with the transcript ID in the first column and the gene symbol in the second.
An example of the header:
>NM_001001258.1 Sus scrofa ATPase H+/K+ transporting beta subunit (ATP4B)
>XM_001924668.4 PREDICTED: Sus scrofa XK related 9 (XKR9), transcript variant X1, mRNA
I have been able to print the transcript ID to a text file:
grep "^>" GCF_000003025.6_Sscrofa11.1_rna.fna | tr -d '>' | awk '{print $1}' > test.txt
I have also been able to print the gene symbol to a text file:
grep "^>" GCF_000003025.6_Sscrofa11.1_rna.fna | awk -F'[()]' '{print $2}' > test.txt
I just was wondering if anybody could help me with combining this into one step to get a single file. I know I could just combine files, but I want to be sure that the IDs are coming from the same lines.
Using sed:
sed -rn '/^>/ s/^>([^ ]+).*\(([^)]+).*/\1 \2/p' GCF_000003025.6_Sscrofa11.1_rna.fna
XM_001924668.4 XKR9
NM_001001258.1 ATP4B
Here, the leading /^>/ tells sed to act only on lines that match it, i.e. the header lines. The two strings of interest are then captured in parenthesised groups and referred to as \1 and \2, which is called back-referencing.
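For comparison, here is a minimal awk sketch that merges the two commands from the question into one pass, so both columns are guaranteed to come from the same header line. It is not part of the sed approach above, just an alternative. The file names sample_rna.fna and id_to_symbol.txt and the dummy sequence lines are made up for illustration, and the sketch assumes the gene symbol is the last parenthesised group in the header and that the parentheses are not nested.

```shell
# Hypothetical sample file standing in for the real *_rna.fna file;
# the sequence lines are dummy placeholders, not real sequences.
cat > sample_rna.fna <<'EOF'
>NM_001001258.1 Sus scrofa ATPase H+/K+ transporting beta subunit (ATP4B)
ACGTACGTACGT
>XM_001924668.4 PREDICTED: Sus scrofa XK related 9 (XKR9), transcript variant X1, mRNA
ACGTACGTACGT
EOF

# Split each header on "(" and ")". The first field then starts with the
# transcript ID, and the second-to-last field is the last parenthesised group.
# NF >= 3 skips headers that contain no parenthesised group at all.
awk -F'[()]' -v OFS='\t' '/^>/ && NF >= 3 {
    split($1, a, " ")                 # a[1] is ">" followed by the transcript ID
    print substr(a[1], 2), $(NF-1)    # drop the leading ">" and print both columns
}' sample_rna.fna > id_to_symbol.txt
```

The output is tab-separated, which makes it easy to load elsewhere. To run it on the real data, replace sample_rna.fna with GCF_000003025.6_Sscrofa11.1_rna.fna.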