I am trying to edit a file which has this format:
field1 field2 field3 gene_id "xxxxx"; transcript_id "XM_xxxxxxxx.x"; db_xref "GeneID:102885392"; exon_number "1";
I would like as output:
field1 field2 field3 exon_number "1";
I am using awk to do it, but I failed to print the last part of the last field after splitting it. Here is my code:
awk '{split($4,a,";"); print ($1, $2,$3, a[$NF])}' input
I know a[$NF] is not working, but how do I refer to the last subfield? Is it the last element of the array? (In my file, exon_number is not always the 5th element, but it is always the last one.)
exon_number "1" is your 2nd-last ;-separated subfield, not your last one, since there's a null string after the last ; you're splitting on.
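A quick way to see that empty trailing element is a check along these lines. It is only a sketch: it assumes the 4th field looks exactly like the sample in the question, splits on a bare ; as the original attempt did, and uses illustrative variable names (attrs, result).

```shell
# Sample 4th field, copied from the example line in the question (assumed layout).
attrs='gene_id "xxxxx"; transcript_id "XM_xxxxxxxx.x"; db_xref "GeneID:102885392"; exon_number "1";'

# split() returns the number of elements; print that count, the last element,
# and the second-to-last element so the empty trailing piece is visible.
result=$(printf '%s\n' "$attrs" |
  awk '{n=split($0,a,";"); printf "n=%d last=[%s] prev=[%s]\n", n, a[n], a[n-1]}')
echo "$result"
# prints: n=5 last=[] prev=[ exon_number "1"]
```

Note the leading space in a[n-1] when splitting on a bare ; which is why it helps to let the separator regexp also absorb the surrounding whitespace.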
awk 'BEGIN{FS=OFS="\t"} {n=split($4,a,/[[:space:]]*;[[:space:]]*/); print $1, $2, $3, a[n-1]";"}' input
or:
awk 'BEGIN{FS=OFS="\t"} {n=split($4,a,/[[:space:]]*;[[:space:]]*/); $4=a[n-1]";"; print}' input
See split() at https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions
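To try the first command without a real file, something like the following should work. It is a minimal sketch that assumes the input is tab-separated (that is what FS=OFS="\t" expects) and builds a single sample line from the question; the variable names line and out are just for illustration.

```shell
# Build one tab-separated sample line (placeholder field values from the question).
line=$(printf 'field1\tfield2\tfield3\tgene_id "xxxxx"; transcript_id "XM_xxxxxxxx.x"; db_xref "GeneID:102885392"; exon_number "1";')

# Same awk program as the first command above, reading from stdin instead of a file.
out=$(printf '%s\n' "$line" |
  awk 'BEGIN{FS=OFS="\t"} {n=split($4,a,/[[:space:]]*;[[:space:]]*/); print $1, $2, $3, a[n-1]";"}')
printf '%s\n' "$out"
```

The output fields are joined with tabs because OFS is set to a tab. If your file is separated by spaces instead, the attributes will not all land in $4 and the field separator would need adjusting.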