awk bioinformatics gff

Changing column values in a tab-delimited file with awk without changing values in other columns


My file looks like this:

1-0039.1        EMBL    transcript      1       1524    .       +       .       transcript_id "1-0039.1.2"; gene_id "1-0039.1.2"; gene_name "dnaA"
1-0039.1        EMBL    CDS     1       1524    .       +       0       transcript_id "1-0039.1.2"; gene_name "dnaA";
1-0039.1        EMBL    transcript      1646    1972    .       +       .       transcript_id "1-0039.1.5"; gene_id "1-0039.1.5"; gene_name "ORF0009"

I want to change all "1-0039.1" values in the first column to 1.

So I tried:

awk -v OFS='\t' '{$1="1"; print}' 1-0039.gtf > 1-0039_modified.gtf

and the output looks like this:

1       EMBL    transcript      1       1524    .       +       .       transcript_id   "1-0039.1.2";   gene_id "1-0039.1.2";   gene_name       "dnaA"
1       EMBL    CDS     1       1524    .       +       0       transcript_id   "1-0039.1.2";   gene_name       "dnaA";
1       EMBL    transcript      1646    1972    .       +       .       transcript_id   "1-0039.1.5";   gene_id "1-0039.1.5";   gene_name       "ORF0009"
1       EMBL    CDS     1646    1972    .       +       0       transcript_id   "1-0039.1.5";   gene_name       "ORF0009";
1       EMBL    transcript      2023    2940    .       +       .       transcript_id   "1-0039.1.7";   gene_id "1-0039.1.7";   gene_name       "ORF0586"
1       EMBL    CDS     2023    2940    .       +       0       transcript_id   "1-0039.1.7";   gene_name       "ORF0586";
1       EMBL    transcript      2897    3223    .       +       .       transcript_id   "1-0039.1.9";   gene_id "1-0039.1.9";   gene_name       "ORF0009"

As you can see, the values in the last column were space-separated, but now they are tab-separated. How do I change only the first column without messing up the other columns?


Solution

  • awk '{sub(/^1-0039\.1/,1); print}'  1-0039.gtf > 1-0039_modified.gtf
    

    But the sed solutions in the comments will do the same job faster.
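The comments are not reproduced here, so the following is only a sketch of what such a sed command could look like, not the commenters' exact wording. The sample line is made up from the question's data, and it assumes the columns are separated by tabs:

```shell
# One sample record: tab-separated columns, spaces inside the attribute column.
sample=$(printf '1-0039.1\tEMBL\tCDS\t1\t1524\t.\t+\t0\ttranscript_id "1-0039.1.2"; gene_name "dnaA";')

# Replace the literal text "1-0039.1" only at the start of the line.
# The dots are escaped so they match a real "." and not any character.
result=$(printf '%s\n' "$sample" | sed 's/^1-0039\.1/1/')

printf '%s\n' "$result"
```

Because the pattern is anchored with ^, the "1-0039.1.2" values inside the attribute column are left alone.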

    Annotation:

    Unfortunately, the question gives contradictory information:

    1. The sample appears to have space-separated fields with a varying number of spaces.
    2. You write about tabs between the fields and want to keep the spaces in the last column.

    The same appearance can be produced by tab separation, using one tab per field, when tabs are rendered at a width of 8.

    So the solution has to cope with this ambiguity.

    That is why my solution does not use awk's field splitting but only matches the pattern at the start of the line, where the first column sits.

    This way the solution does not rely on any assumption about the delimiter in order to work properly. The delimiter can be of any type and count, and the solution will still do the job.
    In particular, it will not change the existing column delimiter(s).
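To make the difference concrete, here is a small side-by-side sketch. It assumes a sample line built from the question's data with tabs between columns. Assigning to $1 makes awk rebuild the whole record from its fields using OFS, while sub() without a target edits $0 in place:

```shell
# One sample record: tabs between columns, spaces inside the last column.
line=$(printf '1-0039.1\tEMBL\tCDS\t1\t1524\t.\t+\t0\ttranscript_id "1-0039.1.2"; gene_name "dnaA";')

# Field assignment: the default FS splits on any whitespace, and the record
# is rejoined with OFS, so the spaces in the last column turn into tabs.
rebuilt=$(printf '%s\n' "$line" | awk -v OFS='\t' '{$1="1"; print}')

# sub() on the whole record: only the matched text changes,
# all existing delimiters stay exactly as they were.
kept=$(printf '%s\n' "$line" | awk '{sub(/^1-0039\.1/,"1"); print}')

printf '%s\n' "$rebuilt" "$kept"
```

Piping either result through `cat -A` (GNU) or `od -c` shows where the tabs actually are.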


    Thanks for the comments below. They have a point, but keeping it simple to understand was my first priority.

    So here is an alternative version that is more flexible about the first column:

    awk '{sub(/^1-[^ \t]*/,1); print}'  1-0039.gtf > 1-0039_modified.gtf
    

    Because this variant stops at the first space, which might not actually be a delimiter, the following version accepts a single optional space directly after the "1-" prefix as part of the first column:

    awk '{sub(/^1- ?[^ \t]*/,1); print}'   1-0039.gtf > 1-0039_modified.gtf
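If you are willing to assume that the file really is tab-delimited (which the answer above deliberately does not assume), field splitting can still be used safely by setting both FS and OFS to a tab. This is a different technique from the sub() approach, shown only as a sketch on a sample line taken from the question:

```shell
# One sample record: tabs between columns, spaces inside the last column.
line=$(printf '1-0039.1\tEMBL\tCDS\t1\t1524\t.\t+\t0\ttranscript_id "1-0039.1.2"; gene_name "dnaA";')

# Split and rejoin on tabs only, so spaces inside a column are never touched.
# The first field is rewritten only when it is exactly "1-0039.1".
out=$(printf '%s\n' "$line" | awk 'BEGIN{FS=OFS="\t"} $1=="1-0039.1"{$1="1"} {print}')

printf '%s\n' "$out"
```

The exact string comparison also avoids touching a longer name such as "1-0039.10", which a prefix match would shorten.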